GSS Mahadevan's Blog: 2010

Saturday, July 24, 2010

My Hawkboard Running Angstrom Linux

I recently bought a Hawkboard ( http://www.hawkboard.org/ ) to set up projects for my nephew. I bought the board from http://www.idasystems.net for Rs. 5400/- (including shipping cost to Bangalore).

Installed Angstrom Linux; 'uname -a' reports the following kernel:
Linux hawkboard 2.6.32-rc6 #5 PREEMPT Fri May 21 10:37:37 IST 2010 armv5tejl unknown

Booted the device using a USB stick for the root filesystem.

Used a Baffo-USB serial port and a crude null-modem cable, built using the info from: http://www.lammertbies.nl/comm/info/RS-232_null_modem.html

Made the board always boot from NAND by following the commands from these pages:

Booting to Linux
Booting to Linux from NAND

My U-Boot commands for loading the uImage from NAND at boot (stored in the environment with saveenv) are:

setenv bootargs console=ttyS2,115200n8 noinitrd root=/dev/sda1 rootwait rw init=/sbin/init
setenv bootcmd 'setenv bootargs $bootargs;nand read.e 0xc0700000 0x200000 0x200000; bootm c0700000'
saveenv

hawkboard.org > printenv
bootdelay=3
baudrate=115200
bootfile="uImage"
ethaddr=0a:c1:a8:12:fa:c0
filesize=1DBD4C
fileaddr=C0700000
ipaddr=192.168.1.220
serverip=192.168.1.200
bootargs=console=ttyS2,115200n8 noinitrd root=/dev/sda1 rootwait rw init=/sbin/init
bootcmd=setenv bootargs $bootargs;nand read.e 0xc0700000 0x200000 0x200000; bootm c0700000
stdin=serial
stdout=serial
stderr=serial
ver=U-Boot 2009.01 (Dec 22 2009 - 10:04:02)

Environment size: 410/131068 bytes
hawkboard.org >
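
The bootcmd above uses nand read.e to copy 0x200000 bytes from NAND offset 0x200000 into RAM at 0xc0700000 before calling bootm. As a quick sanity check, here is a minimal Java sketch that compares the filesize value from printenv with that read size. It assumes filesize still refers to the uImage that was written to NAND; the class and method names are only for illustration.

```java
public class NandReadCheck {
    /** Parses a hex string such as "1DBD4C" or "0x200000" into a byte count. */
    public static long parseHex(String s) {
        String digits = s.toLowerCase().startsWith("0x") ? s.substring(2) : s;
        return Long.parseLong(digits, 16);
    }

    /** True when an image of imageBytes fits inside a read window of readBytes. */
    public static boolean fits(long imageBytes, long readBytes) {
        return imageBytes <= readBytes;
    }

    public static void main(String[] args) {
        long imageBytes = parseHex("1DBD4C");   // filesize from printenv
        long readBytes = parseHex("0x200000");  // size argument of nand read.e
        System.out.printf("uImage: %d bytes, read window: %d bytes, headroom: %d bytes, fits: %b%n",
                imageBytes, readBytes, readBytes - imageBytes, fits(imageBytes, readBytes));
    }
}
```

With these values the image is 1,949,004 bytes and the read window is 2,097,152 bytes, so a noticeably larger kernel would need a bigger size argument in bootcmd.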

Wednesday, May 12, 2010

MongoDB Bench Marking (Java) - I

To see how fast MongoDB is for large-scale number-crunching work, I am fiddling with MongoDB (among other alternatives). The idea is to see how fast data can be inserted into MongoDB.

I wrote preliminary C++/Java versions and am trying out optimizations in the Java driver.

For the test I generated a simple schema with 6 strings as key-value pairs and 250 numbers in an array.
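
To make the record shape concrete, here is a rough Java sketch that builds one such record as a plain map. It does not use the MongoDB driver; the field names ("key1" to "key6", "values"), the placeholder data, and the choice of doubles for the numbers are all assumptions for illustration, not the actual benchmark code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordShape {
    /** Builds one test record: 6 string fields plus an array of 250 numbers. */
    public static Map<String, Object> buildRecord(int id) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (int k = 1; k <= 6; k++) {
            doc.put("key" + k, "value-" + id + "-" + k); // hypothetical field names
        }
        List<Double> values = new ArrayList<>(250);
        for (int n = 0; n < 250; n++) {
            values.add((double) (id + n)); // placeholder data; doubles are an assumption
        }
        doc.put("values", values);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = buildRecord(0);
        System.out.println("fields: " + doc.size());
        System.out.println("array length: " + ((List<?>) doc.get("values")).size());
    }
}
```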

Test setup: AMD Phenom II X4 965, 4 GB DDR3 RAM, 512 GB SATA HDD, Fedora 12 Linux

The maximum speed I could achieve with the Java version (driver 1.4) is 1,000,000 records in 26.5 seconds, which works out to about 37,735 records/sec.

The total disk size for the above data is 2,489 MB, which works out to about 93.9 MB/sec.

My SATA HDD sustains about 120 MB/sec, as measured with the Unix dd command.
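
The rates above are simple divisions; a small Java sketch of the arithmetic follows. It assumes the 2,489 MB were written during the same 26.5 seconds as the inserts, and the class name is only for illustration.

```java
public class InsertThroughput {
    /** Returns amount per second for a run that took the given number of seconds. */
    public static double perSecond(double amount, double seconds) {
        return amount / seconds;
    }

    public static void main(String[] args) {
        double records = 1_000_000;  // records inserted
        double seconds = 26.5;       // wall-clock time of the run
        double dataMb = 2_489;       // on-disk size of the data
        double diskMbPerSec = 120;   // sequential rate measured with dd

        double recordsPerSec = perSecond(records, seconds);
        double mbPerSec = perSecond(dataMb, seconds);

        System.out.printf("records/sec: %.1f%n", recordsPerSec);
        System.out.printf("MB/sec: %.1f%n", mbPerSec);
        System.out.printf("share of dd rate: %.1f%%%n", 100.0 * mbPerSec / diskMbPerSec);
    }
}
```

By this measure the insert run reaches roughly 78% of the sequential rate measured with dd.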

The original MongoDB driver for Java (both the 1.4 and 2.0.rc.X releases) has some performance issues. After about 2 days of hacking, performance reached the level above (a 3x improvement). I will shortly publish the changed code at the MongoDB site.

Will shortly publish more code/benchmarking data :)

Friday, April 30, 2010

Finally ATI Radeon HD5750, CAL, OpenCL worked in Fedora 12

After lots of trials from Dec '09 until now, I am finally able to set up an ATI Radeon HD5750 card in Fedora 12 for both the X server and OpenCL.

Installed the latest Catalyst version from AMD/ATI: ati-driver-installer-10-4-x86.x86_64.run
Installed the following Fedora 12 packages (needed to compile the Catalyst kernel module):

[root@yyyy x86_64]# rpm -qa | grep kernel | grep 2.6.32.11-99
kernel-2.6.32.11-99.fc12.x86_64
kernel-devel-2.6.32.11-99.fc12.x86_64
kernel-firmware-2.6.32.11-99.fc12.noarch

After installing the above, ran the Catalyst setup and built the kernel modules.
After compiling the kernel modules (and blacklisting the kernel modules specified in the docs), ran aticonfig.

Here is my setup info

uname

Linux phenom1.localdomain 2.6.32.11-99.fc12.x86_64 #1 SMP Mon Apr 5 19:59:38 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

ATI Stream SDK version

/opt/ati-stream-sdk-v2.01-rhel64

X Server Info

X.Org X Server 1.7.4
xorg-x11-server 1.7.4-6.fc12

ATI Kernel Module

[root@yyyy x86_64]# lsmod | grep fg
fglrx 2349663 32

$ cat /etc/X11/xorg.conf

Section "ServerLayout"
        Identifier     "aticonfig Layout"
        Screen      0  "aticonfig-Screen[0]-0" 0 0
EndSection

Section "Files"
EndSection

Section "Module"
EndSection

Section "Monitor"
        Identifier   "aticonfig-Monitor[0]-0"
        Option      "VendorName" "ATI Proprietary Driver"
        Option      "ModelName" "Generic Autodetecting Monitor"
        Option      "DPMS" "true"
EndSection

Section "Device"
        Identifier  "Videocard0"
        Driver      "vesa"
EndSection

Section "Device"
        Identifier  "aticonfig-Device[0]-0"
        Driver      "fglrx"
        BusID       "PCI:1:0:0"
EndSection

Section "Screen"
        Identifier "aticonfig-Screen[0]-0"
        Device     "aticonfig-Device[0]-0"
        Monitor    "aticonfig-Monitor[0]-0"
        DefaultDepth     24
        SubSection "Display"
                Viewport   0 0
                Depth     24
        EndSubSection
EndSection

CLInfo output

[root@yyyy x86_64]#./CLInfo 
Number of platforms:                             1
  Plaform Profile:                               FULL_PROFILE
  Plaform Version:                               OpenCL 1.0 ATI-Stream-v2.0.1
  Plaform Name:                                  ATI Stream                  
  Plaform Vendor:                                Advanced Micro Devices, Inc.
  Plaform Extensions:                    cl_khr_icd                          


  Plaform Name:                                  ATI Stream
Number of devices:                               2         
  Device Type:                                   CL_DEVICE_TYPE_CPU
  Device ID:                                     4098              
  Max compute units:                             4                 
  Max work items dimensions:                     3                 
    Max work items[0]:                           1024              
    Max work items[1]:                           1024              
    Max work items[2]:                           1024              
  Max work group size:                           1024              
  Preferred vector width char:                   16                
  Preferred vector width short:                  8                 
  Preferred vector width int:                    4                 
  Preferred vector width long:                   2                 
  Preferred vector width float:                  4                 
  Preferred vector width double:                 0                 
  Max clock frequency:                           3400Mhz           
  Address bits:                                  64                
  Max memeory allocation:                        1073741824        
  Image support:                                 No                
  Max size of kernel argument:                   4096              
  Alignment (bits) of base address:              32768             
  Minimum alignment (bytes) for any datatype:    128               
  Single precision floating point capability                       
    Denorms:                                     Yes               
    Quiet NaNs:                                  Yes               
    Round to nearest even:                       Yes               
    Round to zero:                               No                
    Round to +ve and infinity:                   No                
    IEEE754-2008 fused multiply-add:             No                
  Cache type:                                    Read/Write        
  Cache line size:                               64                
  Cache size:                                    65536             
  Global memory size:                            3221225472        
  Constant buffer size:                          65536             
  Max number of constant args:                   8                 
  Local memory type:                             Global            
  Local memory size:                             32768             
  Profiling timer resolution:                    1                 
  Device endianess:                              Little            
  Available:                                     Yes               
  Compiler available:                            Yes               
  Execution capabilities:                                          
    Execute OpenCL kernels:                      Yes               
    Execute native function:                     No                
  Queue properties:                                                
    Out-of-Order:                                No                
    Profiling :                                  Yes               
  Platform ID:                                   0x7f91992dd4a8    
  Name:                                          AMD Phenom(tm) II X4 965 Processor
  Vendor:                                        AuthenticAMD                      
  Driver version:                                1.0                               
  Profile:                                       FULL_PROFILE                      
  Version:                                       OpenCL 1.0 ATI-Stream-v2.0.1      
  Extensions:                                    cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store                                                                                                               
  Device Type:                                   CL_DEVICE_TYPE_GPU                                                                                                                                  
  Device ID:                                     4098                                                                                                                                                
  Max compute units:                             9                                                                                                                                                   
  Max work items dimensions:                     3                                                                                                                                                   
    Max work items[0]:                           256                                                                                                                                                 
    Max work items[1]:                           256                                                                                                                                                 
    Max work items[2]:                           256                                                                                                                                                 
  Max work group size:                           256                                                                                                                                                 
  Preferred vector width char:                   16                                                                                                                                                  
  Preferred vector width short:                  8                                                                                                                                                   
  Preferred vector width int:                    4                                                                                                                                                   
  Preferred vector width long:                   2                                                                                                                                                   
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Max clock frequency:                           700Mhz
  Address bits:                                  32
  Max memeory allocation:                        268435456
  Image support:                                 No
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               No
    Round to +ve and infinity:                   No
    IEEE754-2008 fused multiply-add:             No
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            268435456
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             32768
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   0x7f91992dd4a8
  Name:                                          Juniper
  Vendor:                                        Advanced Micro Devices, Inc.
  Driver version:                                CAL 1.4.635
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.0 ATI-Stream-v2.0.1
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics


[root@yyyy x86_64]
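
The memory figures in the CLInfo output are raw byte counts. A short Java sketch to convert them to MiB for easier comparison between the CPU and GPU devices (the values are copied from the output above; the class name is only for illustration):

```java
public class ClInfoSizes {
    /** Converts a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes). */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long cpuGlobal = 3221225472L;   // CPU "Global memory size"
        long cpuMaxAlloc = 1073741824L; // CPU "Max memeory allocation"
        long gpuGlobal = 268435456L;    // GPU "Global memory size"
        long gpuMaxAlloc = 268435456L;  // GPU "Max memeory allocation"

        System.out.println("CPU global memory:  " + toMiB(cpuGlobal) + " MiB");
        System.out.println("CPU max allocation: " + toMiB(cpuMaxAlloc) + " MiB");
        System.out.println("GPU global memory:  " + toMiB(gpuGlobal) + " MiB");
        System.out.println("GPU max allocation: " + toMiB(gpuMaxAlloc) + " MiB");
    }
}
```

So this OpenCL runtime exposes 256 MiB of global memory on the GPU device, against 3072 MiB on the CPU device.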

Wednesday, April 21, 2010

Optimum global/local work size for a given OpenCL kernel

As part of an ongoing study to find better global/local work size values for OpenCL kernels, here is a small Java program that uses the JavaCL OpenCL library.

The program checks all combinations of global/local sizes.

The global size ranges from GMIN to GMAX.

The local size ranges from LMIN to LMAX, in steps of 1.

On every iteration, the global size is multiplied by 2.

The time in microseconds for each run is reported as "computed in".

The time in nanoseconds per entry is reported as "ns-per-entry".
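
The sweep can be previewed without any OpenCL calls. The sketch below lists the global/local pairs such a sweep would run, doubling the global size and stepping the local size by 1. One addition that is not part of the benchmark program: it skips pairs where the global size is not a multiple of the local size, because OpenCL 1.0 rejects such launches with CL_INVALID_WORK_GROUP_SIZE. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SweepPlan {
    /** Returns {global, local} pairs: global doubles from gmin to gmax, local steps by 1. */
    public static List<int[]> combinations(int gmin, int gmax, int lmin, int lmax) {
        if (gmin <= 0 || lmin <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<int[]> pairs = new ArrayList<>();
        // long loop variable so that doubling cannot overflow int
        for (long g = gmin; g <= gmax; g *= 2) {
            for (int l = lmin; l <= lmax; l++) {
                if (g % l == 0) { // local size must evenly divide global size
                    pairs.add(new int[] {(int) g, l});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] p : combinations(16, 64, 1, 4)) {
            System.out.printf("global=%d local=%d%n", p[0], p[1]);
        }
    }
}
```

With the default range (GMIN=16, GMAX=65536, LMIN=1, LMAX=2) every pair is valid, which gives 13 global sizes times 2 local sizes, or 26 runs.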

Observations:

Usage: java [-DGMIN=16] [-DGMAX=65536] [-DLMIN=1] [-DLMAX=2] [-DDEBUG=false] com.nativelibs4java.opencl.demos.NDRange2
#Global:      16: Local:  1: computed in :     22477 microsec: entries:        16: ns-per-entry:   1404822
#Global:      16: Local:  2: computed in :      4207 microsec: entries:        16: ns-per-entry:    262987
#Global:      32: Local:  1: computed in :      4172 microsec: entries:        32: ns-per-entry:    130384
#Global:      32: Local:  2: computed in :      4194 microsec: entries:        32: ns-per-entry:    131065
#Global:      64: Local:  1: computed in :      4070 microsec: entries:        64: ns-per-entry:     63603
#Global:      64: Local:  2: computed in :      6431 microsec: entries:        64: ns-per-entry:    100497
#Global:     128: Local:  1: computed in :      4863 microsec: entries:       128: ns-per-entry:     37993
#Global:     128: Local:  2: computed in :      4537 microsec: entries:       128: ns-per-entry:     35446
#Global:     256: Local:  1: computed in :      4079 microsec: entries:       256: ns-per-entry:     15936
#Global:     256: Local:  2: computed in :      7222 microsec: entries:       256: ns-per-entry:     28211
#Global:     512: Local:  1: computed in :      4155 microsec: entries:       512: ns-per-entry:      8116
#Global:     512: Local:  2: computed in :      4095 microsec: entries:       512: ns-per-entry:      7999
#Global:    1024: Local:  1: computed in :      4194 microsec: entries:      1024: ns-per-entry:      4095
#Global:    1024: Local:  2: computed in :      8201 microsec: entries:      1024: ns-per-entry:      8009
#Global:    2048: Local:  1: computed in :      4528 microsec: entries:      2048: ns-per-entry:      2211
#Global:    2048: Local:  2: computed in :      4173 microsec: entries:      2048: ns-per-entry:      2037
#Global:    4096: Local:  1: computed in :      4428 microsec: entries:      4096: ns-per-entry:      1081
#Global:    4096: Local:  2: computed in :      9895 microsec: entries:      4096: ns-per-entry:      2415
#Global:    8192: Local:  1: computed in :      4933 microsec: entries:      8192: ns-per-entry:       602
#Global:    8192: Local:  2: computed in :      5058 microsec: entries:      8192: ns-per-entry:       617
#Global:   16384: Local:  1: computed in :      5595 microsec: entries:     16384: ns-per-entry:       341
#Global:   16384: Local:  2: computed in :     10664 microsec: entries:     16384: ns-per-entry:       650
#Global:   32768: Local:  1: computed in :      7050 microsec: entries:     32768: ns-per-entry:       215
#Global:   32768: Local:  2: computed in :      5615 microsec: entries:     32768: ns-per-entry:       171
#Global:   65536: Local:  1: computed in :     10011 microsec: entries:     65536: ns-per-entry:       152
#Global:   65536: Local:  2: computed in :     13677 microsec: entries:     65536: ns-per-entry:       208
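
To pick out the best configuration from output like this, the rows can be parsed and ranked by ns-per-entry. The following is a minimal Java sketch; the class name is hypothetical, and only three of the rows above are embedded as sample input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BestConfig {
    // Matches one result row as printed by the benchmark's printf format.
    private static final Pattern ROW = Pattern.compile(
            "#Global:\\s*(\\d+): Local:\\s*(\\d+): computed in :\\s*(\\d+) microsec: "
            + "entries:\\s*(\\d+): ns-per-entry:\\s*(\\d+)");

    /** Returns {global, local, nsPerEntry} for the row with the smallest ns-per-entry, or null. */
    public static long[] best(String[] lines) {
        long[] best = null;
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (!m.find()) {
                continue; // skip usage text and anything that is not a result row
            }
            long ns = Long.parseLong(m.group(5));
            if (best == null || ns < best[2]) {
                best = new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2)), ns};
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] sample = {
            "#Global:   32768: Local:  2: computed in :      5615 microsec: entries:     32768: ns-per-entry:       171",
            "#Global:   65536: Local:  1: computed in :     10011 microsec: entries:     65536: ns-per-entry:       152",
            "#Global:   65536: Local:  2: computed in :     13677 microsec: entries:     65536: ns-per-entry:       208",
        };
        long[] b = best(sample);
        System.out.printf("best: global=%d local=%d ns-per-entry=%d%n", b[0], b[1], b[2]);
    }
}
```

In the full table the lowest ns-per-entry is 152, at global size 65536 with local size 1. The total time stays around 4 to 5 ms for most of the smaller global sizes, which suggests that fixed per-launch overhead dominates there, and the very first row probably includes some one-time warm-up cost.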

Java Source:

package com.nativelibs4java.opencl.demos;
import static com.nativelibs4java.opencl.JavaCL.createBestContext;
import java.io.*;
import java.nio.*;
import com.nativelibs4java.opencl.*;
import com.nativelibs4java.util.*;
/* This class runs an OpenCL kernel in loops with various combinations of global-size and local-sizes.
 * By varying the global-size and local-size values, one can find out optimum values for global/local sizes
 * for a given kernel.
 * 
 *   @author GSS Mahadevan
 *  */
public class NDRange2 {
 private static final String PRG_NAME="ndrange2";
 private static final int ITEMS=8;// number of ints updated in this kernel
 private static final String usage="Usage: java [-DGMIN=16] [-DGMAX=65536] [-DLMIN=1] [-DLMAX=2] " +
   "[-DDEBUG=false] "+NDRange2.class.getName()+"\n";
 
 private static final String src = "__kernel void "+ PRG_NAME
         + "("
   + "   __global int* output                                             \n"
   + "   )                                           \n"
   + "{                                                                      \n"
   + "   int i = get_global_id(0)*8;                               \n"
   + "   output[i] = get_global_id(0);                                \n"
   + "   output[i+1] = get_global_size(0);                                \n"
   + "   output[i+2] = get_work_dim();                                \n"
   + "   output[i+3] = get_local_id(0);                                \n"
   + "   output[i+4] = get_local_size(0);                                \n"
   + "   output[i+5] = get_group_id(0);                                \n"
   + "   output[i+6] = get_num_groups(0);                                \n"
   + "   output[i+7] = 9999999;                                \n"
   + "}                                                                      \n"
   + "\n";
 private static final int GMIN = Integer.getInteger("GMIN", 16);
 private static final int GMAX = Integer.getInteger("GMAX", 65536);
 
 private static final int LMIN = Integer.getInteger("LMIN", 1);
 private static final int LMAX = Integer.getInteger("LMAX", 2);
 
 private static final boolean DEBUG = Boolean.parseBoolean(System.getProperty("DEBUG", "false"));
 
 private static final int G_SIZE_MAX = GMAX * ITEMS; // each work item writes ITEMS (8) ints

 private static IntBuffer output = NIOUtils.directInts(G_SIZE_MAX);
 private static IntBuffer output2 = NIOUtils.directInts(G_SIZE_MAX);
 
 public static class OCL{
  public final CLProgram program;
  public final CLQueue queue;
  public final CLContext context;
  public final CLKernel kernel;
  public OCL(String src,String kernelName) throws CLBuildException{
   SetupUtils.failWithDownloadProposalsIfOpenCLNotAvailable();
   context = createBestContext();
   queue = context.createDefaultQueue();
   program = context.createProgram(src).build();
   kernel = program.createKernel(kernelName);
  }
 }
 public static void main(String[] args) {
  System.out.println(usage);
  try {
   OCL ocl = new OCL(src,PRG_NAME);
   for(int g=GMIN;g <= GMAX; g *= 2){
    for(int l=LMIN;l <= LMAX; l++){
     for (int i = 0; i < G_SIZE_MAX; i++)
      output.put(i, Integer.MIN_VALUE);
     long time = executeKernel(ocl,output,  g, l);
     int count = 0;
     IntBuffer O = output2;
     for (int i = 0; i < G_SIZE_MAX; i++) {
      int v = O.get(i);
      if (v != Integer.MIN_VALUE) {
       count += 8;
       if(DEBUG) System.out.printf("gl_id:%8d(max:%8d), work_dim:%3d: lid:%2d(max:%2d): gr_id:%8d(max:%8d):junk:%8d\n",
                    v,O.get(i+1),O.get(i+2), O.get(i+3), O.get(i+4), O.get(i+5), O.get(i+6),O.get(i+7));
       i += 7;
      } 
     }
     System.out.printf("#Global:%8d: Local:%3d: computed in :%10d microsec: entries:%10d: ns-per-entry:%10d\n",
                     g,l, (time / 1000), count/ITEMS,(time/g));
    }
   }
  } catch (Exception e) {
   System.err.println(e);
   e.printStackTrace();
  }
 }

 private static long executeKernel(OCL ocl, IntBuffer out, int gsize, int lsize)
   throws IOException {
  long startTime = System.nanoTime();
  CLIntBuffer out1 = ocl.context.createIntBuffer(CLMem.Usage.Output, out,false);
  ocl.kernel.setArgs(out1);
  
  CLEvent kernelCompletion = ocl.kernel.enqueueNDRange(ocl.queue, new int[]{gsize},new int[]{lsize });
  kernelCompletion.waitFor();
  ocl.queue.finish();
  // Copy the OpenCL-hosted array back to RAM
  out1.read(ocl.queue, output2, true);
  long time = System.nanoTime() - startTime;
  return time;
 }
}

Information about OpenCL Global size and Local size dimensions

To understand more about the global/local work sizes passed to the OpenCL API clEnqueueNDRangeKernel, I wrote a small Java program using the nice JavaCL library (part of nativelibs4java) from Olivier Chafik. Some more links on NDRange:
Understanding NDRange

Java program

package com.nativelibs4java.opencl.demos;

import static com.nativelibs4java.opencl.JavaCL.createBestContext;
import java.io.*;
import java.nio.*;
import com.nativelibs4java.opencl.*;
import com.nativelibs4java.util.*;
/* Usage: java [-DGLOBAL=256] [-DLOCAL=4] [-DDEBUG=true] com.nativelibs4java.opencl.demos.NDRange1 */
public class NDRange1 {
 private static final String PRG_NAME="ndrange1";
 private static final int ITEMS=8;// number of ints updated in kernel
 
 private static final String src = "__kernel void "+ PRG_NAME
    + "("
 + "   __global int* output                                             \n"
 + "   )                                           \n"
 + "{                                                                      \n"
 + "   int i = get_global_id(0)*8;                               \n"
 + "   output[i] = get_global_id(0);                                \n"
 + "   output[i+1] = get_global_size(0);                                \n"
 + "   output[i+2] = get_work_dim();                                \n"
 + "   output[i+3] = get_local_id(0);                                \n"
 + "   output[i+4] = get_local_size(0);                                \n"
 + "   output[i+5] = get_group_id(0);                                \n"
 + "   output[i+6] = get_num_groups(0);                                \n"
 + "   output[i+7] = 9999999;                                \n"
 + "}                                                                      \n"
 + "\n";
 private static final int G_SIZE = Integer.getInteger("GLOBAL", 256);
 private static final int L_SIZE = Integer.getInteger("LOCAL", 4);
 private static final boolean DEBUG = Boolean.parseBoolean(System.getProperty("DEBUG", "true"));
 
 private static final int G_SIZE_MAX = G_SIZE * 128; // oversized just for safety; the kernel writes only 8 ints per work item

 private static IntBuffer output = NIOUtils.directInts(G_SIZE_MAX);
 private static IntBuffer output2 = NIOUtils.directInts(G_SIZE_MAX);
 public static void main(String[] args) {
  try {
   SetupUtils.failWithDownloadProposalsIfOpenCLNotAvailable();
   for (int i = 0; i < G_SIZE_MAX; i++)
    output.put(i, Integer.MIN_VALUE);

   long time = buildAndExecuteKernel(output, src, G_SIZE, L_SIZE);
   
   int count = 0;
   IntBuffer O = output2;
   for (int i = 0; i < G_SIZE_MAX; i++) {
    int v = O.get(i);
    if (v != Integer.MIN_VALUE) {
     count += 8;
     // the junk value is printed to check correctness
     if(DEBUG) System.out.printf("gl_id:%8d(max:%8d), work_dim:%3d: lid:%2d(max:%2d): gr_id:%8d(max:%8d):junk:%8d\n",
                  v,O.get(i+1),O.get(i+2), O.get(i+3), O.get(i+4), O.get(i+5), O.get(i+6),O.get(i+7));
     i += 7;
    } 
   }
   System.out.printf("#Global:%8d: Local:%3d: computed in :%10d microsec: entries:%10d: ns-per-entry:%10d\n",
                G_SIZE,L_SIZE, (time / 1000), count/ITEMS,(time/G_SIZE));
  } catch (Exception e) {
   System.err.println(e);
   e.printStackTrace();
  }
 }

 private static long buildAndExecuteKernel(IntBuffer out, String src, int gsize, int lsize)
   throws CLBuildException, IOException {
  CLContext context = createBestContext();
  CLQueue queue = context.createDefaultQueue();
  CLProgram program = context.createProgram(src).build();

  CLKernel kernel = program.createKernel(PRG_NAME);
  long startTime = System.nanoTime();
  CLIntBuffer out1 = context.createIntBuffer(CLMem.Usage.Output, out,false);
  kernel.setArgs(out1);

  CLEvent kernelCompletion = kernel.enqueueNDRange(queue, new int[]{gsize},new int[]{lsize });
  kernelCompletion.waitFor();
  queue.finish();
  
  // Copy the OpenCL-hosted array back to RAM
  out1.read(queue, output2, true);
  long time = System.nanoTime() - startTime;
  return time;
 }
}

Program output

java -DGLOBAL=64 -DLOCAL=4 com.nativelibs4java.opencl.demos.NDRange1

gl_id     = get_global_id(0)
max       = get_global_size(0)
work_dim  = get_work_dim()
lid       = get_local_id(0)
max       = get_local_size(0)
gr_id     = get_group_id(0)
max       = get_num_groups(0)

gl_id:       0(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       0(max:      16):junk: 9999999
gl_id:       1(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       0(max:      16):junk: 9999999
gl_id:       2(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       0(max:      16):junk: 9999999
gl_id:       3(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       0(max:      16):junk: 9999999
gl_id:       4(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       1(max:      16):junk: 9999999
gl_id:       5(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       1(max:      16):junk: 9999999
gl_id:       6(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       1(max:      16):junk: 9999999
gl_id:       7(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       1(max:      16):junk: 9999999
gl_id:       8(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       2(max:      16):junk: 9999999
gl_id:       9(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       2(max:      16):junk: 9999999
gl_id:      10(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       2(max:      16):junk: 9999999
gl_id:      11(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       2(max:      16):junk: 9999999
gl_id:      12(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       3(max:      16):junk: 9999999
gl_id:      13(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       3(max:      16):junk: 9999999
gl_id:      14(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       3(max:      16):junk: 9999999
gl_id:      15(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       3(max:      16):junk: 9999999
gl_id:      16(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       4(max:      16):junk: 9999999
gl_id:      17(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       4(max:      16):junk: 9999999
gl_id:      18(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       4(max:      16):junk: 9999999
gl_id:      19(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       4(max:      16):junk: 9999999
gl_id:      20(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       5(max:      16):junk: 9999999
gl_id:      21(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       5(max:      16):junk: 9999999
gl_id:      22(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       5(max:      16):junk: 9999999
gl_id:      23(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       5(max:      16):junk: 9999999
gl_id:      24(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       6(max:      16):junk: 9999999
gl_id:      25(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       6(max:      16):junk: 9999999
gl_id:      26(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       6(max:      16):junk: 9999999
gl_id:      27(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       6(max:      16):junk: 9999999
gl_id:      28(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       7(max:      16):junk: 9999999
gl_id:      29(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       7(max:      16):junk: 9999999
gl_id:      30(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       7(max:      16):junk: 9999999
gl_id:      31(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       7(max:      16):junk: 9999999
gl_id:      32(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       8(max:      16):junk: 9999999
gl_id:      33(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       8(max:      16):junk: 9999999
gl_id:      34(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       8(max:      16):junk: 9999999
gl_id:      35(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       8(max:      16):junk: 9999999
gl_id:      36(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       9(max:      16):junk: 9999999
gl_id:      37(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       9(max:      16):junk: 9999999
gl_id:      38(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       9(max:      16):junk: 9999999
gl_id:      39(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       9(max:      16):junk: 9999999
gl_id:      40(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      10(max:      16):junk: 9999999
gl_id:      41(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      10(max:      16):junk: 9999999
gl_id:      42(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      10(max:      16):junk: 9999999
gl_id:      43(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      10(max:      16):junk: 9999999
gl_id:      44(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      11(max:      16):junk: 9999999
gl_id:      45(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      11(max:      16):junk: 9999999
gl_id:      46(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      11(max:      16):junk: 9999999
gl_id:      47(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      11(max:      16):junk: 9999999
gl_id:      48(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      12(max:      16):junk: 9999999
gl_id:      49(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      12(max:      16):junk: 9999999
gl_id:      50(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      12(max:      16):junk: 9999999
gl_id:      51(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      12(max:      16):junk: 9999999
gl_id:      52(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      13(max:      16):junk: 9999999
gl_id:      53(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      13(max:      16):junk: 9999999
gl_id:      54(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      13(max:      16):junk: 9999999
gl_id:      55(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      13(max:      16):junk: 9999999
gl_id:      56(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      14(max:      16):junk: 9999999
gl_id:      57(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      14(max:      16):junk: 9999999
gl_id:      58(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      14(max:      16):junk: 9999999
gl_id:      59(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      14(max:      16):junk: 9999999
gl_id:      60(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      15(max:      16):junk: 9999999
gl_id:      61(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      15(max:      16):junk: 9999999
gl_id:      62(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      15(max:      16):junk: 9999999
gl_id:      63(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      15(max:      16):junk: 9999999
#Global:      64: Local:  4: computed in :      9519 microsec: entries:        64: ns-per-entry:    148744
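
The ids in the listing above are consistent with the usual 1-D OpenCL relationship between global, local and work-group ids. Here is a rough sketch of that arithmetic in plain shell rather than OpenCL. The function name work_item_ids is made up for illustration, and a zero global offset is assumed: lid is gl_id modulo the local size, and gr_id is gl_id divided by the local size.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original program): print the local id
# and work-group id of every work-item in a 1-D range.
# Usage: work_item_ids GLOBAL_SIZE LOCAL_SIZE
work_item_ids() {
    global_size=$1
    local_size=$2
    gl_id=0
    while [ "$gl_id" -lt "$global_size" ]; do
        lid=$((gl_id % local_size))     # position inside the work-group
        gr_id=$((gl_id / local_size))   # index of the work-group itself
        echo "gl_id: $gl_id lid: $lid gr_id: $gr_id"
        gl_id=$((gl_id + 1))
    done
}

# Same sizes as the run above: 64 work-items in groups of 4 (16 groups).
work_item_ids 64 4 | tail -n 1   # prints "gl_id: 63 lid: 3 gr_id: 15"
```

With 64 work-items and a local size of 4 this gives 64 / 4 = 16 work-groups, which matches the "max: 16" column in the output.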

Tuesday, March 16, 2010

129$ Plug Computer with WiFi/Bluetooth/SD-card/eSATA with Linux/JVM

Marvell has recently released new plug computer models. The new models have Wi-Fi and Bluetooth. These plug-sized computers run Linux, can also run Java programs, and consume approximately 5W of power. There are plenty of possible uses for these plugs: home automation, robots, automotive solutions, etc.

Here are the descriptions of the various models:

Model	Features	Cost ($)
SheevaPlug Dev Kit (Original Plug)	1.2GHz Sheeva Core CPU, 512 MB DDR2 RAM, 512 MB NAND Flash, 1x Gb Ethernet, 2x USB 2.0, RTC, GPIO, UART	99
GuruPlug Server - Standard	1.2GHz Sheeva Core CPU, 512 MB DDR2 RAM, 512 MB NAND Flash, 1x Gb Ethernet, 2x USB 2.0, RTC, MicroSD card slot, Wi-Fi 802.11 b/g, Bluetooth and U-SNAP I/O for Home Automation using ZigBee or Z-Wave, Linux Kernel 2.6.32, 7 GPIO	100
GuruPlug Server - PLUS	1.2GHz Sheeva Core CPU, 512 MB DDR2 RAM, 512 MB NAND Flash, 2x Gb Ethernet, 1x eSATA 2.0 port (3 Gbps SATA II), 3x USB 2.0, RTC, MicroSD card slot, Wi-Fi 802.11 b/g, Bluetooth and U-SNAP I/O for Home Automation using ZigBee or Z-Wave, Linux Kernel 2.6.32, 7 GPIO, UART	129
GuruPlug Display	1.2GHz Sheeva Core CPU, 512 MB DDR2 RAM, 2 GB NAND Flash, 1x Gb Ethernet, 1 HDMI connector or Touch Panel Display, 3x USB 2.0, RTC, MicroSD card slot, Wi-Fi 802.11 b/g, Bluetooth, Linux Kernel 2.6.28, 7 GPIO	179

For more info, visit the Globalscale site for GuruPlugs.
Info on the older plug is here.
BTW, I am not in any way related to Globalscale/Marvell :)

Wednesday, March 10, 2010

Latest LZO Compression implemented in Java

I have ported the miniLZO (version 2.03) compression library from C to pure Java.

The initial version of the port is here.

You can try it out using this sample program.

You may refer to the original Java-LZO version (done way back in 1999) here. In the original, only decompression is implemented.

The current java-minilzo jar implements the following methods: compression, decompression and decompression_safe. At present, compression and decompression of zero-filled data and random data work properly. I will be adding more utilities and an API for end users shortly.

Many, many thanks to Markus F.X.J. Oberhumer for his excellent minilzo.c (which is the base for the current Java implementation).

Monday, March 8, 2010

Command line options of clc.exe (ATI Stream SDK)

Save the following in a shell script:


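# Scratch files: the collected clc output and a trivial OpenCL kernel to compile
f=$TMP/ccc.txt
clf=$TMP/q2.cl

cat >"$clf" << EOF
__kernel void helloWorld() {
    size_t i =  get_global_id(0);
}

EOF

# Start with an empty output file, then try every one-letter option prefix
>"$f"
(for i in a b c d e f g h i j k l m n o p q r s t u v w x y z;do ./clc --$i "$clf";done) >>"$f" 2>&1

# Keep the lines that mention an option; drop warnings and lines containing quotes
grep "\-\-" "$f" |grep -vE "^Warning:|\""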

Running the above script against clc.exe (ATI Stream OpenCL SDK 2.01) produced the following command line options:


            --anachronisms
            --auto_instantiation
            --alternative_tokens
            --array_new_and_delete
            --arg_dep_lookup
            --addrspace_cast
            --brief_diagnostics
            --building_runtime
            --bool
            --base_assign_op_is_default
            --dependencies
            --definition_list_file
            --dollar
            --define_macro
            --db_ocl
            --db_name
            --diag_suppress
            --diag_remark
            --diag_warning
            --diag_error
            --diag_once
            --display_error_number
            --distinct_template_signatures
            --designators
            --dep_name
            --defer_parse_function_templates
            --default_calling_convention
            --dump_configuration
            --debuginfo
            --exported_template_file
            --exceptions
            --error_limit
            --error_output
            --explicit
            --extern_inline
            --embedded_c++
            --enum_overloading
            --early_tiebreaker
            --extended_designators
            --extended_variadic_macros
            --export
            --edg_base_dir
            --embedded_c
            --emit
            --force_vtbl
            --far_data_pointers
            --far_code_pointers
            --for_init_diff_warning
            --friend_injection
            --guiding_decls
            --gcc
            --g++
            --gnu_version
            --instantiate
            --ii_file
            --implicit_include
            --include_directory
            --inlining
            --implicit_typename
            --implicit_extern_c_type_conversion
            --import_dir
            --incl_suffixes
            --ignore_std
            --list
            --long_lifetime_temps
            --long_preserving_rules
            --late_tiebreaker
            --long_long
            --list_macros
            --llvm_builtin
            --module_init
            --msvc_target_version
            --microsoft
            --microsoft_version
            --microsoft_bugs
            --microsoft_16
            --multibyte_chars
            --mmmx
            --msse
            --msse2
            --msse3
            --mssse3
            --msse4.1
            --msse4.2
            --msse5
            --march
            --no_line_commands
            --no_anachronisms
            --no_code_gen
            --no_auto_instantiation
            --no_implicit_include
            --no_warnings
            --no_exceptions
            --no_use_before_set_warnings
            --no_display_error_number
            --no_pch_messages
            --no_pch_verbose
            --no_restrict
            --no_microsoft
            --no_microsoft_bugs
            --near_data_pointers
            --near_code_pointers
            --no_wchar_t_keyword
            --no_alternative_tokens
            --no_inlining
            --no_svr4
            --no_brief_diagnostics
            --nonconst_ref_anachronism
            --no_nonconst_ref_anachronism
            --no_preproc_only
            --no_rtti
            --no_bool
            --no_array_new_and_delete
            --no_explicit
            --namespaces
            --no_namespaces
            --no_using_std
            --no_remove_unneeded_entities
            --no_typename
            --no_implicit_typename
            --no_special_subscript_cost
            --new_for_init
            --no_for_init_diff_warning
            --no_distinct_template_signatures
            --no_guiding_decls
            --no_old_specializations
            --no_wrap_diagnostics
            --no_implicit_extern_c_type_conversion
            --no_long_preserving_rules
            --no_extern_inline
            --no_multibyte_chars
            --no_vla
            --no_enum_overloading
            --nonstd_qualifier_deduction
            --no_nonstd_qualifier_deduction
            --no_const_string_literals
            --no_class_name_injection
            --no_arg_dep_lookup
            --no_friend_injection
            --nonstd_using_decl
            --no_nonstd_using_decl
            --no_designators
            --no_extended_designators
            --no_variadic_macros
            --no_extended_variadic_macros
            --no_compound_literals
            --no_base_assign_op_is_default
            --no_dep_name
            --no_parse_templates
            --no_c99
            --no_export
            --no_stdarg_builtin
            --no_gcc
            --no_g++
            --named_address_spaces
            --no_named_address_spaces
            --no_embedded_c
            --no_trigraphs
            --nonstd_default_arg_deduction
            --no_nonstd_default_arg_deduction
            --no_stdc_zero_in_system_headers
            --no_template_typedefs_in_diagnostics
            --no_defer_parse_function_templates
            --no_uliterals
            --no_type_traits_helpers
            --no_c++0x
            --no_check_concatenations
            --noopencl
            --old_line_commands
            --old_c
            --output
            --old_style_preprocessing
            --old_for_init
            --old_specializations
            --preprocess
            --pch
            --pch_messages
            --pch_verbose
            --pch_dir
            --pack_alignment
            --preinclude
            --preinclude_macros
            --pending_instantiations
            --parse_templates
            --remarks
            --restrict
            --rtti
            --remove_unneeded_entities
            --report_gnu_extensions
            --strict
            --strict_warnings
            --signed_chars
            --suppress_instantiation_flags
            --suppress_vtbl
            --short_lifetime_temps
            --svr4
            --special_subscript_cost
            --sys_include
            --short_enums
            --set_flag
            --stdc_zero_in_system_headers
            --signed_bit_fields
            --single_precision_constant
            --trace_includes
            --template_info_file
            --template_directory
            --timing
            --typename
            --trigraphs
            --template_typedefs_in_diagnostics
            --type_traits_helpers
            --unsigned_chars
            --undefine_macro
            --use_pch
            --using_std
            --uliterals
            --unsigned_bit_fields
            --unicode_source_kind
            --version
            --vla
            --variadic_macros
            --varsubscript
            --wchar_t_keyword
            --wrap_diagnostics
            --werror
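
The probing loop used to produce this list can be wrapped in a small reusable function. This is only a sketch: probe_options is a made-up name, and it assumes, as the script does, that the tool prints its candidate options when given an incomplete one-letter option. It sorts the result and removes duplicates, which the original script does not do.

```shell
#!/bin/sh
# Hypothetical helper: probe_options TOOL FILE
# Runs TOOL with each one-letter option prefix (--a .. --z) on FILE and keeps
# the output lines that mention a "--option", dropping warnings and lines
# containing a double quote (the same filters as the script above).
probe_options() {
    tool=$1
    src=$2
    for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        "$tool" "--$letter" "$src" 2>&1
    done | grep -e '--' | grep -vE '^Warning:|"' | sort -u
}

# Example call (assumes ./clc and the kernel file from the script exist):
#   probe_options ./clc "$TMP/q2.cl"
```

Passing the tool as an argument makes it possible to try the same trick on other compilers, though whether it works depends entirely on how each tool reports an incomplete option.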
