Friday, April 30, 2010

Finally ATI Radeon HD5750, CAL, OpenCL worked in Fedora 12

After lots of trials from Dec'09 till now finally, I am able to setup/install ATI Radeon HD5750 card in Fedroa 12 for X-Server and OpenCL.

Installed  latest catalyst  version from AMD/ATI:   ati-driver-installer-10-4-x86.x86_64.run
Installed following Fedora12 packages(for aiding catalyst kernel module compilation):

root@yyyy x86_64]#rpm -qa|grep kernel |grep 2.6.32.11-99
kernel-2.6.32.11-99.fc12.x86_64
kernel-devel-2.6.32.11-99.fc12.x86_64
kernel-firmware-2.6.32.11-99.fc12.noarch

After installing above, ran catalyst setup and made kernel modules
After kernel modules compilation(setting up blacklisted kernel modules as specified in docs), ran aticonfig.


Here is my setup info



uname


Linux phenom1.localdomain 2.6.32.11-99.fc12.x86_64 #1 SMP Mon Apr 5 19:59:38 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux


ATI Stream SDK version
 /opt/ati-stream-sdk-v2.01-rhel64

X Server Info

X.Org X Server 1.7.4
xorg-x11-server 1.7.4-6.fc12


ATI Kernel Module

[root@yyyy x86_64]#lsmod |grep fg
fglrx 2349663 32
$ cat /etc/X11/xorg.conf

 Section "ServerLayout"
        Identifier     "aticonfig Layout"
        Screen      0  "aticonfig-Screen[0]-0" 0 0
EndSection

Section "Files"
EndSection

Section "Module"
EndSection

Section "Monitor"
        Identifier   "aticonfig-Monitor[0]-0"
        Option      "VendorName" "ATI Proprietary Driver"
        Option      "ModelName" "Generic Autodetecting Monitor"
        Option      "DPMS" "true"
EndSection

Section "Device"
        Identifier  "Videocard0"
        Driver      "vesa"
EndSection

Section "Device"
        Identifier  "aticonfig-Device[0]-0"
        Driver      "fglrx"
        BusID       "PCI:1:0:0"
EndSection

Section "Screen"
        Identifier "aticonfig-Screen[0]-0"
        Device     "aticonfig-Device[0]-0"
        Monitor    "aticonfig-Monitor[0]-0"
        DefaultDepth     24
        SubSection "Display"
                Viewport   0 0
                Depth     24
        EndSubSection
EndSection




CLInfo output

[root@yyyy x86_64]#./CLInfo 
Number of platforms:                             1
  Plaform Profile:                               FULL_PROFILE
  Plaform Version:                               OpenCL 1.0 ATI-Stream-v2.0.1
  Plaform Name:                                  ATI Stream                  
  Plaform Vendor:                                Advanced Micro Devices, Inc.
  Plaform Extensions:                    cl_khr_icd                          


  Plaform Name:                                  ATI Stream
Number of devices:                               2         
  Device Type:                                   CL_DEVICE_TYPE_CPU
  Device ID:                                     4098              
  Max compute units:                             4                 
  Max work items dimensions:                     3                 
    Max work items[0]:                           1024              
    Max work items[1]:                           1024              
    Max work items[2]:                           1024              
  Max work group size:                           1024              
  Preferred vector width char:                   16                
  Preferred vector width short:                  8                 
  Preferred vector width int:                    4                 
  Preferred vector width long:                   2                 
  Preferred vector width float:                  4                 
  Preferred vector width double:                 0                 
  Max clock frequency:                           3400Mhz           
  Address bits:                                  64                
  Max memeory allocation:                        1073741824        
  Image support:                                 No                
  Max size of kernel argument:                   4096              
  Alignment (bits) of base address:              32768             
  Minimum alignment (bytes) for any datatype:    128               
  Single precision floating point capability                       
    Denorms:                                     Yes               
    Quiet NaNs:                                  Yes               
    Round to nearest even:                       Yes               
    Round to zero:                               No                
    Round to +ve and infinity:                   No                
    IEEE754-2008 fused multiply-add:             No                
  Cache type:                                    Read/Write        
  Cache line size:                               64                
  Cache size:                                    65536             
  Global memory size:                            3221225472        
  Constant buffer size:                          65536             
  Max number of constant args:                   8                 
  Local memory type:                             Global            
  Local memory size:                             32768             
  Profiling timer resolution:                    1                 
  Device endianess:                              Little            
  Available:                                     Yes               
  Compiler available:                            Yes               
  Execution capabilities:                                          
    Execute OpenCL kernels:                      Yes               
    Execute native function:                     No                
  Queue properties:                                                
    Out-of-Order:                                No                
    Profiling :                                  Yes               
  Platform ID:                                   0x7f91992dd4a8    
  Name:                                          AMD Phenom(tm) II X4 965 Processor
  Vendor:                                        AuthenticAMD                      
  Driver version:                                1.0                               
  Profile:                                       FULL_PROFILE                      
  Version:                                       OpenCL 1.0 ATI-Stream-v2.0.1      
  Extensions:                                    cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store                                                                                                               
  Device Type:                                   CL_DEVICE_TYPE_GPU                                                                                                                                  
  Device ID:                                     4098                                                                                                                                                
  Max compute units:                             9                                                                                                                                                   
  Max work items dimensions:                     3                                                                                                                                                   
    Max work items[0]:                           256                                                                                                                                                 
    Max work items[1]:                           256                                                                                                                                                 
    Max work items[2]:                           256                                                                                                                                                 
  Max work group size:                           256                                                                                                                                                 
  Preferred vector width char:                   16                                                                                                                                                  
  Preferred vector width short:                  8                                                                                                                                                   
  Preferred vector width int:                    4                                                                                                                                                   
  Preferred vector width long:                   2                                                                                                                                                   
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Max clock frequency:                           700Mhz
  Address bits:                                  32
  Max memeory allocation:                        268435456
  Image support:                                 No
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               No
    Round to +ve and infinity:                   No
    IEEE754-2008 fused multiply-add:             No
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            268435456
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             32768
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   0x7f91992dd4a8
  Name:                                          Juniper
  Vendor:                                        Advanced Micro Devices, Inc.
  Driver version:                                CAL 1.4.635
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.0 ATI-Stream-v2.0.1
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics


[root@yyyy x86_64]

Wednesday, April 21, 2010

Optimum global/local work size for a given OpenCL kernel

With on going study of finding better value for Global/Local work size for OpenCL kernels, here is small program in Java using OpenCL library from JavaCL.

  • Program does check all combinations of Global/Local sizes
  • Global size's range is GMIN to GMAX
  • Local size's range is LMIN to LMAX
  • For ever iteration, Global sizes are multiplied by 2

  • Time in micro-secs for each loop is noted by computed in
  • Time taken in nano-secs for each entry is noted bye ns-per-entry
  • Observations:

    Usage: java [-DGMIN=16] [-DGMAX=65536] [-DLMIN=1] [-DLMAX=2] [-DDEBUG=false] com.nativelibs4java.opencl.demos.NDRange2
    #Global:      16: Local:  1: computed in :     22477 microsec: entries:        16: ns-per-entry:   1404822
    #Global:      16: Local:  2: computed in :      4207 microsec: entries:        16: ns-per-entry:    262987
    #Global:      32: Local:  1: computed in :      4172 microsec: entries:        32: ns-per-entry:    130384
    #Global:      32: Local:  2: computed in :      4194 microsec: entries:        32: ns-per-entry:    131065
    #Global:      64: Local:  1: computed in :      4070 microsec: entries:        64: ns-per-entry:     63603
    #Global:      64: Local:  2: computed in :      6431 microsec: entries:        64: ns-per-entry:    100497
    #Global:     128: Local:  1: computed in :      4863 microsec: entries:       128: ns-per-entry:     37993
    #Global:     128: Local:  2: computed in :      4537 microsec: entries:       128: ns-per-entry:     35446
    #Global:     256: Local:  1: computed in :      4079 microsec: entries:       256: ns-per-entry:     15936
    #Global:     256: Local:  2: computed in :      7222 microsec: entries:       256: ns-per-entry:     28211
    #Global:     512: Local:  1: computed in :      4155 microsec: entries:       512: ns-per-entry:      8116
    #Global:     512: Local:  2: computed in :      4095 microsec: entries:       512: ns-per-entry:      7999
    #Global:    1024: Local:  1: computed in :      4194 microsec: entries:      1024: ns-per-entry:      4095
    #Global:    1024: Local:  2: computed in :      8201 microsec: entries:      1024: ns-per-entry:      8009
    #Global:    2048: Local:  1: computed in :      4528 microsec: entries:      2048: ns-per-entry:      2211
    #Global:    2048: Local:  2: computed in :      4173 microsec: entries:      2048: ns-per-entry:      2037
    #Global:    4096: Local:  1: computed in :      4428 microsec: entries:      4096: ns-per-entry:      1081
    #Global:    4096: Local:  2: computed in :      9895 microsec: entries:      4096: ns-per-entry:      2415
    #Global:    8192: Local:  1: computed in :      4933 microsec: entries:      8192: ns-per-entry:       602
    #Global:    8192: Local:  2: computed in :      5058 microsec: entries:      8192: ns-per-entry:       617
    #Global:   16384: Local:  1: computed in :      5595 microsec: entries:     16384: ns-per-entry:       341
    #Global:   16384: Local:  2: computed in :     10664 microsec: entries:     16384: ns-per-entry:       650
    #Global:   32768: Local:  1: computed in :      7050 microsec: entries:     32768: ns-per-entry:       215
    #Global:   32768: Local:  2: computed in :      5615 microsec: entries:     32768: ns-per-entry:       171
    #Global:   65536: Local:  1: computed in :     10011 microsec: entries:     65536: ns-per-entry:       152
    #Global:   65536: Local:  2: computed in :     13677 microsec: entries:     65536: ns-per-entry:       208
    
    

    Java Source:

    package com.nativelibs4java.opencl.demos;
    import static com.nativelibs4java.opencl.JavaCL.createBestContext;
    import java.io.*;
    import java.nio.*;
    import com.nativelibs4java.opencl.*;
    import com.nativelibs4java.util.*;
    /* This class runs an OpenCL kernel in loops with various combinations of global-size and local-sizes.
     * By varying the global-size and local-size values, one can find out optimum values for global/local sizes
     * for a given kernel.
     * 
     *   @author GSS Mahadevan
     *  */
    public class NDRange2 {
     private static final String PRG_NAME="ndrange2";
     private static final int ITEMS=8;// number of ints updated in this kernel
     private static final String usage="Usage: java [-DGMIN=16] [-DGMAX=65536] [-DLMIN=1] [-DLMAX=2] " +
       "[-DDEBUG=false] "+NDRange2.class.getName()+"\n";
     
     private static final String src = "__kernel void "+ PRG_NAME
             + "("
       + "   __global int* output                                             \n"
       + "   )                                           \n"
       + "{                                                                      \n"
       + "   int i = get_global_id(0)*8;                               \n"
       + "   output[i] = get_global_id(0);                                \n"
       + "   output[i+1] = get_global_size(0);                                \n"
       + "   output[i+2] = get_work_dim();                                \n"
       + "   output[i+3] = get_local_id(0);                                \n"
       + "   output[i+4] = get_local_size(0);                                \n"
       + "   output[i+5] = get_group_id(0);                                \n"
       + "   output[i+6] = get_num_groups(0);                                \n"
       + "   output[i+7] = 9999999;                                \n"
       + "}                                                                      \n"
       + "\n";
     private static final int GMIN = Integer.getInteger("GMIN", 16);
     private static final int GMAX = Integer.getInteger("GMAX", 65536);
     
     private static final int LMIN = Integer.getInteger("LMIN", 1);
     private static final int LMAX = Integer.getInteger("LMAX", 2);
     
     private static final boolean DEBUG = Boolean.parseBoolean(System.getProperty("DEBUG", "false"));
     
     private static final int G_SIZE_MAX = GMAX * 8; // multiplied by  just for safety 
    
     private static IntBuffer output = NIOUtils.directInts(G_SIZE_MAX);
     private static IntBuffer output2 = NIOUtils.directInts(G_SIZE_MAX);
     
     public static class OCL{
      public final CLProgram program;
      public final CLQueue queue;
      public final CLContext context;
      public final CLKernel kernel;
      public OCL(String src,String kernelName) throws CLBuildException{
       SetupUtils.failWithDownloadProposalsIfOpenCLNotAvailable();
       context = createBestContext();
       queue = context.createDefaultQueue();
       program = context.createProgram(src).build();
       kernel = program.createKernel(kernelName);
      }
     }
     public static void main(String[] args) {
      System.out.println(usage);
      try {
       OCL ocl = new OCL(src,PRG_NAME);
       for(int g=GMIN;g <= GMAX; g *= 2){
        for(int l=LMIN;l <= LMAX; l++){
         for (int i = 0; i < G_SIZE_MAX; i++)
          output.put(i, Integer.MIN_VALUE);
         long time = executeKernel(ocl,output,  g, l);
         int count = 0;
         IntBuffer O = output2;
         for (int i = 0; i < G_SIZE_MAX; i++) {
          int v = O.get(i);
          if (v != Integer.MIN_VALUE) {
           count += 8;
           if(DEBUG) System.out.printf("gl_id:%8d(max:%8d), work_dim:%3d: lid:%2d(max:%2d): gr_id:%8d(max:%8d):junk:%8d\n",
                        v,O.get(i+1),O.get(i+2), O.get(i+3), O.get(i+4), O.get(i+5), O.get(i+6),O.get(i+7));
           i += 7;
          } 
         }
         System.out.printf("#Global:%8d: Local:%3d: computed in :%10d microsec: entries:%10d: ns-per-entry:%10d\n",
                         g,l, (time / 1000), count/ITEMS,(time/g));
        }
       }
      } catch (Exception e) {
       System.err.println(e);
       e.printStackTrace();
      }
     }
    
     private static long executeKernel(OCL ocl, IntBuffer out, int gsize, int lsize)
       throws IOException {
      long startTime = System.nanoTime();
      CLIntBuffer out1 = ocl.context.createIntBuffer(CLMem.Usage.Output, out,false);
      ocl.kernel.setArgs(out1);
      
      CLEvent kernelCompletion = ocl.kernel.enqueueNDRange(ocl.queue, new int[]{gsize},new int[]{lsize });
      kernelCompletion.waitFor();
      ocl.queue.finish();
      // Copy the OpenCL-hosted array back to RAM
      out1.read(ocl.queue, output2, true);
      long time = System.nanoTime() - startTime;
      return time;
     }
    }
    

    Information about OpenCL Global size and Local size dimensions

    To understand more about global/local work sizes in OpenCL API clEnqueueNDRangeKernel, I wrote small program in Java using nice nativelibs4java library at JavaCL from Olivier Chafik. Some more links on NDRange are:
    Understanding NDRange

    Java program

    package com.nativelibs4java.opencl.demos;
    
    import static com.nativelibs4java.opencl.JavaCL.createBestContext;
    import java.io.*;
    import java.nio.*;
    import com.nativelibs4java.opencl.*;
    import com.nativelibs4java.util.*;
    /* Usage: java [-DGLOBAL=256] [-DLOCAL=1] com.nativelibs4java.opencl.demos.NDRange1 */
    public class NDRange1 {
     private static final String PRG_NAME="ndrange1";
     private static final int ITEMS=8;// number of ints updated in kernel
     
     private static final String src = "__kernel void "+ PRG_NAME
        + "("
     + "   __global int* output                                             \n"
     + "   )                                           \n"
     + "{                                                                      \n"
     + "   int i = get_global_id(0)*8;                               \n"
     + "   output[i] = get_global_id(0);                                \n"
     + "   output[i+1] = get_global_size(0);                                \n"
     + "   output[i+2] = get_work_dim();                                \n"
     + "   output[i+3] = get_local_id(0);                                \n"
     + "   output[i+4] = get_local_size(0);                                \n"
     + "   output[i+5] = get_group_id(0);                                \n"
     + "   output[i+6] = get_num_groups(0);                                \n"
     + "   output[i+7] = 9999999;                                \n"
     + "}                                                                      \n"
     + "\n";
     private static final int G_SIZE = Integer.getInteger("GLOBAL", 256);
     private static final int L_SIZE = Integer.getInteger("LOCAL", 4);
     private static final boolean DEBUG = Boolean.parseBoolean(System.getProperty("DEBUG", "true"));
     
     private static final int G_SIZE_MAX = G_SIZE * 128; // multiplied by  just for safety 
    
     private static IntBuffer output = NIOUtils.directInts(G_SIZE_MAX);
     private static IntBuffer output2 = NIOUtils.directInts(G_SIZE_MAX);
     public static void main(String[] args) {
      try {
       SetupUtils.failWithDownloadProposalsIfOpenCLNotAvailable();
       for (int i = 0; i < G_SIZE_MAX; i++)
        output.put(i, Integer.MIN_VALUE);
    
       long time = buildAndExecuteKernel(output, src, G_SIZE, L_SIZE);
       
       int count = 0;
       IntBuffer O = output2;
       for (int i = 0; i < G_SIZE_MAX; i++) {
        int v = O.get(i);
        if (v != Integer.MIN_VALUE) {
         count += 8;
         // junk value is printed to check correct ness
         if(DEBUG) System.out.printf("gl_id:%8d(max:%8d), work_dim:%3d: lid:%2d(max:%2d): gr_id:%8d(max:%8d):junk:%8d\n",
                      v,O.get(i+1),O.get(i+2), O.get(i+3), O.get(i+4), O.get(i+5), O.get(i+6),O.get(i+7));
         i += 7;
        } 
       }
       System.out.printf("#Global:%8d: Local:%3d: computed in :%10d microsec: entries:%10d: ns-per-entry:%10d\n",
                    G_SIZE,L_SIZE, (time / 1000), count/ITEMS,(time/G_SIZE));
      } catch (Exception e) {
       System.err.println(e);
       e.printStackTrace();
      }
     }
    
     private static long buildAndExecuteKernel(IntBuffer out, String src, int gsize, int lsize)
       throws CLBuildException, IOException {
      CLContext context = createBestContext();
      CLQueue queue = context.createDefaultQueue();
      CLProgram program = context.createProgram(src).build();
    
      CLKernel kernel = program.createKernel(PRG_NAME);
      long startTime = System.nanoTime();
      CLIntBuffer out1 = context.createIntBuffer(CLMem.Usage.Output, out,false);
      kernel.setArgs(out1);
    
      CLEvent kernelCompletion = kernel.enqueueNDRange(queue, new int[]{gsize},new int[]{lsize });
      kernelCompletion.waitFor();
      queue.finish();
      
      // Copy the OpenCL-hosted array back to RAM
      out1.read(queue, output2, true);
      long time = System.nanoTime() - startTime;
      return time;
     }
    }
    

    Program output

    java -DGLOBAL=64 -DLOCAL=4 com.nativelibs4java.opencl.demos.NDRange1
    
    gl_id     = get_global_id(0)
    max       = get_global_size(0)
    work_dim  = get_work_dim()
    lid       = get_local_id(0)
    max       = get_local_size(0)
    gr_id     = get_group_id(0)
    max       = get_num_groups(0)
    
    gl_id:       0(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       0(max:      16):junk: 9999999
    gl_id:       1(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       0(max:      16):junk: 9999999
    gl_id:       2(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       0(max:      16):junk: 9999999
    gl_id:       3(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       0(max:      16):junk: 9999999
    gl_id:       4(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       1(max:      16):junk: 9999999
    gl_id:       5(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       1(max:      16):junk: 9999999
    gl_id:       6(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       1(max:      16):junk: 9999999
    gl_id:       7(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       1(max:      16):junk: 9999999
    gl_id:       8(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       2(max:      16):junk: 9999999
    gl_id:       9(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       2(max:      16):junk: 9999999
    gl_id:      10(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       2(max:      16):junk: 9999999
    gl_id:      11(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       2(max:      16):junk: 9999999
    gl_id:      12(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       3(max:      16):junk: 9999999
    gl_id:      13(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       3(max:      16):junk: 9999999
    gl_id:      14(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       3(max:      16):junk: 9999999
    gl_id:      15(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       3(max:      16):junk: 9999999
    gl_id:      16(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       4(max:      16):junk: 9999999
    gl_id:      17(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       4(max:      16):junk: 9999999
    gl_id:      18(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       4(max:      16):junk: 9999999
    gl_id:      19(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       4(max:      16):junk: 9999999
    gl_id:      20(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       5(max:      16):junk: 9999999
    gl_id:      21(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       5(max:      16):junk: 9999999
    gl_id:      22(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       5(max:      16):junk: 9999999
    gl_id:      23(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       5(max:      16):junk: 9999999
    gl_id:      24(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       6(max:      16):junk: 9999999
    gl_id:      25(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       6(max:      16):junk: 9999999
    gl_id:      26(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       6(max:      16):junk: 9999999
    gl_id:      27(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       6(max:      16):junk: 9999999
    gl_id:      28(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       7(max:      16):junk: 9999999
    gl_id:      29(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       7(max:      16):junk: 9999999
    gl_id:      30(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       7(max:      16):junk: 9999999
    gl_id:      31(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       7(max:      16):junk: 9999999
    gl_id:      32(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       8(max:      16):junk: 9999999
    gl_id:      33(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       8(max:      16):junk: 9999999
    gl_id:      34(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       8(max:      16):junk: 9999999
    gl_id:      35(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       8(max:      16):junk: 9999999
    gl_id:      36(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:       9(max:      16):junk: 9999999
    gl_id:      37(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:       9(max:      16):junk: 9999999
    gl_id:      38(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:       9(max:      16):junk: 9999999
    gl_id:      39(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:       9(max:      16):junk: 9999999
    gl_id:      40(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      10(max:      16):junk: 9999999
    gl_id:      41(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      10(max:      16):junk: 9999999
    gl_id:      42(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      10(max:      16):junk: 9999999
    gl_id:      43(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      10(max:      16):junk: 9999999
    gl_id:      44(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      11(max:      16):junk: 9999999
    gl_id:      45(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      11(max:      16):junk: 9999999
    gl_id:      46(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      11(max:      16):junk: 9999999
    gl_id:      47(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      11(max:      16):junk: 9999999
    gl_id:      48(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      12(max:      16):junk: 9999999
    gl_id:      49(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      12(max:      16):junk: 9999999
    gl_id:      50(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      12(max:      16):junk: 9999999
    gl_id:      51(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      12(max:      16):junk: 9999999
    gl_id:      52(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      13(max:      16):junk: 9999999
    gl_id:      53(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      13(max:      16):junk: 9999999
    gl_id:      54(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      13(max:      16):junk: 9999999
    gl_id:      55(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      13(max:      16):junk: 9999999
    gl_id:      56(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      14(max:      16):junk: 9999999
    gl_id:      57(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      14(max:      16):junk: 9999999
    gl_id:      58(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      14(max:      16):junk: 9999999
    gl_id:      59(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      14(max:      16):junk: 9999999
    gl_id:      60(max:      64), work_dim:  1: lid: 0(max: 4): gr_id:      15(max:      16):junk: 9999999
    gl_id:      61(max:      64), work_dim:  1: lid: 1(max: 4): gr_id:      15(max:      16):junk: 9999999
    gl_id:      62(max:      64), work_dim:  1: lid: 2(max: 4): gr_id:      15(max:      16):junk: 9999999
    gl_id:      63(max:      64), work_dim:  1: lid: 3(max: 4): gr_id:      15(max:      16):junk: 9999999
    #Global:      64: Local:  4: computed in :      9519 microsec: entries:        64: ns-per-entry:    148744