Thursday, March 14, 2013

Nexus 4 is running with OpenCL programs using aparapi

After the initial OpenCL programs
  clinfo
  helloworld
  helloworld2

on my Nexus 4 phone, I am trying to use some existing Java-OpenCL frameworks (like JOCL, JavCL, OpenCL4Java  and aparapi) for the Android based OpenCL developement.

So aftersome initial studies, fixed on AMD's aparapi and started porting the X86/X86_64 OpenCL-Java framework to any Android-based phone with OpenCL implementation (on GPU).

As part of helloworld2, I have mapped all OpenCL API methods using generic way irrespective of the underlying OpenCL SDK vendor.

Now I ventured to port aparapi from X86/X86_64 to ARM (on android ). To simplify the porting exercise, I took svn release version 68 as starting point for aparapi project.

Need to hack following of the code to accommodate Qualcomm OpenCL SDK on the Android on ARM:
  • Need to use reflection to get sun.misc.Unsafe.
    • Not all Unsafe methods are available on Android version
      • So no float, double, boolean and their corresponding array versions
    • Apache Unsafe internal field names are different than Sun's Unsafe
  • Removed some hard codings on JNI code, so that it works for QUALCOMM SDK also
  • QualComm SDK crashes the Dalvik-VM, if the OpenCL-Kernel is in inner class (and field names contains '$')
  • Current AMD OpenCL code generation depends on class-files availability on classpath. In Android it is dex-file, so need to package the original classpath also into apk file and access it via raw resource

So after all these major changes, the OpenCL program totally written in java worked on my Nexus 4. Once I get time, I will cleanup code and submit my code to my site

Here is the log from my Nexus 4 for this whole exercise (with generated OpenCL in logs):

I/System.out( 2379): getClassBytes Needed for :org/aopencl/aparapi/MyKernel.class , res-id:7f040000
I/System.out( 2379): getClassBytes:java.util.zip.ZipEntry, getting bytes for:org/aopencl/aparapi/MyKernel.class
D/dalvikvm( 2379): GC_CONCURRENT freed 394K, 5% free 9008K/9444K, paused 2ms+3ms, total 16ms
I/System.out( 2379): Got :org/aopencl/aparapi/MyKernel.class, size:-1
I/System.out( 2379): extractBytes:710
D/aopencl ( 2379): Trying shared libraries at following locations:
D/aopencl ( 2379): /system/lib/libOpenCL.so
D/aopencl ( 2379): /system/vendor/lib/egl/libGLES_mali.so
D/aopencl ( 2379): /system/lib/libllvm-a3xx.so
D/aopencl ( 2379): Using the Shared library:/system/lib/libOpenCL.so
E/aparapi ( 2379): platform 0 QUALCOMM
E/aparapi ( 2379): CL_DEVICE_MAX_COMPUTE_UNITS 4
E/aparapi ( 2379): CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
E/aparapi ( 2379): CL_DEVICE_MAX_WORK_GROUP_SIZE 256
E/aparapi ( 2379): CL_DEVICE_GLOBAL_MEM_SIZE 1910623136
E/aparapi ( 2379): CL_DEVICE_LOCAL_MEM_SIZE 1910623168
E/aparapi ( 2379): device[0x722d6d50]: Type:
E/aparapi ( 2379): GPU
E/aparapi ( 2379):
E/aparapi ( 2379):
E/aparapi ( 2379): JNIContext 7289b008
E/aparapi ( 2379): JNIContext valid:1
W/aparapi ( 2379): getExtensions JNIContext :7289b008
W/aparapi ( 2379): getExtensions2 JNIContext :cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_qcom_default_relaxed_math cl_qcom_perf_monitor
D/KernelRunner( 2379): Capabilities initialized to :[cl_qcom_perf_monitor, cl_khr_byte_addressable_store, cl_qcom_default_relaxed_math, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics, cl_khr_fp16]
D/KernelRunner( 2379): passed requiresDoublePragma
D/KernelRunner( 2379): passed requiresByteAddressableStorePragma
D/KernelRunner( 2379): passed requiresAtomic32Pragma
D/KernelRunner( 2379): Gen-OpenCL:typedef struct This_s{
D/KernelRunner( 2379): __global int *squares;
D/KernelRunner( 2379): __global int *values;
D/KernelRunner( 2379): int passid;
D/KernelRunner( 2379): }This;
D/KernelRunner( 2379): int get_pass_id(This *this){
D/KernelRunner( 2379): return this->passid;
D/KernelRunner( 2379): }
D/KernelRunner( 2379): __kernel void run(
D/KernelRunner( 2379): __global int *squares,
D/KernelRunner( 2379): __global int *values,
D/KernelRunner( 2379): int passid
D/KernelRunner( 2379): ){
D/KernelRunner( 2379): This thisStruct;
D/KernelRunner( 2379): This* this=&thisStruct;
D/KernelRunner( 2379): this->squares = squares;
D/KernelRunner( 2379): this->values = values;
D/KernelRunner( 2379): this->passid = passid;
D/KernelRunner( 2379): {
D/KernelRunner( 2379): int gid = get_global_id(0);
D/KernelRunner( 2379): this->squares[gid] = this->values[gid] + (this->values[gid] / 2);
D/KernelRunner( 2379): return;
D/KernelRunner( 2379): }
D/KernelRunner( 2379): }
E/aparapi ( 2379): clBuildProgram start clBuildProgram-ptr:722c533d, cl-prgm-ptr:735e4b38, dev-idc:1, devs:735d6ac8
E/aparapi ( 2379): clBuildProgram end:0
D/KernelRunner( 2379): passed buildProgramJNI
D/KernelRunner( 2379): arg 0, squares, type=1688, primitiveSize=4
D/KernelRunner( 2379): arg 1, values, type=1288, primitiveSize=4
E/aparapi ( 2379): in setArgs arg 0 squares type 00001688
E/aparapi ( 2379): in setArgs arg 1 values type 00001288
D/KernelRunner( 2379): saw newArrayRef for squares = [I@4248aa40, newArrayLen = 32
D/KernelRunner( 2379): saw newArrayRef for values = [I@4248a9a8, newArrayLen = 32
E/aparapi ( 2379): for globalSize=32, stepping localSize from 32, returning localSize=32
D/KernelRunner( 2379): Need to resync arrays on org.aopencl.aparapi.MyKernel
E/aparapi ( 2379): got type for squares: 00001688
E/aparapi ( 2379): testing for Resync javaArray squares: old=0x0, new=0x2e900005
E/aparapi ( 2379): Resync javaArray for squares: 0x2e900005 0x0
E/aparapi ( 2379): NewWeakGlobalRef for squares, set to 0x1d600003
E/aparapi ( 2379): updateKernel, args[0].sizeInBytes=128
E/aparapi ( 2379): got type for values: 00001288
E/aparapi ( 2379): testing for Resync javaArray values: old=0x0, new=0x1e300009
E/aparapi ( 2379): Resync javaArray for values: 0x1e300009 0x0
E/aparapi ( 2379): NewWeakGlobalRef for values, set to 0x1d600007
E/aparapi ( 2379): updateKernel, args[1].sizeInBytes=128
E/aparapi ( 2379): back from updateKernel
E/aparapi ( 2379): got type for arg 0, squares, type=00001688
E/aparapi ( 2379): runKernel: arrayOrBuf ref 0x1d600003, oldAddr=0x0, newAddr=0x4248aa50, ref.mem=0x0, isArray=1
E/aparapi ( 2379): at memory addr 0x4248aa50, contents:
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379):
D/aparapi ( 2379): squares 0 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, size=00000080 bytes, address=4248aa50, &status)
E/aparapi ( 2379): writing buffer 0 squares
E/aparapi ( 2379): got type for arg 1, values, type=00001288
E/aparapi ( 2379): runKernel: arrayOrBuf ref 0x1d600007, oldAddr=0x0, newAddr=0x4248a9b8, ref.mem=0x0, isArray=1
E/aparapi ( 2379): at memory addr 0x4248a9b8, contents:
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 01
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379): 00
E/aparapi ( 2379):
D/aparapi ( 2379): values 1 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, size=00000080 bytes, address=4248a9b8, &status)
E/aparapi ( 2379): writing buffer 1 values
E/aparapi ( 2379): reading buffer 0 squares
D/KernelRunner( 2379): executeOpenCL completed. _globalSize=32
I/System.out( 2379): Execution mode=GPU

Friday, March 8, 2013

OpenCL Helloworld program on Nexus 4 device

After getting the CLInfo for Nexus Devices, I tried to map all OpenCL symbols for writing full fledged OpenCL applications on Nexu 4 devices.

As part of this exercise, created mappings to all OpenCL exported functions and wrote one helloword example. With this mapping it is fully possible to write any OpenCL 1.1 specification based programs on C/C++ on Nexus-4 phones.

Code is hosted at aopencl project .

For a change this time used program to create command line executable at Nexus 4 phone. So to execute, one needs Android-NDK needs to be installed along with Android-SDK. More instructions are in README.

BTW, output from the Devices comes likes this:


shell@android:/data/local/tmp $ ./helloworld                                   
Loaded library name:/system/lib/libOpenCL.so
input string: GdkknVnqkc
output string: HelloWorld
shell@android:/data/local/tmp $


As we can see from output, .so file mapped is: /system/lib/libOpenCL.so


Tuesday, March 5, 2013

CLInfo for Nexus 4

Finally I submitted the changed code at  google-code.

All Nexus-4 (possibly Nexus 10) users can try the application in download section at the same place

In future, I will add more programs for Nexus 4 at same location.

Happy coding :)

Monday, March 4, 2013

CLInfo on Nexus 4 phone

Recently bought Nexus 4 phone for the 4 cores and OpenCL support.

So wanted to do something on this phone & found Rahul Garg is doing some R&D on the Nexus devices.

So take the code from there and modified it include more information for getting extra details from Nexus phone.

Here are screen snapshots for CLInfo for Nexus 4:




Here is the extra information that I got from 'adb logcat'.

I/ActivityManager( 510): Start proc org.codedivine.testcln10 for activity org.codedivine.testcln10/.MainActivity: pid=15181 uid=10101 gids={50101, 1028}
D/dalvikvm(15181): Trying to load lib /data/app-lib/org.codedivine.testcln10-1/libhelloCL.so 0x4215d220
D/dalvikvm(15181): Added shared lib /data/app-lib/org.codedivine.testcln10-1/libhelloCL.so 0x4215d220
D/dalvikvm(15181): No JNI_OnLoad found in /data/app-lib/org.codedivine.testcln10-1/libhelloCL.so 0x4215d220, skipping init
D/overlay ( 159): Set pipe=VG1 dpy=0;
D/overlay ( 159): Unset pipe=VG0 dpy=0; Unset pipe=VG1 dpy=0; Unset pipe=RGB1 dpy=0;
D/libEGL (15181): loaded /system/lib/egl/libEGL_adreno200.so
I/CLInfo (15181): CLInfo:QUALCOMM Snapdragon(TM)
I/CLInfo (15181):
I/CLInfo (15181): 0:QUALCOMM Adreno(TM)
I/CLInfo (15181): Max Compute Units: 4
I/CLInfo (15181): Max work group size: 256
I/CLInfo (15181): Max Work Item Dimensions: 3
I/CLInfo (15181): Max work item dimensions: 3
I/CLInfo (15181): Max work items: (256,256,256)
I/CLInfo (15181): Preferred vector width char: 1
I/CLInfo (15181): Preferred vector width short: 1
I/CLInfo (15181): Preferred vector width int: 1
I/CLInfo (15181): Preferred vector width long: 1
I/CLInfo (15181): Preferred vector width float: 1
I/CLInfo (15181): Preferred vector width double: 0
I/CLInfo (15181): Max clock frequency: 325MHz
I/CLInfo (15181): Address bits: 32bits
I/CLInfo (15181): Max memory allocation: 981506048 bytes
I/CLInfo (15181): Image support: True
I/CLInfo (15181): Max size of kernel argument: 256
I/CLInfo (15181): Alignment of base addres: 512 bits
I/CLInfo (15181): Minimum alignment for any datatype: 64 bytes
I/CLInfo (15181): Denorms: False
I/CLInfo (15181): Quiet NaNs: False
I/CLInfo (15181): Round to nearest even: False
I/CLInfo (15181): Round to zero: False
I/CLInfo (15181): Round to +ve and infinity: False
I/CLInfo (15181): IEEE754-2008 fused multiply-add: False
I/CLInfo (15181): Cache type err:0
I/CLInfo (15181): Cache line size: 16 bytes
I/CLInfo (15181): Cache size: 32768 bytes
I/CLInfo (15181): Global memory size: 1963012096 bytes
I/CLInfo (15181): Constant buffer size: 4096 bytes
I/CLInfo (15181): Max number of constant args: 8
I/CLInfo (15181): Local memory type err:0
I/CLInfo (15181): Local memory size: 8192 bytes
I/CLInfo (15181): Profiling timer resolution: 1000
I/CLInfo (15181): Device endianess: True
I/CLInfo (15181): Available: True
I/CLInfo (15181): Compiler available: False
I/CLInfo (15181): Execution capabilities:
I/CLInfo (15181): Execute OpenCL kernels: True
I/CLInfo (15181): Execute native kernels: True
I/CLInfo (15181): Queue properties:
I/CLInfo (15181): Out-of-Order: False
I/CLInfo (15181): Profiling: False
I/CLInfo (15181): Name: QUALCOMM Adreno(TM)
I/CLInfo (15181): Vendor: QUALCOMM
I/CLInfo (15181): Error:CL_DRIVER_VERSION:-30
I/CLInfo (15181): Profile: EMBEDDED_PROFILE
I/CLInfo (15181): Version: OpenCL 1.1 Adreno(TM) 320
I/CLInfo (15181): Extensions: cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_qcom_default_relaxed_math cl_qcom_perf_monitor
I/CLInfo (15181):
I/CLInfo (15181): 1:QUALCOMM Krait(TM)
I/CLInfo (15181): Max Compute Units: 4
I/CLInfo (15181): Max work group size: 1
I/CLInfo (15181): Max Work Item Dimensions: 3
I/CLInfo (15181): Max work item dimensions: 3
I/CLInfo (15181): Max work items: (1,1,1)
I/CLInfo (15181): Preferred vector width char: 16
I/CLInfo (15181): Preferred vector width short: 16
I/CLInfo (15181): Preferred vector width int: 16
I/CLInfo (15181): Preferred vector width long: 1
I/CLInfo (15181): Preferred vector width float: 16
I/CLInfo (15181): Preferred vector width double: 0
I/CLInfo (15181): Max clock frequency: 1500MHz
I/CLInfo (15181): Address bits: 32bits
I/CLInfo (15181): Max memory allocation: 981506048 bytes
I/CLInfo (15181): Image support: True
I/CLInfo (15181): Max size of kernel argument: 256
I/CLInfo (15181): Alignment of base addres: 1024 bits
I/CLInfo (15181): Minimum alignment for any datatype: 128 bytes
I/CLInfo (15181): Denorms: False
I/CLInfo (15181): Quiet NaNs: False
I/CLInfo (15181): Round to nearest even: False
I/CLInfo (15181): Round to zero: False
I/CLInfo (15181): Round to +ve and infinity: False
I/CLInfo (15181): IEEE754-2008 fused multiply-add: False
I/CLInfo (15181): Cache type err:0
I/CLInfo (15181): Cache line size: 64 bytes
I/CLInfo (15181): Cache size: 1048576 bytes
I/CLInfo (15181): Global memory size: 981506048 bytes
I/CLInfo (15181): Constant buffer size: 65536 bytes
I/CLInfo (15181): Max number of constant args: 8
I/CLInfo (15181): Local memory type err:0
I/CLInfo (15181): Local memory size: 32768 bytes
I/CLInfo (15181): Profiling timer resolution: 0
I/CLInfo (15181): Device endianess: True
I/CLInfo (15181): Available: True
I/CLInfo (15181): Compiler available: True
I/CLInfo (15181): Execution capabilities:
I/CLInfo (15181): Execute OpenCL kernels: False
I/CLInfo (15181): Execute native kernels: False
I/CLInfo (15181): Queue properties:
I/CLInfo (15181): Out-of-Order: False
I/CLInfo (15181): Profiling: False
I/CLInfo (15181): Name: QUALCOMM Krait(TM)
I/CLInfo (15181): Vendor: QUALCOMM
I/CLInfo (15181): Error:CL_DRIVER_VERSION:-30
I/CLInfo (15181): Profile: EMBEDDED_PROFILE
I/CLInfo (15181): Version: OpenCL 1.1 Krait(TM)
I/CLInfo (15181): Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cles_khr_int64 cl_khr_fp16




I will post the code for the above shortly.

Special Thanks: