NVIDIA PTX generation and modification
Not sure how low level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified and reused in a few ways:
OpenCL clGetProgramInfo(program, CL_PROGRAM_BINARIES
+ clCreateProgramWithBinary
: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?
These are standardized OpenCL API's, which produce and consume implementation defined formats, which in driver version 375.39 for Linux happens to be human readable PTX.
So you can dump the PTX, modify it, and reload.
nvcc
: can compile CUDA GPU-side code to ptx assembly simply with either:
nvcc --ptx a.cu
nvcc
can also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? but I could not find how to get the ptx out with nvcc. Which kind of makes sense since it is just plain C + C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/
And I'm not sure how to recompile the modified PTX and use it as I did with clCreateProgramWithBinary
: How to compile PTX code
Using clGetProgramInfo
, an input CL kernel:
__kernel void kmain(__global int *out) {
out[get_global_id(0)]++;
}
gets compiled to some PTX like:
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-21124049
// Cuda compilation tools, release 8.0, V8.0.44
// Based on LLVM 3.4svn
//
.version 5.0
.target sm_20
.address_size 64
// .globl _Z3incPi
.visible .entry _Z3incPi(
.param .u64 _Z3incPi_param_0
)
{
.reg .pred %p<2>;
.reg .b32 %r<4>;
.reg .b64 %rd<5>;
ld.param.u64 %rd1, [_Z3incPi_param_0];
mov.u32 %r1, %ctaid.x;
setp.gt.s32 %p1, %r1, 2;
@%p1 bra BB0_2;
cvta.to.global.u64 %rd2, %rd1;
mul.wide.s32 %rd3, %r1, 4;
add.s64 %rd4, %rd2, %rd3;
ldu.global.u32 %r2, [%rd4];
add.s32 %r3, %r2, 1;
st.global.u32 [%rd4], %r3;
BB0_2:
ret;
}
Then if for example you modify the line:
add.s32 %r3, %r2, 1;
to:
add.s32 %r3, %r2, 2;
and reuse the PTX modified, it actually increments by 2 instead of 1 as expected.