How to create or manipulate GPU assembler?

Question

Does any one have experience in creating/manipulating GPU machine code, possibly at run-time?

I am interested in modifying GPU assembler code, possibly at run time with minimal overhead. Specifically I'm interested in assembler based genetic programming.

I understand ATI has released ISAs for some of their cards, and nvidia recently released a disassembler for CUDA for older cards, but I am not sure if it is possible to modify instructions in memory at runtime or even before hand.

Is this possible? Any related information is welcome.

Do you have a link for the disassembler recently released by nvidia ? All I find is "decuda" which is an independent work; I thought nvidia never released information about the opcodes actually understood by their hardware. — Thomas Pornin, Jan 13 '11 at 13:27
It may be released to registered developers only, although I thought they included it in the latest CUDA release — zenna, Jan 14 '11 at 12:17
cuobjdump just lets you extract *.cubin files or linear disassembly from a host binary file. There is no full reference for FERMI as there is for say, x86. Or can you tell me what flags get set when we perform a subtraction? — avgvstvs, May 01 '14 at 04:42

score 2 · Answer 1 · edited May 23 '17 at 12:02

NVIDIA PTX generation and modification

Not sure how low level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified and reused in a few ways:

OpenCL clGetProgramInfo(program, CL_PROGRAM_BINARIES + clCreateProgramWithBinary: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?

These are standardized OpenCL API's, which produce and consume implementation defined formats, which in driver version 375.39 for Linux happens to be human readable PTX.

So you can dump the PTX, modify it, and reload.
nvcc: can compile CUDA GPU-side code to ptx assembly simply with either:
```
nvcc --ptx a.cu
```
nvcc can also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? but I could not find how to get the ptx out with nvcc. Which kind of makes sense since it is just plain C + C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/

And I'm not sure how to recompile the modified PTX and use it as I did with clCreateProgramWithBinary: How to compile PTX code

Using clGetProgramInfo, an input CL kernel:

__kernel void kmain(__global int *out) {
    out[get_global_id(0)]++;
}

gets compiled to some PTX like:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-21124049
// Cuda compilation tools, release 8.0, V8.0.44
// Based on LLVM 3.4svn
//

.version 5.0
.target sm_20
.address_size 64

    // .globl   _Z3incPi

.visible .entry _Z3incPi(
    .param .u64 _Z3incPi_param_0
)
{
    .reg .pred  %p<2>;
    .reg .b32   %r<4>;
    .reg .b64   %rd<5>;


    ld.param.u64    %rd1, [_Z3incPi_param_0];
    mov.u32     %r1, %ctaid.x;
    setp.gt.s32 %p1, %r1, 2;
    @%p1 bra    BB0_2;

    cvta.to.global.u64  %rd2, %rd1;
    mul.wide.s32    %rd3, %r1, 4;
    add.s64     %rd4, %rd2, %rd3;
    ldu.global.u32  %r2, [%rd4];
    add.s32     %r3, %r2, 1;
    st.global.u32   [%rd4], %r3;

BB0_2:
    ret;
}

Then if for example you modify the line:

add.s32     %r3, %r2, 1;

to:

add.s32     %r3, %r2, 2;

and reuse the PTX modified, it actually increments by 2 instead of 1 as expected.

@talonmies THANK YOU for feedback! Compilation of OpenCL requires the actual C program, just like for CUDA. See: http://stackoverflow.com/questions/13062469/compile-and-build-cl-file-using-nvidias-nvcc-compiler/43298903#43298903 However I was wrong to say that you can extract the `ptx` with `nvcc` for OpenCL, it only works for CUDA (I was testing too many things at the same time). `clGetProgramInfo` was working all along however, exactly as stated. I have updated the answer explaining those points more clearly, and undeleted it. Let me know if you find anything wrong with it. — Ciro Santilli OurBigBook.com, Apr 25 '17 at 18:02

score 2 · Answer 2 · answered Jan 13 '11 at 13:47

In the CUDA driver API, the module management functions allow an application to load at runtime a "module", which is (roughly) a PTX or cubin file. PTX is the intermediate language, while cubin is an already compiled set of instructions. cuModuleLoadData() and cuModuleLoadDataEx() appear to be capable of "loading" the module from a pointer in RAM, which means that no actual file is required.

So your problem seems to be: how to programmatically build a cubin module in RAM ? As far as I know, NVIDIA never released details on the instructions actually understood by their hardware. There is, however, an independent opensource package called decuda which includes "cudasm", a assembler for what the "older" NVIDIA GPU understand ("older" = GeForce 8xxx and 9xxx). I do not know how easy it would be to integrate in a wider application; it is written in Python.

Newer NVIDIA GPU use a distinct instruction set (how much distinct, I do not know), so a cubin for an old GPU ("computing capability 1.x" in NVIDIA/CUDA terminology) may not work on a recent GPU (computing capability 2.x, i.e. "Fermi architecture" such as a GTX 480). Which is why PTX is usually preferred: a given PTX file will be portable across GPU generations.

score 2 · Answer 3 · answered Aug 13 '11 at 16:03

2

An assembler for the NVIDIA Fermi ISA: http://code.google.com/p/asfermi

answered Aug 13 '11 at 16:03

neuron

1,896
1
19
24

score 2 · Answer 4 · answered Aug 15 '11 at 13:28

I've found gpuocelot open-source (BSD Licence) project interesting.

It's "a dynamic compilation framework for PTX". I would call it cpu translator.

"Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs". As far as I know, this framework do control-flow and data-flow analysis on PTX Kernel in order to apply proper transformations.

score 1 · Answer 5 · answered Jan 13 '11 at 08:04

1

OpenCL is done for that purpose. You provide a program as a string and possibly compile it at runtime. See links provided by other poster.

answered Jan 13 '11 at 08:04

kriss

23,497
17
97
116

As far as I know, OpenCL is compiled at installation time first to intermediate language IL (similar to NVidia's PTX) and then properly compiled into machine instructions. It is the machine instructions I am interested in. – zenna Jan 13 '11 at 11:59
No, you can compile OpenCL on the fly from a string like I wrote. – kriss Feb 11 '16 at 14:00

How to create or manipulate GPU assembler?

5 Answers5

Linked