
I am currently trying to debug a very large application with many different modules, some written in C, and some in Python. It uses both multithreading and CUDA. It is running on a modern Intel processor under Linux.

Currently I have a test use case that runs for about an hour in a loop and then segfaults with an assertion failure. The stack trace shows that I am calling g_signal_disconnect(obj, sig) with a valid value for sig, but that g_signal_disconnect is seeing a nonsensical value for sig. It appears that between the registers being set up for the call and the call itself, something changes the %rsi register that holds the sig value. That is, the caller's stack frame shows the correct value for sig both in the local variable and in the register, but the callee sees a large random number instead. I'm guessing some other task runs or an external interrupt occurs and causes the issue, but that is purely a guess.

This bug is consistent in that it's always this particular call that gets smashed, but it only happens once in thousands (hundreds of thousands?) of executions of that call. It also doesn't seem to matter whether I am running natively, under gdb, or under valgrind: it still happens.

Because it's a register being changed, I can't set a gdb watchpoint on it to see what is changing it. Nor can gdb run code in reverse in a multithreaded environment.

Because it's a CUDA application, I cannot use rr-debugger to record the exact stream of instructions that causes the issue.

And although I can run the program under valgrind and get some results, it only tells me that the sig value is undefined when I go to use it, not when something made it undefined. Nor does valgrind show any memory or multitasking errors that might reasonably be the culprit.

Now, I do have full access to the source code of the module in which the bug happens, so I can instrument it any way that makes sense, or recompile it with different options so long as they remain compatible with the rest of the Linux stack it runs on. There may well be something I can do here, but I don't know what.

Just finding some way to know which tasks run and/or which interrupts occur during the register-smashing window would go a long way to narrowing things down, but I don't know how to obtain that info either.

Does anyone know of any tools, tips, techniques, or whatnot that will allow me to catch the register-smasher in the act? Once I know what routine is to blame, it should be possible to fix it.

swestrup
  • Assuming there is no bug in the kernel parts, one scenario that would fit is that the task gets interrupted, the registers are saved on the stack, then corrupted by something, then restored. If this is the case, then the corruption is very limited, or else you would have a destroyed stack. You can try changing the stack layout a bit, by adding volatile local variables for example and see if the symptoms change. If that works, you can attempt to aim the corruption point on an unused dummy variable and put a data breakpoint there to see what overrides it. – ElderBug Oct 04 '22 at 22:08
  • If the corruption seems to follow no matter what you do on the stack, then it is more likely that the corruption comes from the same stack, that is, the interrupting code. But that doesn't sound possible since you said the bug is very localized. Posting the disassembly of the calling function could help, and any additional detail you can find. – ElderBug Oct 04 '22 at 22:11
  • While a signal handler is running, the thread's "normal" register state is in memory on the user stack, and IIRC modification to it will be applied to that thread upon returning from the signal handler. So an out-of-bounds memory access could be modifying it. (Perhaps you took the address of a local var and used it after the function's scope ended, and it happened to bite you when the signal handler context ended up in the same place.) @ElderBug: On a normal context switch, user-space registers are only saved on the kernel stack. User-space doesn't need to have a valid stack. – Peter Cordes Oct 05 '22 at 02:51
  • For an example of how a signal handler is supposed to access that state, see [Linux signal handling. How to get address of interrupted instruction?](https://stackoverflow.com/q/34989829) - the handler gets an extra arg of type `ucontext_t*`, a user-space context. – Peter Cordes Oct 05 '22 at 02:53 (see the sketch just after this comment thread)
  • So, I've tried a bunch of things in the last week. Creating a few volatile variables on the stack in the affected routine has caused a number of changes. I'm still getting the error, but now the data structures that hold the value I need are corrupted. But it looks like they are getting corrupted way before the bug is detected. So, I'm still looking for the ultimate cause, but I think I'm getting closer. – swestrup Oct 17 '22 at 21:56
  • Would a time travel debugger like rr or Undo UDB help? – Sebastian Oct 18 '22 at 03:49
  • RR does not support processes that run CUDA, as it's undocumented. I haven't tried UDB but I suspect it's a similar story. – swestrup Oct 18 '22 at 15:44
  • Just saw you mentioned rr in the question – Sebastian Oct 18 '22 at 16:11
  • If the program is portable to Windows / mingw and the problem still appears, you could also use windbg (which has working time travel in its preview version). – Sebastian Oct 18 '22 at 16:39
  • The code is highly Linux-specific and it would take a major effort to port to Windows. Probably man-months of work. – swestrup Oct 18 '22 at 19:20
  • I want to thank Sebastian for the suggestion of Undo UDB; it's a commercial product but (as I write this) has a free trial. It partially supports CUDA (alas, not sufficiently well for my purposes, but they are improving it all the time). If you need to debug a similar issue on Linux (multitasking + CUDA) it may be a godsend. – swestrup Oct 20 '22 at 14:48
  • Sebastian, put in your UDB suggestion as an answer and I'll mark it correct. – swestrup Oct 20 '22 at 15:28
  • You may mark your own well formulated answer as correct :-) – Sebastian Oct 20 '22 at 16:00
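
Following up on the ucontext_t pointer mentioned in the comments above, here is a minimal sketch of how a signal handler can look at the interrupted thread's saved register state. It assumes x86-64 Linux with glibc (REG_RIP/REG_RSI need _GNU_SOURCE), and fprintf is used only to keep the example short; it is not async-signal-safe:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <ucontext.h>

static void handler(int signo, siginfo_t *info, void *ctx)
  { ucontext_t *uc = ctx;
    (void)signo; (void)info;

    /* The saved user-space registers of the interrupted code; writes
       made here would be restored into the real registers on return. */
    fprintf(stderr, "rip=%#llx rsi=%#llx\n",
            (unsigned long long)uc->uc_mcontext.gregs[REG_RIP],
            (unsigned long long)uc->uc_mcontext.gregs[REG_RSI]);
  }

int main(void)
  { struct sigaction sa = { 0 };

    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGUSR1, &sa, NULL);
    raise(SIGUSR1);
    return 0;
  }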

2 Answers


Okay, thanks to everyone for their help. To address the actual question I asked: this kind of thing is currently best addressed by a debugger that can record and replay multithreaded instruction streams. rr-debugger does that and is open source, but does not support CUDA. Undo UDB is commercial and has partial support for CUDA; currently it's your best bet in similar circumstances (although in my case its CUDA support was insufficient). Both use GDB as their front end for replaying and examining the recording.

Now, as to the actual bug, which has finally been found and fixed: it turned out NOT to be register corruption, it merely looked like it. It was a data race. I'm rather embarrassed to have made this particular mistake, but it is what it is. A rough paraphrase of the code follows:

void signal_setup(...)
  { struct signal_data * data = malloc(sizeof(struct signal_data));

    data->a = ...
    data->b = ...
    /* The callback can fire (from another thread) the moment
       g_signal_connect() returns, before data->sig below is written. */
    data->sig = g_signal_connect(obj, "sig", G_CALLBACK(signal_cb), data, ...);

    ...
  }

void signal_cb( GObject * obj, void * user_data )
  { struct signal_data * data = user_data;

    /* If this runs before signal_setup() has stored the id,
       data->sig holds uninitialized junk. */
    g_signal_disconnect(obj, data->sig);

    ...

    free(data);
  }

It turns out that roughly one time in every 200,000 calls, the signal would be triggered between the call to g_signal_connect and the returned id being stored in data->sig. As a result, the value the callback pulled out of data->sig was random junk, which g_signal_disconnect would (rightly) complain about.

However, because the callback ran in a different thread than the signal_setup routine, signal_setup would complete a few milliseconds later and finish filling in the struct signal_data, so the structure ended up looking correct. The upshot was that by the time I looked at the stack frames in the debugger, the data structure held valid data, but the register that had earlier been read from it was garbage. I thus assumed register corruption in a narrow window.

I didn't find the real bug until I added timestamped logging of each signal setup and each signal callback, and saw a callback arrive before its setup had completed, just before the crash.
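
The logging itself was nothing fancy. A minimal sketch of the kind of helper I mean (the clock choice and message format here are illustrative, not the exact code from the application):

/* Timestamped, thread-tagged logging: enough to see a "signal_cb entered"
   line appear before the matching "signal_setup done" line. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void log_event(const char *what, void *data)
  { struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    fprintf(stderr, "%ld.%09ld tid=%lu %s data=%p\n",
            (long)ts.tv_sec, ts.tv_nsec,
            (unsigned long)pthread_self(), what, data);
  }

/* called as: log_event("signal_setup done", data);
   and:       log_event("signal_cb entered", user_data);   */

And, for completeness, one way to close this particular race. This is only a sketch using GLib's mutex/condition primitives (and GLib's actual g_signal_handler_disconnect); it is not necessarily the exact fix that went into the application, and it glosses over the larger question of who ultimately owns and frees the struct:

#include <glib-object.h>

struct signal_data
  { GMutex lock;
    GCond  ready;
    gulong sig;          /* 0 until signal_setup() has published the id */
    /* ... a, b, ... */
  };

static void signal_cb( GObject * obj, void * user_data )
  { struct signal_data * data = user_data;
    gulong sig;

    /* Wait until signal_setup() has stored the handler id. */
    g_mutex_lock(&data->lock);
    while (data->sig == 0)
        g_cond_wait(&data->ready, &data->lock);
    sig = data->sig;
    g_mutex_unlock(&data->lock);

    g_signal_handler_disconnect(obj, sig);

    /* ... rest of the callback ... */
  }

static void signal_setup( GObject * obj )
  { struct signal_data * data = g_new0(struct signal_data, 1);
    gulong id;

    g_mutex_init(&data->lock);
    g_cond_init(&data->ready);
    /* Fill in everything the callback needs BEFORE connecting. */
    /* data->a = ...; data->b = ...; */

    id = g_signal_connect(obj, "sig", G_CALLBACK(signal_cb), data);

    /* Publish the handler id; the callback may already be waiting on it. */
    g_mutex_lock(&data->lock);
    data->sig = id;
    g_cond_signal(&data->ready);
    g_mutex_unlock(&data->lock);
  }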

swestrup

One other possible approach in a case like this is to use systemtap to monitor such things as task switches and memory changes. As it's fully scriptable, you can be as precise about what you monitor as you like. There is a learning curve to its scripting language, but it's an excellent tool for this kind of complex problem.
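
For example, a short sketch of the sort of script I have in mind. It is not a drop-in solution: it assumes your systemtap installation ships the scheduler tapset with a scheduler.ctxswitch probe, and the binary name ./myapp and the probed function signal_setup are placeholders for your own.

# Log entries into one of our functions, plus every context switch
# involving our process, with microsecond timestamps.
probe process("./myapp").function("signal_setup") {
  printf("%d signal_setup tid=%d\n", gettimeofday_us(), tid())
}

probe scheduler.ctxswitch {
  if (prev_task_name == "myapp" || next_task_name == "myapp")
    printf("%d ctxswitch %s(%d) -> %s(%d)\n", gettimeofday_us(),
           prev_task_name, prev_tid, next_task_name, next_tid)
}

Run it with something like stap -v script.stp while the failing test case loops, and correlate its timestamps with the application's own logs.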

swestrup