I am currently trying to debug a very large application with many different modules, some written in C, and some in Python. It uses both multithreading and CUDA. It is running on a modern Intel processor under Linux.
Currently I have a test case that runs in a loop for about an hour and then dies with an assertion failure. The stack trace shows that I am calling g_signal_disconnect(obj, sig) with a valid value for sig, but that g_signal_disconnect sees a nonsensical value for sig. It appears that something changes the %rsi register holding the sig value between the registers being set up for the call and the call itself. That is, the caller's stack frame shows the correct value for sig in both the local variable and the register, but the callee sees a large random number instead. My guess is that another task runs, or an external interrupt fires, and causes the corruption, but that is purely a guess.
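One thing I am considering, since I can recompile the module, is a shim that freezes the argument in memory just before the call and re-checks it first thing on entry, so a register-only corruption shows up as a mismatch at the instant it happens rather than as a crash later. A minimal sketch (last_sig_sent, real_disconnect, and checked_disconnect are all placeholder names of my own, not the real API):

```c
#include <stdio.h>
#include <stdlib.h>

/* The snapshot lives in memory, so it survives even if the register
   copy of the argument gets smashed in the call window. */
static volatile unsigned long last_sig_sent;

static void real_disconnect(void *obj, unsigned long sig) {
    (void)obj;
    if (sig != last_sig_sent) {   /* register vs. memory mismatch */
        fprintf(stderr, "smashed: callee saw %lu, caller sent %lu\n",
                sig, last_sig_sent);
        abort();                  /* die here, while the window is hot */
    }
    /* ... the real disconnect work would go here ... */
}

static void checked_disconnect(void *obj, unsigned long sig) {
    last_sig_sent = sig;          /* snapshot taken just before the call */
    real_disconnect(obj, sig);
}
```

If the abort ever fires, the core dump is taken at the moment of the mismatch, which should make the surrounding thread and signal state easier to inspect than the eventual downstream assertion.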
This bug is consistent in that it's always this particular call that gets smashed, but it only strikes randomly, once in thousands (hundreds of thousands?) of executions of that call. It also doesn't seem to matter whether I run natively, under gdb, or under valgrind: it still happens.
Because it's a register being changed, I can't get gdb to set a watchpoint on it to see what is changing it. Nor can gdb run code in reverse in a multithreaded environment.
Because it's a CUDA application, I cannot use rr-debugger to record the exact stream of instructions that causes the issue.
And although I can run the program under valgrind and get some results, it only tells me that the sig value is undefined when I go to use it, not what made it undefined. Nor does valgrind show any memory or threading errors that might reasonably be the culprit.
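On the valgrind front, one thing I have thought about trying is memcheck's client-request macros: dropping VALGRIND_CHECK_MEM_IS_DEFINED probes at each step of the caller's argument handling, combined with --track-origins=yes, should make memcheck complain at the first probe where sig is already undefined and say where the undefinedness originated, instead of waiting for the use inside the callee. A sketch (HAVE_VALGRIND is a hypothetical flag of my own build, not something valgrind provides; the macro compiles away when the headers are absent):

```c
/* Probe compiles to nothing unless built with the valgrind headers. */
#if defined(HAVE_VALGRIND)
#  include <valgrind/memcheck.h>
#  define CHECK_DEFINED(ptr) VALGRIND_CHECK_MEM_IS_DEFINED((ptr), sizeof *(ptr))
#else
#  define CHECK_DEFINED(ptr) ((void)(ptr))
#endif

/* Probe to drop at each step of the caller's argument setup; under
   valgrind --track-origins=yes, the first probe that complains marks
   the point where sig is already undefined. */
static void probe_sig(unsigned long *sig_p) {
    CHECK_DEFINED(sig_p);
}
```

One caveat I can see: taking the address of the local forces it into memory, which might perturb the codegen enough to hide the bug, but it's cheap to try.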
Now, I do have full access to the source code of the module in which the bug happens, so I can instrument it any way that makes sense, or recompile it as long as those compilation options stay compatible with the rest of the Linux stack it runs on. So there may be something I can do, but I don't know what.
Just finding some way to know which tasks run and/or which interrupts occur during the register-smashing window would go a long way toward narrowing things down, but I don't know how to obtain that information either.
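The closest I have come to an idea here is to at least detect whether a context switch lands inside the window, by bracketing the call with the per-thread context-switch counters from getrusage(RUSAGE_THREAD) (Linux-specific). If the corruption only ever coincides with the counters ticking over, that would point strongly at preemption. The helper names below are my own:

```c
#define _GNU_SOURCE            /* for RUSAGE_THREAD (Linux-specific) */
#include <assert.h>
#include <sys/resource.h>

/* nvcsw counts voluntary context switches, nivcsw involuntary ones
   (i.e. preemption), both for the calling thread only. */
typedef struct { long nvcsw, nivcsw; } switch_counts;

static switch_counts read_switches(void) {
    struct rusage ru;
    int rc = getrusage(RUSAGE_THREAD, &ru);
    assert(rc == 0);
    return (switch_counts){ ru.ru_nvcsw, ru.ru_nivcsw };
}

/* Intended usage around the suspect call:
 *
 *   switch_counts before = read_switches();
 *   g_signal_disconnect(obj, sig);
 *   switch_counts after  = read_switches();
 *   if (after.nivcsw != before.nivcsw)
 *       ...the thread was preempted inside the window...
 */
```

Of course this only tells me that a switch happened, not who ran, but it would let me log the rare corrupting iterations and correlate them with preemption before reaching for heavier tracing.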
Does anyone know of any tools, tips, techniques, or whatnot that will allow me to catch the register-smasher in the act? Once I know what routine is to blame, it should be possible to fix it.