
I have a test program:

#include <boost/thread.hpp>

// busy loop that keeps one core fully occupied
void task()
{
   long sum = 0;
   while (true) {
      for (int i = 0; i < 100000; ++i)
         for (int j = 0; j < 10000; ++j)
            sum += j;
   }
}

int main()
{
   boost::thread t(task);
   t.join();
   return 0;
}

and I use perf record -t -e intel_pt// to profile the thread. I see a lot of calls to nmi, roughly two calls every few milliseconds. Is there any way to get rid of this type of interrupt? CentOS 7, kernel 3.10.0, kernel.watchdog = 0.

EDIT, based on a comment from @Zulan: gcc version 4.8.5. Compile line: g++ -I//include/ -L//lib -lboost_system -lboost_thread test.cpp. Processor: Intel Platinum 8xxx/Gold 6xxx series.
perf script output snippet:

       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff816ac7f5 nmi ([kernel.kallsyms]) => ffffffff816ad500 do_nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff816ad54f do_nmi ([kernel.kallsyms]) => ffffffff816ad5f3 do_nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff816ad5fa do_nmi ([kernel.kallsyms]) => ffffffff8103dc10 is_debug_stack ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff8103dc62 is_debug_stack ([kernel.kallsyms]) => ffffffff816ad5ff do_nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff816ad60e do_nmi ([kernel.kallsyms]) => ffffffff816ad558 do_nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff816ad573 do_nmi ([kernel.kallsyms]) => ffffffff8113a3f0 rcu_nmi_enter ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff8113a41e rcu_nmi_enter ([kernel.kallsyms]) => ffffffff810ba170 __smp_mb__before_atomic ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff810ba17a __smp_mb__before_atomic ([kernel.kallsyms]) => ffffffff8113a423 rcu_nmi_enter ([kernel.kallsyms])
       a.out 25192 [009]  3976.829805:          1     branches:  ffffffff8113a427 rcu_nmi_enter ([kernel.kallsyms]) => ffffffff810ba180 __smp_mb__after_atomic ([kernel.kallsyms])
       ...
       a.out 25192 [009]  3976.829807:          1     branches:  ffffffff8113a4aa rcu_nmi_exit ([kernel.kallsyms]) => ffffffff8113a48a rcu_nmi_exit ([kernel.kallsyms])
       a.out 25192 [009]  3976.829807:          1     branches:  ffffffff8113a48c rcu_nmi_exit ([kernel.kallsyms]) => ffffffff816ad592 do_nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829807:          1     branches:  ffffffff816ad5ed do_nmi ([kernel.kallsyms]) => ffffffff816ad760 do_nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829807:          1     branches:  ffffffff816ad76e do_nmi ([kernel.kallsyms]) => ffffffff816ac7fa nmi ([kernel.kallsyms])
       a.out 25192 [009]  3976.829807:          1     branches:  ffffffff816ac804 nmi ([kernel.kallsyms]) => ffffffff816ac23c restore_args ([kernel.kallsyms])
       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)

another one:

       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff816b6718 irq_work_interrupt ([kernel.kallsyms]) => ffffffff8102f620 smp_irq_work_interrupt ([kernel.kallsyms])
       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff8102f629 smp_irq_work_interrupt ([kernel.kallsyms]) => ffffffff81090d20 irq_enter ([kernel.kallsyms])
       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff81090d29 irq_enter ([kernel.kallsyms]) => ffffffff8113a360 rcu_irq_enter ([kernel.kallsyms])
       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff8113a3a0 rcu_irq_enter ([kernel.kallsyms]) => ffffffff8113a3b8 rcu_irq_enter ([kernel.kallsyms])
       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff8113a3c0 rcu_irq_enter ([kernel.kallsyms]) => ffffffff81137e10 rcu_eqs_exit_common.isra.31 ([kernel.kallsyms])
       ...
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff8113a27d rcu_irq_exit ([kernel.kallsyms]) => ffffffff8113a260 rcu_irq_exit ([kernel.kallsyms])
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff8113a26e rcu_irq_exit ([kernel.kallsyms]) => ffffffff81090e4e irq_exit ([kernel.kallsyms])
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff81090e50 irq_exit ([kernel.kallsyms]) => ffffffff8102f653 smp_irq_work_interrupt ([kernel.kallsyms])
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff8102f654 smp_irq_work_interrupt ([kernel.kallsyms]) => ffffffff816b671d irq_work_interrupt ([kernel.kallsyms])
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff816b671d irq_work_interrupt ([kernel.kallsyms]) => ffffffff816ac1ed ret_from_intr ([kernel.kallsyms])
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff816ac234 retint_swapgs ([kernel.kallsyms]) => ffffffff816ac23c restore_args ([kernel.kallsyms])
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)

Grepping for irq_return in the perf script output shows about 3 IRQs per 20 ms.
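
The exact filter isn't shown in the question; a pipeline along these lines would produce the list below:

    perf script | grep irq_return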

       a.out 25192 [009]  3976.809174:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           405764 task (a.out)
       a.out 25192 [009]  3976.809178:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           405764 task (a.out)
       a.out 25192 [009]  3976.829808:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)
       a.out 25192 [009]  3976.829809:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)
       a.out 25192 [009]  3976.850317:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           405764 task (a.out)
       a.out 25192 [009]  3976.850393:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           405764 task (a.out)
       a.out 25192 [009]  3976.850395:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           405764 task (a.out)
       a.out 25192 [009]  3976.870945:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           405764 task (a.out)
       a.out 25192 [009]  3976.870984:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)
       a.out 25192 [009]  3976.870985:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)
       a.out 25192 [009]  3976.891571:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)
       a.out 25192 [009]  3976.891609:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)
       a.out 25192 [009]  3976.891611:          1     branches:  ffffffff816ac26c irq_return ([kernel.kallsyms]) =>           40576f task (a.out)

In GRUB, isolcpus, nohz_full, and rcu_nocbs are set for particular CPUs, and nmi_watchdog=0. The test program was run with taskset -c a.out.
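
For illustration, the boot parameters and launch command would look roughly like this, assuming CPU 9 (the CPU shown in the perf output above) is the isolated core; the actual CPU list isn't given in the question:

    # appended to the kernel command line in GRUB
    isolcpus=9 nohz_full=9 rcu_nocbs=9 nmi_watchdog=0
    # pin the test program to the isolated CPU
    taskset -c 9 ./a.out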

surfcode
  • Please provide 1) compiler version 2) compilation command 3) snippet from the actual `perf script` output 4) processor specification 5) comparison with a non-threaded version of this code. – Zulan Jun 13 '19 at 10:49
  • What hardware / motherboard? Do you get the same thing on different HW running similar software (if you have other machines running such an old kernel)? – Peter Cordes Jun 13 '19 at 15:04
  • NMI **is** the mechanism used by `perf`. Not using `perf` would be a solution. – 0andriy Jun 14 '19 at 05:11
  • @0andriy Are you saying that these are inserted by the perf process, not the program itself? How does that work -- a.out is running on the taskset core, while perf record is running on another. Could you clarify? Thanks. – surfcode Jun 14 '19 at 13:58
  • Not the user-space `perf` process; the hardware PMU reports (to the kernel's perf code) that its buffer is full using an NMI interrupt. Or for non-PEBS events for perf-record type sampling, interrupt every time the event's counter wraps around because there's no buffer. Linux's NMI-watchdog uses up one of the programmable PMU counters, so that's another confirmation that using performance counters can generate NMIs. (There are other ways of using them, like just letting them count and collect the data at the end, like `perf stat` does, but it might still need to collect data periodically.) – Peter Cordes Jun 14 '19 at 16:17
  • @PeterCordes Thanks Peter. My initial goal is to keep the spin loop from having time taken away by kernel calls, or as little as possible, hence the nohz_full setting. So what is that 20 ms fixed timer -- is it in perf, and is it configurable? Thx – surfcode Jun 14 '19 at 23:29
  • How are you able to use intel_pt and access grub on AWS? – BeeOnRope Jun 15 '19 at 04:18
  • @BeeOnRope: [see here](https://aws.amazon.com/about-aws/whats-new/2019/02/introducing-five-new-amazon-ec2-bare-metal-instances/) – surfcode Jun 17 '19 at 00:14

1 Answer


First of all, if you trace the kernel you will get a lot of kernel function calls in the trace, including but not limited to NMI-related ones.

If you want to trace only your user-space program, like the one you posted above, use

perf record -t -e intel_pt//u ...

It's better to consult the documentation: https://elixir.bootlin.com/linux/latest/source/tools/perf/Documentation/intel-pt.txt
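
For example, a user-space-only trace could be recorded and then decoded into branch events roughly like this (a sketch; see the documentation above for the full set of intel_pt and --itrace options):

    # record only user-space Intel PT data for the program
    perf record -e intel_pt//u ./a.out
    # decode the trace, synthesizing branch events
    perf script --itrace=b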

0andriy
  • Those calls, though, are reported to occur on CPU 9, the same CPU as my program, and they take up microsecond(s) at a time. Doesn't that mean time is spent on kernel NMI calls instead of my spin loop? And if so, how can I get rid of NMI interrupts -- my goal is to take interrupts away from the spin loop as much as possible. The first thing I tried was to put nohz_full in the boot command. – surfcode Jun 14 '19 at 23:13
  • @surfcode: you can get rid of them by *not* running your program under `perf record`. They are part of how the kernel collects PMU data. i.e. they are part of the measurement overhead inherent in using `intel_pt`. There might be options to use bigger buffers so they happen less often or something... – Peter Cordes Jun 15 '19 at 00:29
  • OK -- let me see if I understand this correctly. These interrupts come from *perf record*, so they normally don't exist if I don't run the program under *perf record*. Sounds like if I want to check timing/hardware/clock/etc. interrupts after system tuning, I'd be better off using *perf stat* or some other tool. Is that right? – surfcode Jun 15 '19 at 04:03
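
For comparison, the counting mode mentioned by Peter Cordes collects counter totals rather than taking an interrupt per sample; an illustrative invocation (not from this thread) would be:

    perf stat -e cycles,instructions,branches ./a.out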