
Some SSE instructions take one scalar input and produce one scalar output, such as `sqrtss`, `rsqrtss`, `rcpss`, ... These instructions don't change the upper bits of the output register, so I believe they have a dependency on the output register.

Is it worth putting an extra xorps to break the dependency when the output register of such an instruction is different from the input register?
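To make the pattern concrete, here is a minimal sketch of what I mean, assuming GCC or Clang inline asm on x86-64 compiled without AVX. The helper name `sqrtss_dep_broken` is made up for illustration. It zeroes the destination with `xorps` before the `sqrtss`, so the result no longer has to wait for whatever last wrote that register.

```c
#include <xmmintrin.h>  /* __m128, _mm_cvtss_f32 */

/* Hypothetical helper (name is an assumption, not from any library):
   scalar sqrt into a separate register, zeroing the destination first
   to break the false dependency on its old contents. */
static inline float sqrtss_dep_broken(float x)
{
    __m128 dst;
    __asm__("xorps  %0, %0\n\t"  /* zeroing idiom: dst stops depending on its old value */
            "sqrtss %1, %0"      /* AT&T order (source, destination); writes only the low 32 bits */
            : "=&x"(dst)         /* early-clobber, so dst is a different register from x */
            : "x"(x));
    return _mm_cvtss_f32(dst);   /* extract the low lane */
}
```

Without the `xorps` line, the `sqrtss` would merge into whatever `dst` held before, which is the dependency I'm asking about.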

Peter Cordes
xiver77
  • Yes, if the false dependency is loop-carried or otherwise causes a problem in the surrounding context. If your source operand was already in a register, you might consider `rsqrtps` to avoid that, if you want to write the result to a separate register so you keep the original around. But for square roots, some CPUs (e.g. Alder Lake E-cores) have twice the throughput for scalar `sqrtsd` vs. packed `sqrtpd`. That seems to apply only to low-power CPUs, not big cores; big cores only break things up in 128-bit chunks as far as getting more throughput out of wide dividers. – Peter Cordes May 23 '22 at 13:40
  • This seems like a duplicate, but were you intending to ask for advice on how to decide when to do it on a case-by-case basis? That could be a separate question from what compilers do, especially if the question linked the Q&A about what clang and GCC do, with clang only breaking false deps inside loops. – Peter Cordes May 23 '22 at 13:45
  • (And BTW, even as far back as Conroe on https://uops.info/, `sqrtss` has the same throughput as `sqrtps` for "big core" CPUs, and also on Zen 1 / Zen 2. It's only Silvermont-family where everything is slower and 128-bit is half the speed of scalar, e.g. Alder-E has 24c `sqrtpd`, 12c `sqrtsd` throughput, partially pipelined so latency is even worse. `sqrtps`/`sqrtss` are twice as fast as their double equivalents on that CPU, so 6c throughput for `sqrtss`. Older low-power cores like Goldmont really skimped on FPU hardware, like 67c throughput for `sqrtpd` / 19c `sqrtss`.) – Peter Cordes May 23 '22 at 13:52

0 Answers