
As part of a project that generates x86-64 machine code at runtime, I very often need to copy a bit from one register into another register at a different bit position.

I came up with this code (example that copies bit 23 of a source register to bit 3 of a destination):

bt eax, 23           ; test bit 23 of eax (source)
setc ebx             ; copy result to ebx (ebx is guaranteed to be zero before)
shl ebx, 3           ; shift up to become our target bit 3
and ecx, 0xFFFFFFF7  ; remove bit 3 from ecx (target)
or ecx, ebx          ; set new bit value

Given that I need five instructions just to copy one bit of one register to another bit of another register, I wondered: is there something that uses fewer instructions on x86?

I've read a bit on BMI instructions but unfortunately they do not offer bit extraction using immediates.

Lyve
  • Are the bit positions in question fixed? – fuz Mar 01 '22 at 18:52
  • Yes they are immediates – Lyve Mar 01 '22 at 18:57
  • [The `setcc` instruction](https://pushbx.org/ecm/doc/insref.htm#insSETcc) only allows setting an 8-bit register. Although, if you correct that to read `bl` in your example it will still work as you commented that `ebx` is zero prior to this. – ecm Mar 01 '22 at 19:56

2 Answers


Number of instructions is not a good performance metric. Either code-size in bytes (x86 machine instructions are variable length), or for how existing mainstream CPUs execute them: number of front-end uops (after decoding) and/or latency / throughput are relevant optimization goals, with the important one depending on the surrounding code for something this small. (What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)

rcr / rcl are unfortunately very slow, especially with count != 1.
Four single-uop instructions are much faster, including a BMI2 rorx to copy-and-rotate the bit you want to insert to the right position in a tmp register, then and to isolate it. Or a normal shift/and if you don't need to preserve the input. That's more efficient than anything I've thought of bouncing it through CF.

This is more efficient than xor-zero/bt/setc/shl. It also avoids partial-register false dependencies or stalls: setc ebx doesn't exist, only setc bl (or setc bh). It also means that if you can destroy your input register instead of using a temporary, you don't need anything inefficient like setc bl / movzx ebx, bl which would put the zero-extension on the critical path for latency, and defeat mov-elimination.
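For comparison, the question's approach once the setcc operand is corrected and the zeroing is counted looks like this (untested sketch, just as a baseline):

   xor   ebx, ebx          ; zero the full register first (setcc only writes an 8-bit reg)
   bt    eax, 23           ; CF = bit 23 of EAX
   setc  bl                ; BL = CF
   shl   ebx, 3            ; move the bit up to position 3
   and   ecx, 0xFFFFFFF7   ; clear bit 3 in the destination
   or    ecx, ebx          ; merge
; 6 instructions, vs. 4 below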

I switched to EDX as a temporary because it's call-clobbered in normal calling conventions.

; input in EAX, merge-target in ECX (read/write)
; no pre-conditions necessary
;  unlike the original, which doesn't count the cost of zeroing EBX

   rorx  edx, eax, 23-3       ; shift bit 23 to bit 3.  Use SHR if you don't need the EAX value later
   and   edx, 1<<3            ; and isolate
   and   ecx, 0xFFFFFFF7      ; clear it in the destination
   or    ecx, edx             ; and merge

; total size: 14 bytes of machine code for imm8 masks, 20 for imm32 masks
; 4 uops.

The or could instead be an add or lea if you want because we know there are no overlapping 1 bits. lea being useful as a copy-and-merge in case you want the result in a different register. But if you want that, you'd just or into the temp reg instead of ECX, and you can choose any temp reg you want, including EAX (in which case you could optimize rorx to shr.) add would be useful in case you want to branch on FLAGS from it, since it can macro-fuse with some forms of jcc on Sandybridge-family. xor would also work but has no advantages, and isn't idiomatic for human readers.

lea can be useful to allow just a shr instead of a longer rorx if you don't mind destroying the input. 2-register LEA is 1 byte longer than add or or, but is fast on most CPUs when there's no shift count (AMD) and also no constant displacement. (It can't run on as many ports on Intel before Ice Lake, so use add or or if all else is equal, i.e. when you can't save any insns or latency or whatever with lea.) clang makes good use of it with -O3 (just tune=generic, no -march): https://godbolt.org/z/Kbbh6zs4W ; it also generates the same SHR/AND/AND/LEA with -march=haswell. (I'd guess it wouldn't think of using rorx to preserve that input even when compiling code that did need it later.)
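The SHR/AND/AND/LEA shape is something like this (a sketch with my own register choices, assuming SysV-style argument registers with the input in EDI and the merge-target in ESI, result in EAX; untested):

   shr   edi, 23-3            ; destroys the input: bit 23 -> bit 3
   and   edi, 1<<3            ; isolate it
   and   esi, 0xFFFFFFF7      ; clear bit 3 in the merge-target
   lea   eax, [rdi + rsi]     ; copy-and-merge into a third register (no overlapping set bits, so add == or)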

These are all single-uop instructions on Intel and AMD, and your destination bit-position was low enough that both AND masks can fit in a sign-extended-imm8 so the AND instructions are 3 bytes each. (Instead of 6 for and r/m32, imm32). rorx is 6 bytes, though, with a VEX prefix and imm8. The total size is 14 bytes, or 20 bytes if the destination bit is outside the low 7. (Or low 8 if you use byte operand-size like and dl, 0x80 / ... / or cl, dl which causes partial-register problems on P6-family but is fine elsewhere.)

(The instructions used in the question are also single-uop, including bt. On AMD CPUs bts and so on are 2 uops, but bt is just 1.)

With a higher destination bit-number, you could save size using btr ecx, 30 (4 bytes, still 1 uop on Intel) instead of and ecx, ~(1<<30) (6 bytes, or 5 bytes into EAX). But that costs an extra uop on AMD.
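For example, with destination bit 30 (my arbitrary choice of example, untested):

   rorx  edx, eax, 32-(30-23)   ; rotate bit 23 up to bit 30 (count = 25)
   and   edx, 1<<30             ; isolate it (needs an imm32 now)
   btr   ecx, 30                ; clear bit 30 in the destination: 4 bytes, 1 uop on Intel
   or    ecx, edx               ; merge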

Of course if you care about code-size, you'd mov edx, eax / ror edx, 23-3 (5 bytes total) instead of using rorx (6 bytes). So that's a total of 17 bytes with a high destination bitpos. Or 15 if we can destroy EAX.
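For the original bit positions, that size-optimized variant would look something like this (untested sketch):

   mov   edx, eax          ; 2 bytes (skip this and shift EAX directly if it's dead afterwards)
   ror   edx, 23-3         ; 3 bytes: bit 23 -> bit 3
   and   edx, 1<<3         ; isolate
   and   ecx, 0xFFFFFFF7   ; clear bit 3 in the destination
   or    ecx, edx          ; merge
; 13 bytes total with imm8 masks, one byte smaller than the rorx version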

If the bit positions were runtime-variables, this would be less efficient, needing a variable count shift. (And some subtraction or something to generate shift counts.) A different strategy might be better there.
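For concreteness, a variable-shift version might look like this (untested sketch; register assignments are my own: ESI = source bit index, EDI = destination bit index, EAX = input, ECX = merge-target):

   shrx  edx, eax, esi    ; BMI2 variable shift: source bit down to bit 0
   and   edx, 1           ; isolate it
   shlx  edx, edx, edi    ; move it up to the destination position
   mov   r8d, 1
   shlx  r8d, r8d, edi    ; build the destination mask at runtime
   andn  ecx, r8d, ecx    ; BMI1: ECX = ~mask & ECX, clearing the destination bit
   or    ecx, edx         ; merge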


Another way to exchange bits between registers is with masked XORs, but that's not more efficient here where we don't want to swap, just go one way. And we can use the inverted mask as an immediate. (Or if in a register, with BMI1 andn.)
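For reference, the masked-XOR merge looks something like this once the bits are already aligned, e.g. copying bit 3 of a temp in EDX into bit 3 of ECX (untested sketch):

   xor   edx, ecx        ; EDX = bitwise difference of the two registers (destroys the temp)
   and   edx, 1<<3       ; keep only the bit position we care about
   xor   ecx, edx        ; flip ECX's bit 3 only where it differed, i.e. copy it from the temp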


The main problem is that x86 lacks a bitfield insert. Extract with shift/and is easy enough, although that's still 2 instructions unless you have AMD TBM for the immediate form of bextr, using the XOP encoding. (Bulldozer-family only.) Or use pext if you already have a constant in a register. It would be neat if there was an inverse of bt that deposited CF at the specified position, but there unfortunately isn't.
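The pext extraction mentioned above would be something like this (untested; you'd still need to shift/merge the result into the destination afterwards):

   mov   edx, 1<<23       ; mask selecting the source bit
   pext  edx, eax, edx    ; BMI2: extract bit 23 of EAX into bit 0 of EDX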

If there was a bitfield insert instruction, either from CF or from the low bit of another register, you wouldn't need to mask.


Avoiding a temp reg without rcr/rcl - only slightly slower, still 4 uops

@rcgldr shows an interesting trick using rcr/rcl which is good for code size, but unfortunately slow on modern CPUs where rcl and rcr r32, imm are many uops, e.g. 7 uops on Zen3 with 3c throughput, and 8 uops on Sandybridge-family (including Alder Lake) with 6c throughput. (https://uops.info/ / https://agner.org/optimize/)

That's 10 bytes of machine code and doesn't need a temp register. We can duplicate the functionality with only 12 bytes of machine code, but still only 4 single-uop instructions, assuming a modern-enough x86. The above version will be faster on Intel Haswell and earlier.

This has a longer critical-path latency from ECX->result (3 cycles), but same for EAX->result (3 cycles), assuming single-uop adc and rotates. Also more of the uops compete for the shift units, so can't run on as wide a choice of back-end ports. Whether this matters depends on the surrounding code.

It's the same code size even when the destination bit-position is > 8, and for 64-bit mode avoids needing any 8-byte masks.

If you don't have a spare register you can clobber (including the input), this is very likely worth it. Or just for code-size, if this isn't a really hot part of your code.

;; 12 bytes total.  More latency through ECX, and some uops have fewer ports to choose from
   ror   ecx, 3+1         ; 1 uop on Intel HSW and later, and AMD
    ; the bit to be replaced is now at the top of ECX, and in CF
   bt    eax, 23          ; 1 uop
   adc   ecx, ecx         ; 1 uop on Broadwell and later, and AMD.
    ; Shift in the new bit, shifting out the old bit (into CF in case you care)
   rol   ecx, 3           ; 1 uop on HSW and later, and AMD
    ; restore ECX's bits to the right positions, overwriting CF

The initial rotate-right can be rcr or ror; we don't care whether the bit we want to replace is shifted into the top bit temporarily, or only into CF. ror is much faster.

We basically emulate rcl ecx, 3+1 with rcl ecx, 1 and then rol ecx, 3. It differs in FLAGS output I think, but matches in ECX result and how it reads from FLAGS.

And then replace rcl r32, 1 with the equivalent but faster adc same,same; they differ only in the FLAGS output. adc doesn't have any weird partial-flags writing (leaving most of SPAZO unaffected) that makes rotates more expensive on Intel. adc was 2 uops on Intel before Broadwell, but has been 1 uop on AMD for a long time.

This goes through FLAGS using bt so it can easily support a runtime-variable source bit position. For a variable destination bit-position, you'd have to calculate shift counts, and ror reg, cl is slower (3 uops on Intel). Unfortunately there's no variable-count rorx, only shlx / shrx.
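For example, with the source bit index in ESI (my arbitrary register choice, untested):

   ror   ecx, 3+1
   bt    eax, esi         ; bt with a register bit index handles a variable source position
   adc   ecx, ecx
   rol   ecx, 3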

Peter Cordes
  • @rcgldr: Yes; while editing I discovered another optimization to the rotate version and had to test it some, making it also 4 uops on modern x86 (only worse on Haswell and earlier I think). That also required rewriting some of the text about when it's worth using, now that it's only barely if any slower. Mostly just back-end contention and critical-path latency. You could also use ROR instead of RCR, I think, unless I'm mixing up which bits go where. – Peter Cordes Mar 02 '22 at 09:25
  • Amazing answer, outlines many helpful things I wasn't sure about, thanks a lot! – Lyve Mar 02 '22 at 14:09
  • @rcgldr: right, same for mine. Thanks again for noticing that `adc ecx,0` instead of `adc ecx,ecx` didn't make sense. – Peter Cordes Mar 02 '22 at 14:26
  • @rcgldr: On your IvB, my first version, with `rorx` replaced with `mov`/`shr` (or just SHR if you don't need the original) is still 4 uops, but the rotate-based version is more uops. Being good on older Intel is one significant advantage, and why I kept it first in my answer. On Skylake, including your Comet Lake, the 2nd version is likely the same speed for some use-cases, except when critical path latency from ECX -> output matters. (2c vs. 3c) – Peter Cordes Mar 02 '22 at 14:28
  • @PeterCordes What about `lea` instead of `or` for merging? Good idea, bad idea? – njuffa Mar 04 '22 at 09:10
  • 1
    @njuffa: As I said, it's only useful for the fact that it can copy-and-merge, but that's unnecessary. It costs an extra byte to use an indexed addressing mode (2 regs), and it can't run on as many back-end ALU ports on some CPUs. On old in-order Atom CPUs, it runs earlier in the pipeline which has implications for latency forwarding to vs. from it. I didn't want to clutter that paragraph with more `lea` details; I'd hoped most readers could figure out (that I was implying) it wasn't useful, but I think you're saying that isn't obvious enough. – Peter Cordes Mar 04 '22 at 09:29
  • @PeterCordes Since Clang and the latest Intel compilers seem to prefer `lea` in this context, I was wondering about the tradeoffs. They were not obvious to me. `#define SRC_BIT (23) #define DST_BIT (3) uint32_t bit23_to_bit3 (uint32_t a, uint32_t b) { return (a & (1 << SRC_BIT)) >> (SRC_BIT - DST_BIT) | (b & ~(1 << DST_BIT)); }` – njuffa Mar 04 '22 at 10:15
  • @njuffa: Ah, but clang isn't using `rorx` there; I was only talking about the case where you are. But it actually saves code size to use LEA there, and shouldn't hurt latency since that addressing mode is simple enough to not be a slow-LEA on any mainstream CPUs, IIRC. Updated my answer to mention using `shr` instead of `rorx` as a use-case for `lea`. – Peter Cordes Mar 04 '22 at 10:24

Alternative:

        rcr     ecx,3+1
        bt      eax,23
        rcl     ecx,3+1
rcgldr
  • 1
    That's fewer instructions, but unfortunately not faster. RCL/RCR with a count other than 1 is many uops on modern CPUs. For example, 8 uops on Skylake for `rcr/l r32, imm8` or `cl`, with 6 cycle throughput. (https://uops.info/ even tested with an instruction to break the FLAGS dependency, and with multiple dep chains, but still found 6c throughput.) It's less bad on Zen3, but still 7 uops with 3c throughput for each rotate-through-carry separately. Plain rotate, and rcr-by-1, are cheaper so you could gain performance by maybe rotating by 3, then rcr 1/bt/rcl 1, then normal rotate back. – Peter Cordes Mar 02 '22 at 03:48
  • @PeterCordes - what about Comet Lake? – rcgldr Mar 02 '22 at 05:06
  • Comet Lake cores are microarchitecturally the same as Skylake, except for maybe some Spectre / Meltdown mitigation. IDK why uops.info bothers even having separate entries for Kaby Lake and Coffee Lake in its table; maybe just to experimentally prove they don't differ from SKL at all? Or maybe there's some instruction that's different which I don't know about. Agner Fog also added a Coffee Lake section to his instruction tables, but again IDK why; his microarch guide groups it with SKL and doesn't mention any diffs. – Peter Cordes Mar 02 '22 at 05:13