resources where I could read about this type of thing
See Agner Fog's microarch pdf, and his optimizing assembly guide. Also other links in the x86 tag wiki (e.g. Intel's optimization manual).
The interesting option you didn't mention is:
mov %rbx, %rcx
imul %rbx, %rbx # doesn'y have to wait for mov to execute
# old value of %rbx is still available in %rcx
If the imul is on the critical path, and mov has non-zero latency (like on AMD CPUs, and Intel before IvyBridge), this is potentially better. The result of imul will be ready one cycle earlier, because has no dependency on the result of the mov.
If, however, the old value is on the critical path and the squared value isn't, then this is worse because it adds a mov to the critical path.
Of course, it also means you have to keep track of the fact that your old variable is now live in a different register, and the old register has the squared value. If this is a problem in a loop, unroll it so you can end up with things where the top of the loop is expecting them. If you wanted this to be easy, you'd use a compiler instead of optimizing asm by hand.
However, Intel P6-family CPUs (PPro/PII to Nehalem) have limited register-read ports, so it can be better to favour reading registers that you just wrote. If the %rbx wasn't written in the last couple cycles, it will have to be read from the permanent register file when the mov and imul uops go through the rename&issue stage (the RAT).
If they don't issue as part of the same group of 4, then they would each need to read %rbx separately. Since the register file in Core2/Nehalem only has 3 read ports, issue groups (quartets, as Agner Fog calls them) stall until all their not-recently-written input register values are read from the register file (at 3 per cycle, or 2 on Core2 is none of the 3 regs are index regs in an addressing mode).
For the full details, see Agner Fog's microarch pdf section 8.8. The Core2 section refers back to the PPro section. PPro has a 3-wide pipeline, so in that section Agner talks about triplets, not quartets.
If mov and imul issue together, they both share the same read of %rbx. There's a 3 in 4 chance of this happening on Core2/Nehalem.
Choosing just between the sequences you mention the first one has a clear (but usually small) advantage over the second for Intel P6-family CPUs. There's no difference for other CPUs, AFAIK, so the choice is obvious.
mov %rbx, %rcx
imul %rcx, %rcx # uses only the recently-written rcx; can't contribute to register-read stalls
worst of both worlds:
mov %rbx, %rcx
imul %rbx, %rcx # can't execute until after the mov, but still reads a potentially-old register
If you're going to depend on a recently-written register, you might as well use only recently-written registers.
Intel Sandybridge-family uses a physical register file (like AMD Bulldozer-family), and doesn't have register-read stalls.
Ivybridge (2nd gen Sandybridge) and later also handle mov reg,reg at register rename time, with zero latency and no execution unit. This means it doesn't matter whether you imul rbx or rcx as far as critical path length.
However, AMD Bulldozer-family can only handle xmm register moves in its rename stage; integer register moves still have 1c latency.
It's potentially still worth caring about which dependency chain the mov is part of, if latency is a limiting factor in the cycles per iteration of a loop.
how to benchmark this
I think you could put together a microbenchmark that has a register read stall on Core2 with imul %rbx, %rcx, but not with imul %rcx, %rcx. However, that would require some trial and error to get the mov and imul to issue in different groups, and unless you're feeling really creative, probably some artificial-looking surrounding code that exists only to read lots of registers. (e.g. lea (%rsi, %rdi, 1), %eax, or even add (%rsi, %rdi, 1), %eax (which has to read all three registers, and does micro-fuse on core2/nehalem so it only takes 1 uop slot in an issue group. (It doesn't micro-fuse on SnB-family)).