1

I am trying to load and store YMM registers using gcc inline asm. I use vmovdqa for doing this.

For storing __m256i to a particular YMM register (say YMM10), I use the following code

__m256i addr;
//load value to addr
asm ("vmovdqa %0,%%ymm10\n\t"
            :
            : "x" (addr)
            :);

And for loading a value from YMM10 to a variable, I use the following code

__m256i readbuff;
asm ("vmovdqa %%ymm10,%0\n\t"\
            : "=x" (readbuff)\
            :\
            :);

The problem I am facing here is that after I load YMM10 with a value, I use only one half of the register loaded with value. I mean only 128 bits are loaded and other half is all zeroes.

Am I doing anything wrong? I am not sure what instruction to use - vmovdqa, vmovaps, vmovups. Please advice me on this.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
krishnan
  • 33
  • 6
  • 5
    This sort of inline assembly usage is incorrect and cannot be expected to work. You cannot assume that values you assign to random registers are still going to be there in the next `asm` statement and neither are you allowed to overwrite random registers without placing the register in a clobber list. A better solution to your [underlying problem](https://xyproblem.info) could perhaps be found if you presented some more context. – fuz Dec 09 '20 at 16:01

1 Answers1

5

The overall design of what you seem to be trying to do with inline asm is broken. This is not how inline asm works. This is probably an X-Y problem; there's something you want your code to do, and you've picked a non-viable approach.

I mean only 128 bits are loaded and other half is all zeroes.

Sounds like GCC did a veroupper somewhere, probably at a function call boundary, between your asm statements. You didn't tell GCC that YMM10 was an output you expected to read later. (Kind of similar to how GCC doesn't push registers around my inline asm function call even though I have clobbers is using inline asm incorrectly). In this case GCC stepped on your data; in other cases you could destroy some data GCC had put there and was going to read again later.

You could tell GCC about the data coming out of your asm statement with another __m256i variable, perhaps a register __m256i ymm10 asm("ymm10") if you really want to convince the compiler to make worse asm instead of just letting it keep __m256i variables in registers like it normally does.

But seriously don't. You can look at GCC's asm output with gcc -S foo.c -o- | less or whatever. (Don't forget the usual -O3 -march=native or whatever). How to remove "noise" from GCC/clang assembly output? Using your own vmovdqa instructions on some of the YMM registers, while GCC uses other YMM registers for its own purposes, is just going to make worse asm. https://gcc.gnu.org/wiki/DontUseInlineAsm

See also https://stackoverflow.com/tags/inline-assembly/info for guides and docs that explain how to user GNU C inline asm correctly. (But you probably don't need asm at all.)


Note that "x" (addr) requires GCC to already have __m256i addr in another YMM register, so it's not even "loading" from memory, it's just copying YMM registers after GCC already loaded it from memory if necessary. That's why I said so strongly that this is pointless.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847