
I am trying to make the code below faster by keeping two of the variables (the ones we need to reuse) in registers, or anywhere closer than the cache. The code takes the three adjacent elements of an array at position idx and adds them together.

void stencil(double * input, double * output){

    unsigned int idx = 1;
    output[0] = input[0] + input[1];

    for(; idx < SIZE - 1; idx++){
        output[idx] = input[idx-1] + input[idx] + input[idx+1];
    }

    output[idx] = input[idx-1] + input[idx];
}

My implementation looks like this:

void stencil(double * input, double * output){

    unsigned int idx = 0;
    double x , y = 0, z;
    z = input[idx];

    for(; idx < SIZE - 1; idx++){
        x = y;
        y = z;
        z = input[idx + 1];
        output[idx] = x + y + z;
    }

    output[idx] = y + z;
}

The idea is to reuse the values from the previous iteration and make the program faster.

However, the program does not seem to have improved in speed. I am using gcc on an AMD Opteron(tm) Processor 6320 CPU, and I am compiling the code with the following flags: -march=native -O3 -Wall -std=c99.

I tried with and without -march=native; the generated assembly is different, but I cannot get better performance either way. The generated assembly WITHOUT the -march=native flag looks like this:

stencil:
.LFB7:
        .cfi_startproc
        subl    $1, %edx
        movsd   (%rdi), %xmm1
        je      .L4
        movq    %rsi, %rcx
        xorpd   %xmm0, %xmm0
        xorl    %eax, %eax
        jmp     .L3
        .p2align 4,,10
        .p2align 3
.L6:
        movapd  %xmm1, %xmm0
        movapd  %xmm2, %xmm1
.L3:
        addl    $1, %eax
        addsd   %xmm1, %xmm0
        addq    $8, %rcx
        movl    %eax, %r8d
        movsd   (%rdi,%r8,8), %xmm2
        leaq    0(,%r8,8), %r9
        addsd   %xmm2, %xmm0
        movsd   %xmm0, -8(%rcx)
        cmpl    %edx, %eax
        jne     .L6
.L2:
        addsd   %xmm2, %xmm1
        movsd   %xmm1, (%rsi,%r9)
        ret
.L4:
        movapd  %xmm1, %xmm2
        xorl    %r9d, %r9d
        xorpd   %xmm1, %xmm1
        jmp     .L2

And WITH the -march=native flag it looks like this:

stencil:
.LFB20:
        .cfi_startproc
        vmovsd  (%rdi), %xmm1
        vxorpd  %xmm0, %xmm0, %xmm0
        leaq    144(%rdi), %rdx
        leaq    136(%rsi), %rax
        xorl    %ecx, %ecx
        .p2align 4,,10
        .p2align 3
.L2:
        vaddsd  %xmm1, %xmm0, %xmm0
        vmovsd  -136(%rdx), %xmm4
        prefetcht0      (%rdx)
        addl    $8, %ecx
        prefetchw       (%rax)
        addq    $64, %rdx
        addq    $64, %rax
        vaddsd  %xmm1, %xmm4, %xmm1
        vaddsd  %xmm4, %xmm0, %xmm0
        vmovsd  %xmm0, -200(%rax)
        vmovsd  -192(%rdx), %xmm3
        vaddsd  %xmm3, %xmm1, %xmm1
        vaddsd  %xmm3, %xmm4, %xmm4
        vmovsd  %xmm1, -192(%rax)
        vmovsd  -184(%rdx), %xmm2
        vaddsd  %xmm2, %xmm4, %xmm4
        vaddsd  %xmm2, %xmm3, %xmm3
        vmovsd  %xmm4, -184(%rax)
        vmovsd  -176(%rdx), %xmm0
        vaddsd  %xmm0, %xmm3, %xmm3
        vaddsd  %xmm0, %xmm2, %xmm2
        vmovsd  %xmm3, -176(%rax)
        vmovsd  -168(%rdx), %xmm1
        vaddsd  %xmm1, %xmm2, %xmm2
        vaddsd  %xmm1, %xmm0, %xmm0
        vmovsd  %xmm2, -168(%rax)
        vmovsd  -160(%rdx), %xmm2
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm1, %xmm1
        vmovsd  %xmm0, -160(%rax)
        vmovsd  -152(%rdx), %xmm0
        vaddsd  %xmm0, %xmm1, %xmm1
        vaddsd  %xmm0, %xmm2, %xmm2
        vmovsd  %xmm1, -152(%rax)
        vmovsd  -144(%rdx), %xmm1
        vaddsd  %xmm1, %xmm2, %xmm2
        vmovsd  %xmm2, -144(%rax)
        cmpl    $1399999992, %ecx
        jne     .L2
        movabsq $11199999944, %rdx
        movabsq $11199999936, %rcx
        addq    %rdi, %rdx
        addq    %rsi, %rcx
        xorl    %eax, %eax
        jmp     .L3
        .p2align 4,,7
        .p2align 3
.L4:
        vmovaps %xmm2, %xmm1
.L3:
        vaddsd  %xmm0, %xmm1, %xmm0
        vmovsd  (%rdx,%rax), %xmm2
        vaddsd  %xmm2, %xmm0, %xmm0
        vmovsd  %xmm0, (%rcx,%rax)
        addq    $8, %rax
        vmovaps %xmm1, %xmm0
        cmpq    $56, %rax
        jne     .L4
        vaddsd  %xmm2, %xmm1, %xmm1
        movabsq $11199999992, %rax
        vmovsd  %xmm1, (%rsi,%rax)
        ret

Does anyone have any suggestions on how to make GCC keep the variables in registers to make the code faster? Or any other way to make my code more efficient by avoiding repeated cache accesses?

  • @OliverCharlesworth edited – Giacomo Benso Feb 23 '19 at 17:13
  • 1
    Have you tried `restrict`? – Oliver Charlesworth Feb 23 '19 at 17:14
  • Restricting x, y and z? But they should be pointers to use the restrict keyword, right? And that would make the whole thing slower, I think. @OliverCharlesworth – Giacomo Benso Feb 23 '19 at 17:34
  • No, `double *restrict input` and `output`, of course, so the compiler knows the output doesn't overlap with the input, and assignments to `output` don't modify `input[idx-1]`. Unless your function needs to work in-place if the caller passes the same pointer for input and output? But it looks like that doesn't make sense. – Peter Cordes Feb 23 '19 at 18:40
  • What CPU are you tuning for? And it looks like you're actually using clang, based on the unrolling. The asm version without `-march=native` doesn't do any FP math; it just copies. Is that from a different version of the source? – Peter Cordes Feb 23 '19 at 18:47
  • Sorry, my bad, it is not an Intel CPU; I was using that earlier. I am on the university cluster and the CPU is `AMD Opteron(tm) Processor 6320`, using the GCC compiler. I'll edit the question as well. However, I can also use an `Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz` if that can help with the optimisation. @PeterCordes – Giacomo Benso Feb 23 '19 at 19:24
  • Ok that makes more sense, I think I've seen gcc emit SW prefetch instructions when tuning for AMD. The Opteron is a Piledriver, so `gcc -march=bdver2` https://godbolt.org/z/hCJX0n. But what's going on with the asm in the first block? There are no `addsd` instructions in the loop. – Peter Cordes Feb 23 '19 at 19:26
  • Use a newer GCC version. gcc8.3 checks for overlap, then auto-vectorizes with SSE2 or AVX, depending on `-march`. Even in your naive version that doesn't use `restrict`. Ohh, both your asm outputs are for your tweaked version that uses more locals. Yeah, for some reason that defeats auto-vectorization even with gcc8.3 – Peter Cordes Feb 23 '19 at 19:36
  • I re-ran the compiler without `-march=native` and edited the question; IDK why it did not have the add instructions, sorry about that. So what do you think I should do, do you have a solution? Also, should I use the Intel processor instead? @PeterCordes – Giacomo Benso Feb 23 '19 at 19:37

2 Answers


This is a good idea, but compilers will already do this for you if they know it's safe. Use double *restrict output and const double *restrict input to promise the compiler that stores to output[] don't change what will be read from input[].
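
For example, the qualified prototype would look like this (the const isn't required for the no-overlap promise, but documents that the function only reads through input):

void stencil(const double *restrict input, double *restrict output);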

But auto-vectorization with SIMD is an even more important optimization, producing 2 or 4 double results per instruction. GCC and ICC will already do that at -O3, after checking for overlap. (But clang fails to auto-vectorize this, just unrolling with scalar [v]addsd, avoiding unnecessary reloads.)

Unfortunately, your optimized version defeats auto-vectorization! (This is the compiler's fault, i.e. a missed-optimization bug: when it knows the output doesn't overlap the input, rereading the source from memory or not is equivalent.)


It looks like gcc does a pretty good job with the original version, with -O3 -march=native (especially when tuning for Intel, where wider vectors with AVX are worth it). It computes 4 double results in parallel from 3 unaligned loads and 2 vaddpd ymm.

It checks for overlap before using the vectorized loop. You can use double *restrict output and input to promise that the pointers don't overlap so it doesn't need a fallback loop.


L1d cache bandwidth is excellent on modern CPUs; reloading the same data is not a big problem (2 loads per clock). Instruction throughput is more of a problem. Memory-source addsd doesn't cost much more than keeping data in registers.

If vectorizing with 128-bit vectors, it would make sense to keep the in[idx+1 .. idx+2] vector around for use as the in[idx-1 .. idx] vector next iteration. GCC in fact does do this.

But when you're producing 4 results per instruction, none of the 3 input vectors from one iteration are directly useful for the next. Saving some load-port bandwidth with a shuffle to create one of the 3 vectors from a load result would probably be useful, though. I'd try that if I were manually vectorizing with __m256d intrinsics. Or with float and 128-bit __m128 vectors.
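
For reference, here's a minimal sketch of what a straightforward manual __m256d version could look like (3 overlapping unaligned loads and 2 adds per 4 results, mirroring what GCC auto-generates, without the shuffle trick). The function name and scalar tail handling are mine; it assumes SIZE is a macro and needs -mavx or an -march that includes AVX:

#include <immintrin.h>

// Hypothetical manual AVX version: 4 doubles per iteration from 3 overlapping
// unaligned loads, with a scalar cleanup loop for whatever is left at the end.
void stencil_avx(const double *restrict input, double *restrict output)
{
    unsigned int idx = 1;
    output[0] = input[0] + input[1];

    for (; idx + 4 <= SIZE - 1; idx += 4) {
        __m256d left  = _mm256_loadu_pd(&input[idx - 1]);  // in[idx-1 .. idx+2]
        __m256d mid   = _mm256_loadu_pd(&input[idx]);      // in[idx   .. idx+3]
        __m256d right = _mm256_loadu_pd(&input[idx + 1]);  // in[idx+1 .. idx+4]
        _mm256_storeu_pd(&output[idx],
                         _mm256_add_pd(_mm256_add_pd(left, mid), right));
    }
    for (; idx < SIZE - 1; idx++)                          // scalar tail
        output[idx] = input[idx-1] + input[idx] + input[idx+1];
    output[idx] = input[idx-1] + input[idx];
}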


#define SIZE 1000000

void stencil_restrict(double *restrict input, double *restrict output)
{
    int idx = 1;
    output[0] = input[0] + input[1];

    for(; idx < SIZE - 1; idx++){
        output[idx] = input[idx-1] + input[idx] + input[idx+1];
    }

    output[idx] = input[idx-1] + input[idx];
}

compiles to this asm with gcc8.3 -O3 -Wall -std=c99 -march=broadwell -masm=intel, from the Godbolt compiler explorer (-ffast-math is not required in this case, and doesn't make a difference to the inner loop.)

stencil_restrict:
    vmovsd  xmm0, QWORD PTR [rdi]
    vaddsd  xmm0, xmm0, QWORD PTR [rdi+8]
    xor     eax, eax
    vmovsd  QWORD PTR [rsi], xmm0           # first iteration

### Main loop
.L12:
    vmovupd ymm2, YMMWORD PTR [rdi+8+rax]         # idx +0 .. +3
    vaddpd  ymm0, ymm2, YMMWORD PTR [rdi+rax]     # idx -1 .. +2
    vaddpd  ymm0, ymm0, YMMWORD PTR [rdi+16+rax]  # idx +1 .. +4
    vmovupd YMMWORD PTR [rsi+8+rax], ymm0         # store idx +0 .. +3
    add     rax, 32                             # byte offset += 32
    cmp     rax, 7999968
    jne     .L12

  # cleanup of last few elements
    vmovsd  xmm1, QWORD PTR [rdi+7999976]
    vaddsd  xmm0, xmm1, QWORD PTR [rdi+7999968]
    vaddsd  xmm1, xmm1, QWORD PTR [rdi+7999984]
    vunpcklpd       xmm0, xmm0, xmm1
    vaddpd  xmm0, xmm0, XMMWORD PTR [rdi+7999984]
    vmovups XMMWORD PTR [rsi+7999976], xmm0
    vmovsd  xmm0, QWORD PTR [rdi+7999984]
    vaddsd  xmm0, xmm0, QWORD PTR [rdi+7999992]
    vmovsd  QWORD PTR [rsi+7999992], xmm0
    vzeroupper
    ret

Unfortunately gcc is using indexed addressing modes, so the vaddpd instructions with a memory source unlaminate into 2 uops for the front-end on SnB-family (including your Broadwell Xeon E5-2698 v4). See Micro fusion and addressing modes.

    vmovupd ymm2, YMMWORD PTR [rdi+8+rax]         # 1 uop, no micro-fusion
    vaddpd  ymm0, ymm2, YMMWORD PTR [rdi+rax]     # 2 uops.  (micro-fused in decoders/uop cache, unlaminates)
    vaddpd  ymm0, ymm0, YMMWORD PTR [rdi+16+rax]  # 2 uops.  (ditto)
    vmovupd YMMWORD PTR [rsi+8+rax], ymm0         # 1 uop (stays micro-fused, but can't use the port 7 store AGU)
    add     rax, 32                             # 1 uop
    cmp     rax, 7999968                         # 0 uops, macro-fuses with JNE
    jne     .L12                                 # 1 uop

For throughput analysis, see https://agner.org/optimize/ and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

GCC's loop is 8 fused-domain uops for the front-end issue/rename stage to send into the out-of-order back-end. This means the front-end max throughput is 1 iteration per 2 cycles.

[v]addpd in Intel before Skylake can only run on port 1, vs. [v]mulpd or FMA having twice the throughput. (Skylake dropped the dedicated FP add unit, and runs FP add identically to mul and fma.) So that's also a 2 cycle per iteration bottleneck.

We have 3 loads + 1 store, all of which need one of port 2 or 3. (Indexed-addressing-mode stores can't use the dedicated store AGU on port 7.) So that's yet another 2-cycle-per-iteration bottleneck. But not really: unaligned loads that cross cache-line boundaries are more expensive. Experiments show that Intel Skylake (and presumably also Broadwell) replays load uops that are discovered to be cache-line splits, so they run again to get data from the 2nd cache line. See How can I accurately benchmark unaligned access speed on x86_64.

Our data is 8-byte aligned, but we do 32-byte loads evenly distributed over all 8-byte offsets within a 64-byte line. For 5 of those 8 starting offsets there's no cache-line split; for the other 3 there is. So the average cost is really 3 * (8+3)/8 = 4.125 load uops dispatched per iteration. I don't know if store-address uops need to replay. Probably not; it only matters when the data commits from the store buffer to L1d, not for the store-address or store-data uops. (As long as it's not split across a 4k boundary, which will happen with a misaligned output.)

That's assuming any output alignment other than output[1] being 32-byte aligned. The asm stores output[0] outside the loop, and then effectively stores output[i*4 + 1], so every other store will be a cache-line split.

It would be better in this case to reach an alignment boundary for the output array. gcc7 and earlier like to align one of the pointers with a loop prologue, but unfortunately they choose the input where we load from all alignments anyway.

Anyway, GCC's actual bottleneck is port 2 / port 3 throughput. On average 5.125 uops per iteration for those 2 ports = theoretical max average throughput of 1 iteration (4 doubles) per 2.5625 cycles.

Using a non-indexed store would have reduced this bottleneck.
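
One way to nudge the compiler toward that at the source level (just a sketch; whether a given GCC version actually emits a one-register addressing mode for the store here isn't guaranteed) is to walk the pointers themselves instead of indexing both arrays off one counter:

void stencil_ptr(const double *restrict input, double *restrict output)
{
    output[0] = input[0] + input[1];

    const double *in  = input + 1;
    double       *out = output + 1;
    const double *end = input + SIZE - 1;

    for (; in < end; in++, out++)          // both pointers advance by 8 bytes
        *out = in[-1] + in[0] + in[1];

    *out = in[-1] + in[0];                 // last element
}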

But that's ignoring 4k-split penalties, which are ~100 cycles on Broadwell, and assuming perfect HW prefetch which can keep up with ~12.5 bytes / cycle each way (loaded and stored). So more likely this would bottleneck on memory bandwidth unless the data was already hot in L2 cache. L1d can absorb the redundant loads of the same bytes, but there's still significant non-redundant bandwidth.


A bit of unrolling would let out-of-order execution see further ahead and help absorb bubbles from cache misses when HW prefetch doesn't keep up. If it used a non-indexed addressing mode for the store, the store could use port 7, reducing pressure on ports 2/3. That would let loads run ahead of the adds, hopefully absorbing bubbles when a load crosses into a new cache line or page.


Data reuse in registers with 128-bit vectors

The inner loop from gcc8.3 -O3 -Wall -std=c99 -march=broadwell -mno-avx

 # prologue to reach an alignment boundary somewhere?
.L12:
    movupd  xmm2, XMMWORD PTR [rdi+rax]
    movupd  xmm1, XMMWORD PTR [rdi+8+rax]
    addpd   xmm0, xmm2
    addpd   xmm0, xmm1
    movups  XMMWORD PTR [rsi+rax], xmm0
    add     rax, 16
    movapd  xmm0, xmm1                   # x = z
    cmp     rax, 7999992
    jne     .L12

This is a regression vs. gcc7.4, which avoids the register copy. (But gcc7 wastes loop overhead on a counter separate from the array index.)

 # prologue to reach an alignment boundary so one load can be aligned.

# r10=input and r9=input+8  or something like that
# r8=output
.L18:                                       # do {
    movupd  xmm0, XMMWORD PTR [r10+rdx]
    add     ecx, 1
    addpd   xmm0, xmm1                        # x+y
    movapd  xmm1, XMMWORD PTR [r9+rdx]      # z for this iteration, x for next
    addpd   xmm0, xmm1                        # (x+y) + z
    movups  XMMWORD PTR [r8+rdx], xmm0
    add     rdx, 16
    cmp     ecx, r11d
    jb      .L18                            # } while(i < max);

This is still probably slower than AVX 256-bit vectors, on average.

With AVX for 128-bit vectors (e.g. tuning for Piledriver), it could have avoided the separate movupd xmm0 load, and used vaddpd xmm0, xmm1, [r10+rdx].

Both of them fail to use aligned stores, but also can't take advantage of folding a load into a memory operand for addpd after finding a known alignment in input :/
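
In source form, that 2-wide data-reuse pattern looks roughly like this (a sketch with SSE2 intrinsics; the function name and the scalar tail are mine, and it assumes SIZE is a macro):

#include <emmintrin.h>

// Each iteration produces 2 results from only 2 new loads: the in[idx+1 .. idx+2]
// vector is kept in a register and reused as the in[idx-1 .. idx] vector next time.
void stencil_sse2(const double *restrict input, double *restrict output)
{
    unsigned int idx = 1;
    output[0] = input[0] + input[1];

    __m128d left = _mm_loadu_pd(&input[0]);            // in[idx-1 .. idx]
    for (; idx + 2 <= SIZE - 1; idx += 2) {
        __m128d mid   = _mm_loadu_pd(&input[idx]);     // in[idx   .. idx+1]
        __m128d right = _mm_loadu_pd(&input[idx + 1]); // in[idx+1 .. idx+2]
        _mm_storeu_pd(&output[idx],
                      _mm_add_pd(_mm_add_pd(left, mid), right));
        left = right;                                  // reuse next iteration
    }
    for (; idx < SIZE - 1; idx++)                      // scalar tail
        output[idx] = input[idx-1] + input[idx] + input[idx+1];
    output[idx] = input[idx-1] + input[idx];
}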


Actual performance experiments on Skylake show that real performance is fairly close to what I predicted, if data fits in L1d cache.

Fun fact: with static buffers like global double in[SIZE+10];, gcc will make a version of the loop that uses non-indexed addressing modes. This gives a speedup from ~800ms to ~700ms for running it many times in a loop, with SIZE=1000. Will update with more details later.
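
Roughly, that experiment looks something like this (a sketch of the setup, not the exact benchmark code; the repeat count is arbitrary and the timing harness is omitted):

#define SIZE 1000          // small enough that both arrays stay hot in L1d
double in[SIZE + 10];      // static/global buffers, as described above
double out[SIZE + 10];

void stencil_static(void)
{
    unsigned int idx = 1;
    out[0] = in[0] + in[1];
    for (; idx < SIZE - 1; idx++)
        out[idx] = in[idx-1] + in[idx] + in[idx+1];
    out[idx] = in[idx-1] + in[idx];
}

int main(void)
{
    for (int rep = 0; rep < 1000000; rep++) {   // arbitrary repeat count
        stencil_static();
        __asm__ volatile("" ::: "memory");      // compiler barrier: don't hoist/elide the work
    }
    return 0;
}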

Peter Cordes

When using register rotation, it is generally a good idea to unroll the loop. gcc does not do that unless explicitly asked to.

Here is an example with a level 4 loop unrolling.

void stencil(double * input, double * output){

    double x, y, z, w, u, v ;
    x=0.0;
    y=input[0];
    int idx=0;
    for(; idx < SIZE - 5; idx+=4){
      z=input[idx+1];
      w=input[idx+2];
      u=input[idx+3];
      v=input[idx+4];

      output[idx]  =x+y+z;
      output[idx+1]=y+z+w;
      output[idx+2]=z+w+u;
      output[idx+3]=w+u+v;

      x=u;
      y=v;
    }
    z=input[idx+1];
    w=input[idx+2];
    u=input[idx+3];

    output[idx]  =x+y+z;
    output[idx+1]=y+z+w;
    output[idx+2]=z+w+u;
    output[idx+3]=w+u;
}

There is one memory read and one memory write per idx value, and one register copy per two idx values.

It is possible to try different unroll levels, but there are always 2 register copies per iteration, and 4 seems to be a good compromise.

If SIZE is not a multiple of 4, a prologue is required.

void stencil(double * input, double * output){

    double x, y, z, w, u, v ;
    int idx=0;
    int remain=SIZE%4;

    x=0.0; y=input[0];
    switch (remain) {
    case 3: z=input[++idx]; output[idx-1]=x+y+z; x=y; y=z;
    case 2: z=input[++idx]; output[idx-1]=x+y+z; x=y; y=z;
    case 1: z=input[++idx]; output[idx-1]=x+y+z; x=y; y=z;
    }

    for(; idx < SIZE - 5; idx+=4){
      z=input[idx+1];
      ....

As expected, the asm is rather complex and it is difficult to say what the gain will be.

You can also try to use -funroll-loops on your original code. Compilers are very good and may provide a better solution.

Alain Merigot
  • It looks like the OP's compiler did unroll. They're probably on OS X where `gcc` is actually clang/LLVM. – Peter Cordes Feb 23 '19 at 18:49
  • With gcc7.4 https://godbolt.org/z/6Xudt3, this manual-reuse defeats auto-vectorization. No `vaddpd` instructions in the asm for yours or the OP's reuse, but there are some `vaddpd` instructions in the asm for the plain naive version of the function (overlapping unaligned loads). I used `-xc -O3 -march=skylake -ffast-math -Wall -std=c99 -funroll-loops` But there's still auto-vectorization without `-funroll-loops`. I'm still not sure which compiler the OP actually has; compiler-generated `prefetchw` is unusual. Maybe Apple clang does it. – Peter Cordes Feb 23 '19 at 19:05
  • So anyway, your answer is good in theory, and without auto-vectorization or `-funroll-loops`, but in practice either manual vectorization or using the right compiler + options to get the plain version to auto-vectorize looks best. Or maybe something that reuses `y+z` as `x+y` for the next window, if you don't need strict FP. (Not a full sliding window, though; that would create a loop-carried dep chain of add/sub. Just reuse one or two adds in small independent blocks.) – Peter Cordes Feb 23 '19 at 19:07
  • I am now trying on the Intel processor with gcc. Do you have any suggestions on the options to use to auto-vectorise the code? Or otherwise, how would I vectorise the code manually? @PeterCordes – Giacomo Benso Feb 23 '19 at 19:49
  • I tried to improve code generation manually according to the suggestions by Peter Cordes, but did not succeed in getting good vectorization. Manual vectorization requires the use of [intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#), which is somewhat complex. But the code generated with Peter Cordes's options is good. The essential options are `-O3`, `-ffast-math` and `-funroll-loops`: `-O3` is the highest optimization level, `-ffast-math` allows rearranging FP computations, and `-funroll-loops` performs automatic loop unrolling. – Alain Merigot Feb 23 '19 at 20:01
  • I tried your approach @AlainMerigot of unrolling the loops manually and the performance is roughly the same as using -funroll-loops. I still cannot figure out how to make the code faster than the naive version. I would like to achieve some performance improvement but I cannot figure out how. The concept, as I explained, is to place the variables into registers so that there is no need to go to the cache three times. – Giacomo Benso Feb 23 '19 at 20:40
  • @AlainMerigot: -ffast-math turns out not to be necessary. I didn't look for gcc reusing `y+z` as `x+y` next iter scalar, because that's probably worse than SIMD with unaligned loads. See my answer. – Peter Cordes Feb 23 '19 at 20:47