Timing speed of math operations on Arduino - anomaly

Question

I was doing some research into integer division on the Arduino Uno, and trying to determine the fastest methods of solving division.

In readings, I came across this post.

I grabbed and modified the sketch that Nick Gammon posted in reply #4, the ultimate aim being to compile a list of the computations that I need to perform, with their timings.

Here is my modified sketch:

#define DATATYPE byte
#define OPERATION i / j

volatile DATATYPE x=0;
volatile DATATYPE y=0;
//volatile DATATYPE j=7;
unsigned long time;
unsigned long loop_time;

void setup ()
{
    Serial.begin(115200);
    Serial.println ();

    time = micros();
    for (DATATYPE i = 1; i < 101; i++) {
        //for (DATATYPE j = 1; j < 101; j++) {
        for (DATATYPE j = 56; j < 57; j++) {
            y++;
        } // end of for j
    }  // end of for i
    loop_time = micros() - time;
    Serial.print("Loop baseline Completion time: ");
    Serial.print(loop_time);
    Serial.print(" ");

    time = micros();
    for (DATATYPE i = 1; i < 101; i++) {
        //for (DATATYPE j = 1; j < 101; j++) {
        for (DATATYPE j = 56; j < 57; j++) {
          y++;
            x = OPERATION;
        } // end of for j
    }  // end of for i
    time = micros() - time;
    Serial.print("Completion time: ");
    Serial.print(time);
    Serial.print(" Operation time: ");
    time = time - loop_time;
    Serial.println(time);
} // end of setup

void loop() {}

The output of this is:

Loop baseline Completion time: 52 Completion time: 644 Operation time: 592

This essentially gives a time of 5usec for each division operation using bytes.

However, changing two loops from :

for (DATATYPE j = 56; j < 57; j++) {

to

for (DATATYPE j = 57; j < 58; j++) {

gives the following output:

Loop baseline Completion time: 52 Completion time: 108 Operation time: 56

I have tried a few different vales for the inner loop (7, 56, 58, 3) etc

3 gives a similar result, but 7, 11 and 13 do not

Can anyone tell me why there would be such a big difference in the times taken to perform the operation?

your `OPERATION` macro isn't technically correct [without parentheses](https://stackoverflow.com/q/10820340/995714) — phuclv, Sep 02 '18 at 12:09

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

If the compiler knows what the denominator in a division is, it may be able to compute the division with a multiply (by the inverse of the denominator) and a shift. Since division is really slow, even on CPUs which have divide instructions, this technique, often called reciprocal multiplication, can be a huge win.

Unfortunately, some denominators are easier to optimize than others, particularly when multiplication is restricted to 16 bits. So you're likely to see different behaviour with different constant divisors.

I gather that the compiler is not seeing the better optimisation available for that loop, which is to set a temporary to 0 and use it as the result of the division. When i reaches 56 (or 57 as the case may be), the temporary is incremented. I think this would radically reduce execution time.

Compilation test

Curiosity got the best of me so I installed the current Ubuntu avr-gcc package, which is v5.4.0, and tried some compiles. My program is much simpler than the original:

uint8_t dodiv(uint8_t d) {
  return d / DIVISOR;
}

Then I compiled it with commands like:

avr-gcc -S -Wall -O3 -o - -mmcu=atmega32 -DDIVISOR=57 avrdiv.c

As far as I know, -mmcu=atmega32 is correct for an Arduino Uno.

Here are a few results, restricted to the body of the dodiv function. I don't think it really explains the differential results in OP's test between 56 and 57, but I might be missing something. It certainly shows why 3 and 7 might have very different timings.

/* DIVISOR 3 */     /* DIVISOR 7 */     /* DIVISOR 56 */    /* DIVISOR 57 */
dodiv:              dodiv:              dodiv:              dodiv:
    ldi r25,lo8(-85)    ldi r25,lo8(37)     lsr r24             ldi r25,lo8(9)
    mul r24,r25         mul r24,r25         lsr r24             mul r24,r25
    mov r24,r1          mov r25,r1          lsr r24             mov r24,r1
    clr __zero_reg__    clr __zero_reg__    ldi r25,lo8(37)     clr __zero_reg__
    lsr r24             sub r24,r25         mul r24,r25         lsr r24
    ret                 lsr r24             mov r24,r1          ret
                        add r24,r25         clr __zero_reg__
                        lsr r24             ret                                              
                        lsr r24                                         
                        ret

Here is one explanation of the technique: http://homepage.divms.uiowa.edu/~jones/bcd/divide.html I'll search for a better one tomorrow. — rici, Sep 02 '18 at 05:59
Hi. Just to clarify my own understanding. Your saying that in the case of a denominator of 3 and 57 the compiler was able to use reciprocal (1/3 or 1/57) and multiplied/shifted for all 100 iterations (ie optimised the loop), whereas other denominators that I tried (7,11,13 etc) it used straight division. — darrob, Sep 02 '18 at 08:18
@darrob: I'd have to look at the generated code, which you could do more easily. It's also possible that it does a slower multiply (such as a 32-bit multiply). — rici, Sep 02 '18 at 11:45
note that [AVR gcc 4.6.4 is not yet smart enough](https://godbolt.org/z/Ii1FZ4) to convert from division to multiplication by a multiplicative inverse, so you'll need to do the conversion yourself using the magic number output [from here](http://www.hackersdelight.org/magic.htm) — phuclv, Sep 02 '18 at 12:08
me too. Probably some problem with CE at the moment. You can just change the compiler to AVR gcc and test `uint8_t div57(uint8_t x) { return x/57; }` — phuclv, Sep 02 '18 at 15:16
@phuclv: According to [the gcc 4.7 changelog](https://gcc.gnu.org/gcc-4.7/changes.html), division by a constant was significantly improved in avr-gcc 4.7. I believe the current avr-gcc download is v5.4, so godbolt's version is not necessarily representative. Otoh, [this bug report](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80929) says that the constant division optimisation which was working in v6.3 no longer worked in v7.1 because of a regression. — rici, Sep 02 '18 at 17:39
@rici actually there's already [AVR gcc 8.1.0](https://github.com/mattgodbolt/compiler-explorer/issues/301). I don't have any of the newer versions at hand so I can only test with CE — phuclv, Sep 03 '18 at 08:33
@phuclv: I think the bug report I mentioned above applies to avr-gcc 8.1.0; at least, it's not marked FIXED (or even assigned to anyone). But the 5.4 version is the one that showed up when I followed the download instructions at www.arduino.cc, so I guess it's roughly speaking the supported version. (Or at least there's a reasonable chance that its the version being used by OP, since they're not saying.) — rici, Sep 04 '18 at 00:21

Timing speed of math operations on Arduino - anomaly

1 Answers1

Compilation test