3

So I'm writing a function that takes in a tuple as an argument and does a bunch of stuff to it. Here is what that looks like:

    def swap(self, location):
        if (location[0] < 0 or location[1] < 0 or
                location[0] >= self.r or location[1] >= self.c):
            return False

        self.board[0][0] = self.board[location[0]][location[1]]
        self.board[location[0]][location[1]] = 0
        self.empty = (location[0], location[1])

I'm trying to make my code as efficient as possible, so since I am not modifying the values of location, does it make sense to load the variables in registers (loc0 = location[0]; loc1 = location[1]) for faster computations (zero-cycle read) or is location already loaded into registers by the Python compiler when it's passed in as a function argument?
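For reference, here's the hoisted version I benchmarked in the edit below, wrapped in a minimal stand-in class so it runs on its own (this `Board` harness is a simplified, hypothetical version of my real class):

```python
class Board:
    """Minimal stand-in for my puzzle class (hypothetical harness)."""
    def __init__(self, r, c):
        self.r, self.c = r, c
        self.board = [[i * c + j for j in range(c)] for i in range(r)]
        self.empty = (0, 0)  # empty slot starts at the top-left

    def swap(self, location):
        # Hoist the tuple elements into locals once, instead of
        # evaluating location[0]/location[1] on every use.
        loc0, loc1 = location[0], location[1]
        if loc0 < 0 or loc1 < 0 or loc0 >= self.r or loc1 >= self.c:
            return False
        # Mirrors the hard-coded [0][0] write from the version above.
        self.board[0][0] = self.board[loc0][loc1]
        self.board[loc0][loc1] = 0
        self.empty = (loc0, loc1)
```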

Edit: I bit the bullet and ran some tests. Here are the results (in seconds) for this function running 10 million times with the repeating inputs "up", "down", "left", "right" (cycled in that order):

 Code as is:
   run#1: 19.39
   run#2: 17.18
   run#3: 16.85
   run#4: 16.90
   run#5: 16.74
   run#6: 16.76
   run#7: 16.94

 Code after defining location[0] and location[1] in the beginning of the function:
   run#1: 14.83
   run#2: 14.79
   run#3: 14.88
   run#4: 15.033
   run#5: 14.77
   run#6: 14.94
   run#7: 14.67

That's an average performance improvement of about 16%. Definitely not insignificant in my case. Of course, this isn't scientific; I'd need to do more tests in more environments with more inputs, but it's enough for my simple use case!

Times measured using Python 2.7 on a Macbook Pro (Early 2015), which has a Broadwell i5-5257U CPU (2c4t max turbo 3.1GHz, sustained 2.7GHz, 3MB L3 cache).

IDE: PyCharm Edu 3.5.1 (JRE 1.8.0_112-release-408-b6 x86_64, OpenJDK 64-Bit Server VM).

Unfortunately, this is for a class that grades based on code speed.

JoeVictor
  • 1,806
  • 1
  • 17
  • 38
  • If you do `loc0 = location[0]`, you are setting a variable, not a "register". – BrenBarn Oct 10 '17 at 03:48
  • 2
    Which python JIT-compiler are you using? If you're using an intepreter, it's unlikely that any Python variables will live in registers between different expressions. In C, one benefit of using local variables is when there's possible aliasing. (e.g. if the compiler thinks modifying self.board[0][0] might have also modified location[0], reading it into a local would let the compiler keep it in a register.) IDK if that's a thing in Python. (I'm here because of the [tag:assembly] tag.) – Peter Cordes Oct 10 '17 at 03:52
  • @BrenBarn yes, but compilers optimize defined variables that are used frequently by allocating registers to them. – JoeVictor Oct 10 '17 at 04:17
  • @PeterCordes and that's why I added the tag! It turns out it might actually be a thing in Python (sorta). I ran some time tests and there's an ~18% performance boost when I define local variables for `location[0]` and `location[1]`. While I don't think Python uses aliasing, I'm pretty sure the values were allocated to registers. – JoeVictor Oct 10 '17 at 04:19
  • 1
    @QuantumHoneybees: What compilers? Python doesn't compile anything (not to machine code anyway). – BrenBarn Oct 10 '17 at 04:25
  • 2
    It's probably faster because it avoids the `[]` operator and other overhead (see Bren's answer, which you accepted), not because it actually keeps the value in a register. Unless you're JIT-compiling, it's not going to be worth dedicating a register to hold the value of a Python variable. For that to even be plausible, the interpreter would have to be written in asm, or with `long last_used_variable asm("rdx")` in GNU C or something. It just makes no sense (and I say this as an experienced asm programmer / optimizer who regularly looks at C compiler asm output.) – Peter Cordes Oct 10 '17 at 04:27
  • @QuantumHoneybees: Also, how are you timing "this code"? Your code is referencing external values like `self.board` and `self.r`, so it's not even possible to test just the code you've shown here. – BrenBarn Oct 10 '17 at 04:30
  • For your results, what Python implementation did you use? `python --version` or whatever. And what CPU did you test on? (e.g. Intel Haswell, or just say `i7-6700k` or whatever if you don't know the microarchitecture model name). Different uarches have different relative costs of instruction throughput vs. store-forwarding latency. Also different branch prediction and branch-mispredict penalty. – Peter Cordes Oct 10 '17 at 04:30
  • I used Python 2.7 on a Macbook Pro (Early 2015) 2.7 GHz Intel Core i5 CPU. IDE was: PyCharm Edu 3.5.1 JRE: 1.8.0_112-release-408-b6 x86_64 JVM: OpenJDK 64-Bit Server VM – JoeVictor Oct 10 '17 at 04:32

4 Answers

2

If you're using an interpreter, it's unlikely that any Python variables will live in registers between different expressions. You could look at how the Python source compiles to byte-code.

Python bytecode (the kind stored in `.pyc` files outside the interpreter) is stack-based (http://security.coverity.com/blog/2014/Nov/understanding-python-bytecode.html). This byte-code is then interpreted or JIT-compiled to native machine code. Regular Python (CPython) only interprets, so it's not plausible for it to keep Python variables in machine registers across multiple statements.

An interpreter written in C might keep the top of the bytecode stack in a local variable inside an interpret loop, and the C compiler might keep that C variable in a register. So repeated use of the same Python variable might end up not having too many store/reload round-trips.

Note that store-forwarding latency on your Broadwell CPU is about 4 or 5 clock cycles, nowhere near the hundreds of cycles for a round-trip to DRAM. A store/reload doesn't even have to wait for the store to retire and commit to L1D cache; it's forwarded directly from the store buffer. Related: http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/, http://agner.org/optimize/, and other links in the [x86] tag wiki. Load-use latency is also only 5 clock cycles for an L1D cache hit (latency from the address being ready to the data being ready; you can measure it by pointer-chasing through a linked list, in asm). There's enough interpreter overhead (total number of instructions it runs to figure out what to do next) that this probably isn't even the bottleneck.


Keeping a specific Python variable in a register is not plausible at all for an interpreter. Even if you wrote an interpreter in asm, the fundamental problem is that registers aren't addressable. An x86 `add r14d, eax` instruction has to have both registers hard-coded into the instruction's machine code. (Every other ISA works the same way: register numbers are part of the machine code for the instruction, with no indirection based on any data.) Even if the interpreter did the work to figure out that it needed to "add reg-var #3 to reg-var #2" (i.e. decoding the bytecode stack operations back into register variables for an internal representation that it interprets), it would have to use a different function for every different combination of registers.

Given an integer, the only ways to get the value of the Nth register are branching to an instruction that uses that register, or storing all the registers to memory and indexing the resulting array. (Or maybe some kind of branchless compare and mask stuff).

Anyway, trying to do anything specific about this is not profitable, which is why people just write the interpreter in C and let the C compiler do a (hopefully) good job of optimizing the machine code that will actually run.

Or you write a JIT-compiler like Sun did for Java (the HotSpot VM). IDK if there are any for Python. See Does the Python 3 interpreter have a JIT feature?.

A JIT-compiler does actually turn the Python code into machine code, where register state mostly holds Python variables rather than interpreter data. Again, without a JIT compiler (or ahead-of-time compiler), "keeping variables in registers" is not a thing.


It's probably faster because it avoids the [] operator and other overhead (see Bren's answer, which you accepted)


Footnote: a couple ISAs have memory-mapped registers. e.g. AVR (8-bit RISC microcontrollers), where the chip also has built-in SRAM containing the low range of memory addresses that includes the registers. So you can do an indexed load and get register contents, but you might as well have done that on memory that wasn't holding architectural register contents.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
1

The Python VM only uses a stack to execute its bytecode, and this stack is completely independent of the hardware stack. You can use dis to disassemble your code to see how your changes affect the generated bytecode.
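For example (a sketch, run on CPython 3; exact opcode names vary a little between versions), compare the bytecode for the repeated-subscript spelling against the hoisted-locals spelling:

```python
import dis

def indexed(location):
    # evaluates location[0] and location[1] twice each: four subscript ops
    if location[0] < 0 or location[1] < 0:
        return False
    return (location[0], location[1])

def hoisted(location):
    # subscripts once each; later uses compile to cheap LOAD_FAST ops
    loc0, loc1 = location[0], location[1]
    if loc0 < 0 or loc1 < 0:
        return False
    return (loc0, loc1)

dis.dis(indexed)  # on CPython you'll typically see four subscript instructions
dis.dis(hoisted)  # ...versus two, plus LOAD_FAST for loc0/loc1
```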

Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358
  • That may be true for CPython, but it's more accurate to say that this behavior isn't specified at all. – BrenBarn Oct 10 '17 at 03:52
1

It will be a little faster if you store these two variables:

loc0 = location[0]
loc1 = location[1]

Because there will be only two lookups instead of four.

Btw, if you want to use Python, you shouldn't be worrying about performance at this low a level.

Sraw
  • 18,892
  • 11
  • 54
  • 87
  • 1
    Unfortunately, this is for a class that grades based on code speed. Ridiculous, I know but not much I can do unfortunately. – JoeVictor Oct 10 '17 at 04:16
1

Those kinds of details are not part of the specified behavior of Python. As Ignacio's answer says, CPython does it one way, but that is not guaranteed by the language itself. Python's description of what it does is very far removed from low-level notions like registers, and most of the time it's not useful to worry about how what Python does maps onto those details. Python is a high-level language whose behavior is defined in terms of high-level abstractions, akin to an API.

In any case, doing something like loc0 = location[0] in Python code has nothing to do with setting registers. It's just creating a new Python name pointing to an existing Python object.

That said, there is a performance difference, because if you use location[0] everywhere, the actual lookup will (or at least may -- in theory a smart Python implementation could optimize this) happen again and again every time the expression location[0] is evaluated. But if you do loc0 = location[0] and then use loc0 everywhere, you know the lookup only happens once. In typical situations (e.g., location is a Python list or dict, you're not running this code gazillions of times in a tight loop) this difference will be tiny.
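You can see the (small) difference with `timeit`. A sketch; the absolute numbers depend entirely on your machine and interpreter, and the `>= 10` bound is just a stand-in for the OP's bounds check:

```python
import timeit

setup = "location = (3, 4)"

# Four subscript lookups per evaluation, as in the original bounds check
repeated = ("location[0] < 0 or location[1] < 0 or "
            "location[0] >= 10 or location[1] >= 10")

# Unpack once, then reuse the local names
hoisted = ("loc0, loc1 = location\n"
           "loc0 < 0 or loc1 < 0 or loc0 >= 10 or loc1 >= 10")

t_repeated = timeit.timeit(repeated, setup=setup, number=1_000_000)
t_hoisted = timeit.timeit(hoisted, setup=setup, number=1_000_000)
print("repeated:", t_repeated, "hoisted:", t_hoisted)
```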

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • Is there some cost in assigning to a local variable in most / some interpreters / JIT-compilers? In C you'd get the same asm for `int tmp = a[i]; b[i] = tmp;` as you would for `b[i] = a[i];` (assuming `tmp` isn't used later, and you compile with optimization enabled of course). If there's a cost, it's probably a perf win to assign to a local if you use the value even twice, you're saying? Even if once is a perf loss? (On average, I mean, in a small to medium loop with many iterations. I'm sure you could construct cases where that's not true.) – Peter Cordes Oct 10 '17 at 03:58
  • @PeterCordes: Well, additional bytecodes will be generated to do the assignment. I'd be shocked to see a Python implementation where assigning to a local has a non-negligible cost. I don't know what JIT compilers do; in theory they could optimize some assignments away, but I'm not sure how smart they are about that. In theory it would be a perf win to assign to a local if you use the value more than once, but it's more complicated if you need to modify the value during the computation. The performance advantage is likely to be undetectably tiny in many situations though. – BrenBarn Oct 10 '17 at 04:01
  • 1
    You can measure quite small perf differences with CPU performance counters, like one part in 100 is pretty easy to detect for small asm loops ([although even a 2% slowdown can be hard to explain from looking at the asm...](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all/44193770#44193770)). With careful measurement, you get a signal-to-noise ratio of better than one part in 10k for microbenchmarks. I already got the point that if you really care this much about perf, python is usually the wrong language :P – Peter Cordes Oct 10 '17 at 04:16
  • @PeterCordes: Yeah, but what I mean is that in a real application the performance difference of that one change will be lost among all the other things that might affect performance but aren't under your control. If you're actually doing anything, there's going to be memory allocations, garbage collection, disk reads, etc.; compared to those things, the cost of assigning one local variable is going to be minuscule. – BrenBarn Oct 10 '17 at 04:19
  • @BrenBarn, I ran the code myself with both implementations. Full results are in the edit on my post, but basically there's an almost 18% improvement by declaring location[0] and location[1] locally! – JoeVictor Oct 10 '17 at 04:21
  • @QuantumHoneybees: Not seeing any edit on your post. I would be surprised at that magnitude of (reliable) gain from that change alone. – BrenBarn Oct 10 '17 at 04:25
  • @BrenBarn just edited. I ran more precise calculations and it's closer to 16%, but still though, not insignificant. – JoeVictor Oct 10 '17 at 04:27
  • @BrenBarn it's also important to note that my results aren't scientific in any way. These are just for my specific use case – JoeVictor Oct 10 '17 at 04:28
  • @BrenBarn: Sounds like it can make a significant difference for a CPU-bound tight loop. That's not really surprising. Agreed that most "real" applications aren't like that, and spend a lot of their time on disk or other I/O, and memory bottlenecks. Running a tight loop for a very long time that doesn't bottleneck on memory is a very rare case, outside of HPC number crunching, or other stuff that you would not normally write in Python! – Peter Cordes Oct 10 '17 at 04:35