Unconventional Calling

JRoelofs Image

Recently a colleague and I were discussing some of the nuances of the ARM Procedure Call Standard, related to some work I did adding armv4t support to LLVM.

First, let’s take a step back and review what a “calling convention” is, and lay out the ground rules on more modern ARM architectures before jumping into the weeds with armv4t. A calling convention is a low-level contract describing how arguments are passed to functions, how return values are passed back, and how responsibility for saving and restoring registers across a call is divided between caller and callee.

ARM EABI has a particularly straightforward calling convention for C functions that designates (ignoring VFP registers):

  • r0-r3 as caller-saved argument / scratch registers, with results passed in r0-r1
  • r4-r8 as callee-saved registers
  • r9 is either caller or callee saved, depending on the target platform
  • r10-r11 are callee-saved
  • r12-r15 are special-purpose registers (ip, sp, lr, pc)

Callee-saved registers must be preserved by the called function if it modifies them; caller-saved registers must be saved by the calling function if it needs their values after the call. This allocation of a small number of caller-saved registers, with the bulk of the register set callee-saved, favors small functions with fewer live ranges (and therefore less register pressure). It seems to be a good tradeoff: small functions can avoid touching the stack to spill live ranges, and large functions get a big set of registers whose save and restore cost is better amortized over the life of the function.
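To make the register roles above concrete, here is a minimal C sketch that classifies r0-r15 and counts how many registers a callee would have to save for a given set of clobbers. The names `reg_role`, `aapcs_role`, and `callee_save_count` are made up for illustration and do not come from LLVM or any ARM header; r9 is treated as its own category because its role depends on the platform.

```c
#include <assert.h>

/* Roles from the list above (VFP ignored). Hypothetical names, for illustration only. */
enum reg_role { CALLER_SAVED, CALLEE_SAVED, PLATFORM_DEFINED, SPECIAL };

static enum reg_role aapcs_role(unsigned reg)
{
    assert(reg <= 15);                      /* only r0..r15 exist */
    if (reg <= 3)  return CALLER_SAVED;     /* r0-r3: arguments / scratch */
    if (reg == 9)  return PLATFORM_DEFINED; /* r9: depends on the target platform */
    if (reg <= 11) return CALLEE_SAVED;     /* r4-r8 and r10-r11 */
    return SPECIAL;                         /* r12-r15: ip, sp, lr, pc */
}

/* Number of registers the callee must save if it clobbers the
   registers in `mask` (bit n set means rn is clobbered). */
static unsigned callee_save_count(unsigned mask)
{
    unsigned count = 0;
    for (unsigned reg = 0; reg <= 15; ++reg) {
        if (((mask >> reg) & 1u) && aapcs_role(reg) == CALLEE_SAVED)
            ++count;
    }
    return count;
}
```

A function that only touches r0-r3 has nothing to save, which is the "small functions avoid the stack" case described above; one that also clobbers r4 and r5 has two registers to push and pop.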

In practice, a C function like this:

int melanger(int lhs, int rhs) {
    return lhs + rhs;
}

will most likely get lowered to something like:

    add r0, r0, r1
    bx lr

Of note here is that the compiler passed lhs and rhs in r0 and r1 respectively, and returned the result in r0. What’s also interesting is the bx instruction used to jump back to the caller: it allows “interworking” between the two instruction sets, Arm and Thumb, that most 32-bit ARM cores support. At branch time, the core uses the parity (the low bit) of the address held in lr to work out whether it needs to switch between Arm and Thumb to correctly decode the instruction stream. A fun side-effect is that pointers to Thumb functions are always odd, and pointers to Arm functions are always even.
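The parity trick can be sketched in a few lines of C. This is only a model of how a bx target is interpreted, not code from any toolchain; the helper names `targets_thumb` and `fetch_address` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of a bx target selects the instruction set: 1 = Thumb, 0 = Arm. */
static bool targets_thumb(uint32_t target)
{
    return (target & 1u) != 0;
}

/* Address the core actually fetches from after the branch:
   the target with the interworking bit cleared. */
static uint32_t fetch_address(uint32_t target)
{
    /* An Arm-state target must be word-aligned; this model rejects anything else. */
    assert(targets_thumb(target) || (target & 2u) == 0);
    return target & ~1u;
}
```

For example, a function pointer with the value 0x8001 refers to Thumb code that starts at 0x8000, while the value 0x8000 refers to Arm code at that same address.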

The call sites of this function will look something like this in the case of a direct call:

    mov r0, ...
    mov r1, ...
    blx melanger
    mov ..., r0

Indirect calls (i.e. through a function pointer) look a bit different:

    mov r0, #1
    mov r1, #2
    ldr r3, .funptr
    ldr r3, [r3]
    blx r3
    mov ..., r0

And that brings me to one of the first weirditudes of armv4t: there is no blx instruction, so indirect calls must use some other sequence. The most obvious candidate would be mov lr, pc; bx rN, but that has a subtle incompatibility if you attempt to run such code on a newer ARM core: the mov “magically” drops the parity bit, causing the instructions after the bx to be interpreted as Arm instructions once the callee returns… which the whole rest of the toolchain isn’t prepared for.

Instead, we take advantage of the fact that the bl instruction saves off the link register, and create a sort of “branch island” to take care of the interworking bits:

    mov r0, #1
    mov r1, #2
    ldr r3, .funptr
    bl .island
    mov ..., r0

.island:
    bx r3

This avoids the broken sequence, and maximizes compatibility with a bunch of early ARM architecture variants. Iain Sandoe and I fixed that in r223380 if you’d like to take a look at the implementation details.
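To see why the bl-based island keeps the parity bit while mov lr, pc loses it, here is a small C model of the value each one leaves in lr when executed in Thumb state. It assumes the usual Thumb rules (reading pc yields the instruction address plus 4 with bit 0 clear, and bl is a 4-byte encoding that sets bit 0 of lr); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* lr after `mov lr, pc` at Thumb address `insn`:
   pc reads as insn + 4, and bit 0 is NOT set. */
static uint32_t lr_after_mov_lr_pc(uint32_t insn)
{
    assert((insn & 1u) == 0);   /* Thumb instructions are halfword-aligned */
    return insn + 4u;
}

/* lr after a Thumb `bl` at `insn`: address of the next
   instruction with bit 0 set to record Thumb state. */
static uint32_t lr_after_bl(uint32_t insn)
{
    assert((insn & 1u) == 0);
    return (insn + 4u) | 1u;
}

/* What state a callee's `bx lr` will return to. */
static bool returns_to_thumb(uint32_t lr)
{
    return (lr & 1u) != 0;
}
```

In this model both sequences return to the right address, but only the bl variant tells the callee's bx lr to stay in Thumb state.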

Another place that armv4t gets weird is during function epilogues, which would normally leverage pop {..., pc} to restore various callee-saved registers and jump back to the caller in one instruction. Unfortunately on armv4t, loading pc through pop ignores the parity bit and never switches instruction sets, so interworking through it is not possible. Instead, we have to return through bx like so:

    pop {r3}
    add sp, #offset      # pop off whatever other stack we used
    bx r3

Things get even more complicated if the function needs to use r3 as part of the returned value, since that prevents us from using that register to help with interworking. It is further complicated by the fact that there aren’t many other registers we can use for this. Any callee-saved register is obviously out of the question, since the scratch register has to stay live after the callee-saved registers have been restored. That limits us to a caller-saved one, or one of the special-purpose registers mentioned earlier. The best candidate is therefore the ip register, i.e. r12, which is reserved for the linker as a scratch register for use between function calls, and in that case we lower the epilogue as:

    mov ip, r3           # shuffle part of return value into ip
    pop {r3}             # lr from the call site
    add sp, #offset      # pop off whatever other stack we used
    mov lr, r3           # shuffle return address into lr
    mov r3, ip           # restore r3 as part of return value
    bx lr                # return to caller

For those curious about the implementation details, I implemented this workaround in r214881.
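As a sanity check on the six-instruction epilogue, the register shuffle can be replayed on a toy machine state in C. This is a sketch rather than anything from the LLVM patch: `struct regs`, `epilogue_preserving_r3`, and the `extra_words` parameter (standing in for #offset divided by 4) are all hypothetical, and the mode switch performed by bx is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy machine state: only the pieces the epilogue touches. */
struct regs {
    uint32_t r3, ip, lr, pc;
    const uint32_t *sp;     /* points at the saved return address */
};

/* Replays the epilogue one instruction per line. */
static void epilogue_preserving_r3(struct regs *m, unsigned extra_words)
{
    assert(m->sp != 0);
    m->ip = m->r3;          /* mov ip, r3      : park part of the return value */
    m->r3 = *m->sp++;       /* pop {r3}        : saved lr from the call site */
    m->sp += extra_words;   /* add sp, #offset : drop the rest of the frame */
    m->lr = m->r3;          /* mov lr, r3      : return address into lr */
    m->r3 = m->ip;          /* mov r3, ip      : restore the return value */
    m->pc = m->lr;          /* bx lr           : return to the caller */
}
```

After the sequence runs, r3 holds its original value, pc holds the address that was saved on the stack (parity bit included), and the only register clobbered along the way is ip.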