Recently a colleague and I were discussing some of the nuances of the ARM
Procedure Call Standard, related to some work I did adding armv4t support to
LLVM.
First, let’s take a step back and review what a “Calling Convention” is, and
lay the ground rules for them on more modern ARM architectures before jumping
into the weeds with armv4t. Calling conventions are a low-level contract
describing how arguments are passed to function calls, how return values are
passed back, and how responsibility for saving and restoring registers around
function calls is divided between caller and callee.
ARM EABI has a particularly straightforward calling convention for C functions
that designates (ignoring VFP registers):
r0-r3 as caller-saved argument / scratch registers, with results passed in r0-r1
r4-r8 as callee-saved registers
r9 as either caller or callee saved, depending on the target platform
r10-r11 as callee-saved registers
r12-r15 as special-purpose registers (ip, sp, lr, pc)
Callee-saved registers must be preserved by the called function if it uses
them, and caller-saved registers must be saved by the calling function if it
needs their values after the call. This particular allocation of a small
number of caller-saved registers, with the bulk of the register set allocated
as callee-saved registers, favors small functions with fewer live ranges (and
therefore less register pressure). This seems to be a good tradeoff since it
allows small functions to avoid touching the stack to spill live ranges, and
leaves a big set of registers for use in large functions where the cost of
saving & restoring them is better amortized over the life of the function.
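The register roles listed above can be summarized as a small lookup. This is only a sketch of the split as described in this post: the enum and the classify_reg helper are made-up names for illustration, not anything taken from the ABI documents or from LLVM.

```c
#include <assert.h>

/* Roles a core register can play under the convention described above. */
enum reg_role {
    CALLER_SAVED,      /* r0-r3: arguments, scratch, results */
    CALLEE_SAVED,      /* r4-r8 and r10-r11 */
    PLATFORM_DEFINED,  /* r9: depends on the target platform */
    SPECIAL_PURPOSE    /* r12-r15 */
};

/* Hypothetical helper: classify core register rN, for 0 <= n <= 15. */
enum reg_role classify_reg(unsigned n)
{
    assert(n <= 15);
    if (n <= 3)
        return CALLER_SAVED;
    if (n == 9)
        return PLATFORM_DEFINED;
    if (n <= 11)
        return CALLEE_SAVED;
    return SPECIAL_PURPOSE;
}
```

For instance, classify_reg(2) gives CALLER_SAVED and classify_reg(10) gives CALLEE_SAVED, which matches the idea that a small leaf function can get by on r0-r3 without touching the stack.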
In practice, a C function like this:
int melanger(int lhs, int rhs) {
    return lhs + rhs;
}
will most likely get lowered to something like:
melanger:
add r0, r0, r1
bx lr
Of note here is that the compiler passed lhs and rhs in r0 and r1
respectively, and the result was returned in r0. What’s also interesting is
the bx instruction used for the jump back to the caller, in that it allows
“interworking” between the two instruction sets, Arm and Thumb, that most
32-bit ARM cores support. At branch time, depending on the parity of the
address stored in lr, the chip works out whether it needs to switch modes
between Arm and Thumb to correctly decode the instruction stream. A fun
side-effect of that is that pointers to Thumb functions are odd, and pointers
to Arm functions are even.
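To make the parity trick concrete, here is a minimal C model of how an interworking branch interprets its target. The function names are invented for this sketch, and it simplifies the real rules (for example, it ignores the word alignment of Arm targets).

```c
#include <assert.h>
#include <stdint.h>

enum isa { ISA_ARM, ISA_THUMB };

/* Hypothetical helper: the value a function pointer (or lr) would hold for
   code located at code_addr. Thumb targets get bit 0 set. */
uint32_t interworking_address(uint32_t code_addr, enum isa isa)
{
    assert((code_addr & 1u) == 0); /* code is at least halfword aligned */
    return isa == ISA_THUMB ? (code_addr | 1u) : code_addr;
}

/* What bx does with such a value: bit 0 picks the instruction set... */
enum isa isa_selected_by_bx(uint32_t target)
{
    return (target & 1u) ? ISA_THUMB : ISA_ARM;
}

/* ...and execution continues at the address with that bit cleared. */
uint32_t address_fetched_by_bx(uint32_t target)
{
    return target & ~(uint32_t)1;
}
```

So a Thumb function whose code sits at 0x8000 is referred to as 0x8001, and bx strips the low bit again before fetching.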
The call sites of this function will look something like this in the case of a
direct call:
mov r0, ...
mov r1, ...
blx melanger
mov ..., r0
Indirect calls (i.e. through a function pointer) look a bit different:
mov r0, #1
mov r1, #2
ldr r3, .funptr
ldr r3, [r3]
blx r3
mov ..., r0
And that brings me to one of the first weirditudes of armv4: there is no blx
instruction, so indirect calls must use some other sequence. The most obvious
candidate would be mov lr, pc; bx rN, but that has a subtle incompatibility
if you attempt to run such code on a newer ARM core, since the mov
“magically” drops the parity bit, causing the instructions after the bx to
be interpreted as Arm instructions when the callee returns… which the whole
rest of the toolchain isn’t prepared for.
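Here is a rough C model of why the mov lr, pc candidate goes wrong in Thumb code. It assumes the usual Thumb rule that reading pc yields the address of the current instruction plus 4 with bit 0 clear; the helper names and the example address are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Value a Thumb instruction at insn_addr sees when it reads pc:
   its own address plus 4, with bit 0 clear. */
uint32_t thumb_read_pc(uint32_t insn_addr)
{
    assert((insn_addr & 1u) == 0);
    return insn_addr + 4;
}

/* lr after `mov lr, pc`, where the mov sits at mov_addr. The 2-byte mov is
   followed by a 2-byte branch, so this is the address of the instruction
   after the branch, but without the Thumb bit. */
uint32_t lr_after_mov_lr_pc(uint32_t mov_addr)
{
    return thumb_read_pc(mov_addr);
}

/* Nonzero if an interworking branch to target lands in Thumb state. */
int enters_thumb(uint32_t target)
{
    return (target & 1u) != 0;
}
```

With the mov at 0x8000, lr ends up as 0x8004: the right place to return to, but marked as Arm code. An interworking return through that value leaves the core decoding Thumb instructions as Arm ones.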
Instead, we take advantage of the fact that the bl instruction saves the
return address in the link register, and create a sort of “branch island” to
take care of the interworking bits:
mov r0, #1
mov r1, #2
ldr r3, .funptr
bl .island
mov ..., r0
...
.island:
bx r3
This avoids the broken sequence, and maximizes compatibility with a bunch of
early ARM architecture variants. Iain Sandoe and I fixed that in r223380, if
you’d like to take a look at the implementation details.
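As a sanity check on the island sequence, the following C sketch steps a toy CPU through bl .island, bx r3, and the callee's bx lr. It assumes a Thumb bl writes the address of the following instruction with bit 0 set into lr; the struct, the function names, and all of the addresses are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cpu {
    uint32_t pc;  /* address of the next instruction to fetch */
    uint32_t lr;  /* link register */
    int thumb;    /* 1 in Thumb state, 0 in Arm state */
};

/* Thumb `bl target` located at bl_addr (the bl occupies 4 bytes). */
void thumb_bl(struct cpu *c, uint32_t bl_addr, uint32_t target)
{
    assert(c->thumb && (bl_addr & 1u) == 0);
    c->lr = (bl_addr + 4) | 1u;       /* return address, Thumb bit set */
    c->pc = target & ~(uint32_t)1;    /* bl itself does not switch state */
}

/* `bx rN`: bit 0 of the register value selects the state. */
void bx(struct cpu *c, uint32_t value)
{
    c->thumb = (int)(value & 1u);
    c->pc = value & ~(uint32_t)1;
}

/* Made-up layout: bl at 0x8006, island at 0x8100, callee pointer in r3. */
struct cpu call_through_island(uint32_t r3)
{
    struct cpu c = { 0x8006u, 0u, 1 };
    thumb_bl(&c, 0x8006u, 0x8100u);   /* bl .island */
    bx(&c, r3);                       /* .island: bx r3, lr untouched */
    bx(&c, c.lr);                     /* callee returns with bx lr */
    return c;
}
```

Whether r3 points at a Thumb callee (0x9001) or an Arm one (0x9000), the model comes back to 0x800a in Thumb state, because the return address written by bl carries the state bit with it.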
Another place that armv4 gets weird is during function epilogues, which would
normally leverage pop {..., pc} to restore various callee-saved registers and
jump back to the caller. Unfortunately on armv4, a pop into pc doesn’t honor
the parity bit, so interworking through it is not permitted. Instead, we have
to return through bx like so:
pop {r3}
add sp, #offset # pop off whatever other stack we used
bx r3
Things get even more complicated if the function needs to use r3 as part of
the returned values, since that hinders our ability to use that register to
help with interworking. It is further complicated by the fact that there
aren’t many other registers that we can even use for this. Any callee-saved
register is obviously out of the question, since we would need to use it after
we’ve restored it. That limits us to a caller-saved one, or one of those
special-purpose registers mentioned earlier. The best candidate is therefore
the ip register, i.e. r12, which is reserved for the linker as a scratch
register for use between function calls, and in that case we lower the
epilogue as:
mov ip, r3 # shuffle part of return value into ip
pop {r3} # lr from the call site
add sp, #offset # pop off whatever other stack we used
mov lr, r3 # shuffle return address into lr
mov r3, ip # restore r3 as part of return value
bx lr # return to caller
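The shuffle above is easy to get wrong by one move, so here is a small C model that replays the six instructions on a toy register file. It is a sketch only: the struct and function names are invented, and the stack is modelled as an array of words with sp as an index rather than a byte address.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct regs {
    uint32_t r3;  /* holds part of the return value on entry */
    uint32_t ip;  /* r12, used as scratch */
    uint32_t lr;
    size_t sp;    /* index of the top-of-stack word */
};

/* Replays the epilogue; returns the value that `bx lr` branches to.
   extra_words stands in for #offset, counted in words. */
uint32_t run_epilogue(struct regs *r, const uint32_t *stack, size_t extra_words)
{
    assert(r != NULL && stack != NULL);
    r->ip = r->r3;           /* mov ip, r3 */
    r->r3 = stack[r->sp++];  /* pop {r3} */
    r->sp += extra_words;    /* add sp, #offset */
    r->lr = r->r3;           /* mov lr, r3 */
    r->r3 = r->ip;           /* mov r3, ip */
    return r->lr;            /* bx lr */
}
```

After the call, r3 holds the same value it had on entry, lr holds the saved return address including its low bit, and only ip has been clobbered.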
For those curious about the implementation details, I implemented this
workaround in r214881.