Why does my stack have an extra 4 bytes? Digging into Clang's return value implementation
I was futzing around with some C code a few days ago and noticed that
executables generated by Clang would sometimes have an
extra 4 bytes on the stack. This was just for the `main` function. We can verify this in Compiler Explorer; switch to GCC and it doesn’t happen. This was interesting, so
I spent a few hours over the holiday digging into why and how this happens. For
the remainder of this post we will be using these 3 examples, which you can play
with in Compiler Explorer.
The terminology will become clear as we go along. This is my first time digging
into LLVM’s sources, and my first time reading about SSA form. I’m going to do
a lot of hand-waving in pursuit of the goal. To follow along in LLVM IR, you
can add `-emit-llvm`
to the “Compiler options…” text field.
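If you just want the flavor without clicking through, the three programs are small variations on `main()` along these lines (my reconstructions from the IR we will see later, so treat the bodies as approximate; the Compiler Explorer links have the exact code).

Example 1, the “dominating multiple store”:

```c
/* Two stores hit the return slot: the implicit zero from the prolog,
 * then the one for the explicit `return 0`. The latter dominates the
 * epilogue. */
int main(void) {
    int s = 5; /* some local, so the slot isn't the only stack object */
    return 0;
}
```

Example 2, the “dominating single store”:

```c
/* Only the implicit prolog store; control falls off the end of main. */
int main(void) {
    int s = 5;
}
```

Example 3, the “non-dominating store”:

```c
/* Both returns store to the slot; neither store dominates the epilogue. */
int main(void) {
    int s = 5;
    if (s == 5)
        return 1;
    return 0;
}
```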
So where do we start from? Some preliminary Google searching only revealed this mailing list post.¹ Of course, when I started down this rabbit hole, I only had the dominating multiple store example, since that is the canonical way to write `main()` in modern C. So the mailing list reply didn’t make complete sense.
Yes. We allocate an implicit local variable to hold the return value; return statements then just initialize the return slot and jump to the epilogue, where the slot is loaded and returned.
Clearly the slot was being allocated (`dword ptr [rbp - 4]` is the slot, `dword ptr [rbp - 8]` is `int s`), but the function epilogue was never loading it into `eax`, where the return value is expected.
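Concretely, the -O0 output for the first example looks roughly like this (approximate; exact stack offsets and instruction selection depend on the Clang version):

```asm
main:                                   ; example 1, clang -O0 (approximate)
        push    rbp
        mov     rbp, rsp
        mov     dword ptr [rbp - 4], 0  ; the mysterious extra slot
        mov     dword ptr [rbp - 8], 5  ; int s
        xor     eax, eax                ; return 0 goes straight into eax
        pop     rbp
        ret
```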
If we go to dominating single store next, the slot is gone!
Finally, in the non-dominating store the slot is present and it is actually used in the epilogue, with both return branches setting it to the requisite value. This is fascinating!
Reserving some space
Why does the first example have stack space but not use it, while the second doesn’t have it at all, and the third uses it as we expect?
Let’s track down the place where the slot is actually reserved. This is done while emitting the function prolog. If the function declaration’s `HasImplicitReturnZero` attribute is set, a store is generated that puts `Zero` in `ReturnValue`.
This is responsible for the LLVM IR:

```llvm
store i32 0, i32* %1, align 4
```

or the corresponding assembly:

```asm
mov dword ptr [rbp - 4], 0
```
Enforcing the standards
The `HasImplicitReturnZero` attribute is set in Clang’s Sema component, which seems to be in charge of enforcing various C/C++ spec semantics. One of the rules is that the `main` function must have a return type of `int`, and if control falls through to the end, the return value must be zero. This is handled here.
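As a quick illustration (my own snippet, not one of the three examples): `main` is special-cased here, so falling off its end is well-defined.

```c
/* Falling off the end of main() is defined to return 0 (C99 and later,
 * and C++). For any other non-void function, using the result after
 * falling off the end would be undefined behavior. */
int main(void) {
    /* no return statement: Sema sets HasImplicitReturnZero, and the
     * process exit status is 0 -- check with `echo $?` */
}
```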
Pruning in the epilogue
The prologue generator unconditionally adds the return value slot to `main()`.
Where does it go? Well, the epilogue generation performs some analysis and
removes it. This stage explains all the differences we are seeing in our three
samples. Essentially, Clang is doing some of its own optimizations before LLVM
kicks in. I’m not sure why. Here is the code for the cases we care about:
```cpp
// If there is a dominating store to ReturnValue, we can elide
// the load, zap the store, and usually zap the alloca.
if (llvm::StoreInst *SI = findDominatingStoreToReturnValue(*this)) {
  // Reuse the debug location from the store unless there is
  // cleanup code to be emitted between the store and return
  // instruction.
  if (EmitRetDbgLoc && !AutoreleaseResult)
    RetDbgLoc = SI->getDebugLoc();
  // Get the stored value and nuke the now-dead store.
  RV = SI->getValueOperand();
  SI->eraseFromParent();

  // If that was the only use of the return value, nuke it as well now.
  auto returnValueInst = ReturnValue.getPointer();
  if (returnValueInst->use_empty()) {
    if (auto alloca = dyn_cast<llvm::AllocaInst>(returnValueInst)) {
      alloca->eraseFromParent();
      ReturnValue = Address::invalid();
    }
  }

  // Otherwise, we have to do a simple load.
} else {
  RV = Builder.CreateLoad(ReturnValue);
}
```
Hmm… lots of zapping here. Let’s think through how this applies to our 3 examples. First, we attempt to find a “dominating store” for the return value. The implementation here is very basic, and I go into more detail later in this post. We want to find a store that will definitely be executed on every run of the function, and will be the last store to modify the return value slot in this function. Clang won’t let us see the intermediate steps, but here is how the LLVM IR would look² before epilogue generation happens.
Example 1:

```llvm
%1 = alloca i32, align 4 ; ReturnValue
%2 = alloca i32, align 4 ; int s
store i32 0, i32* %1, align 4
; prolog ends
...
store i32 0, i32* %1, align 4 ; return 0
```

Example 2:

```llvm
%1 = alloca i32, align 4
%2 = alloca i32, align 4
store i32 0, i32* %1, align 4
; prolog ends
```

Example 3:

```llvm
%1 = alloca i32, align 4
%2 = alloca i32, align 4
store i32 0, i32* %1, align 4
; prolog ends
...
; if (condition) return 1;
store i32 1, i32* %1, align 4
; jmp over the store below
store i32 0, i32* %1, align 4 ; return 0
; epilog
```
Let’s deal with example 3 first, since it goes to the simpler `else` branch. Clang can’t find a dominating store instruction because both branches modify the return value slot. Instead, we load the return value slot into register `%8` and return the register.
```llvm
; epilog
%8 = load i32, i32* %1, align 4
ret i32 %8, !dbg !21
```

This translates to assembly as:

```asm
mov eax, dword ptr [rbp - 4]
```
In examples 1 and 2, because there are no branches, we find a dominating store instruction: the one that is sequentially latest. We retrieve the value that store was attempting to write to the slot, and set that as the return value instead. In assembly, this maps to moving the value straight into `eax`. Then, in both examples, the dominating store is erased from the IR. In example 1 this is the store for the explicit `return 0`. In example 2 this is the implicit store from the prolog.
The `ReturnValue` users are now down to zero in example 2, so we find the initial `alloca` instruction and nuke that too. That is why our IR looks the way it does. In example 1, the leftover stack allocation and store are completely useless (since we already replaced the real return value with the dominating store’s value), but since the use count was non-zero, Clang didn’t nuke the allocation, and LLVM faithfully preserved it.
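If my reading is right, the post-pruning IR ends up roughly like this (a sketch, not verbatim Clang output).

Example 1:

```llvm
%1 = alloca i32, align 4        ; leftover ReturnValue slot
%2 = alloca i32, align 4        ; int s
store i32 0, i32* %1, align 4   ; leftover implicit store from the prolog
...
ret i32 0                       ; the dominating store's value, used directly
```

Example 2:

```llvm
%1 = alloca i32, align 4        ; int s -- the ReturnValue alloca is gone
...
ret i32 0
```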
That explains why we end up with these 3 variations, but it seems very
inelegant with the wasted stack space. Why not track the return statements and
directly use the one that matters? What is this “dominating store”? What is
this talk of phi in the second part of the cfe-dev
reply?
We don’t use a phi because the control flow for getting to the epilogue is not necessarily as simple as a simple branch, due to cleanups in local scopes (like C++ destructors).
SSA form and Phi values
Static Single Assignment form is a style of intermediate representation where every variable can only be assigned once, so compilers create a new register/variable each time they want to update an existing one. In sequential execution it is easy to see which one of these holds the latest value. Of course, as the Wikipedia example shows, this breaks down at conditionals. Phi is a special instruction that resolves this: it selects a value based on which branch execution actually arrived from, and the generated assembly implements it with labels, branches, and register moves.
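Here is a minimal, hypothetical LLVM IR function (not one of our examples) showing a phi merging values from two branches:

```llvm
; Return a different constant depending on %cond. The phi in %merge
; picks 1 if we arrived from %then and 2 if we arrived from %else.
define i32 @select_value(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else
then:
  br label %merge
else:
  br label %merge
merge:
  %result = phi i32 [ 1, %then ], [ 2, %else ]
  ret i32 %result
}
```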
This seems like what we want. Let the return value be captured by introducing additional registers, with phi values capturing branches, and let the register allocator deal with bringing the register count back to reality!
Making trade-offs
But we don’t have any phis! Apparently, LLVM front-ends like Clang should just use stack memory (obtained via the `alloca` instruction) and let LLVM’s `mem2reg` pass promote stack values to registers and insert phi nodes as necessary. So Clang avoids using phi nodes (although they do get generated for boolean conditionals; see the snippet below). The mailing list reply is clarifying that just using a stack memory location particularly simplifies epilogues, where the C++ standard may specify cleanup (like running destructors) that could complicate things. I’m not sure exactly how this would happen. If you know, get in touch!
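For instance (my own minimal test, not one of the post’s examples), compiling `int f(int a, int b) { return a && b; }` at -O0 produces branches for the short-circuit and merges the result with a phi, roughly:

```llvm
; The i1 result of `a && b` merges two paths: false if the left-hand
; side already failed, the right-hand side's value otherwise.
land.end:
  %res = phi i1 [ false, %entry ], [ %tobool2, %land.rhs ]
  %conv = zext i1 %res to i32
  ret i32 %conv
```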
Of course, in the default non-optimized builds, LLVM doesn’t try to eliminate
this allocation. We are stuck with the return value slot.
To minimize the “damage”, Clang tries to clean things up in the easier cases. I won’t go into the details of LLVM’s blocks and SSA implementation; that is covered reasonably well in other places.
`findDominatingStoreToReturnValue` first checks whether the return value slot is referenced by multiple instructions. If it is, Clang looks at the last instruction in the current basic block (the insertion point) and picks it if and only if it is a store to the slot. Otherwise, there is only one instruction referencing the return value; if it is a store, and it is possible to reach the current insertion point (the end of the function) from it unconditionally and sequentially, then that store is definitely the last access and is the dominating store. Clang is being quick and dirty here. Instead of trying to do a full phi reduction in cases where there may be branches, it just gives up and eats the memory cost of the stack slot. This falls out of the semantics of `getSinglePredecessor()`, which returns `NULL` if more than one block leads to the current block.
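Here is a rough sketch of that logic in C++ (my paraphrase of the behavior described above; the real `findDominatingStoreToReturnValue` in Clang’s sources differs in details):

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"

// Sketch only: mirrors the prose above, not the actual Clang source.
llvm::StoreInst *findDominatingStoreSketch(llvm::Value *ReturnValuePtr,
                                           llvm::BasicBlock *InsertBlock) {
  if (!ReturnValuePtr->hasOneUse()) {
    // Multiple uses: only accept the very last instruction of the
    // current (epilogue) block, and only if it stores to the slot.
    if (InsertBlock->empty())
      return nullptr;
    auto *SI = llvm::dyn_cast<llvm::StoreInst>(&InsertBlock->back());
    return (SI && SI->getPointerOperand() == ReturnValuePtr) ? SI : nullptr;
  }

  // Single use: it must be a store to the slot...
  auto *SI = llvm::dyn_cast<llvm::StoreInst>(*ReturnValuePtr->user_begin());
  if (!SI || SI->getPointerOperand() != ReturnValuePtr)
    return nullptr;

  // ...and the insertion point must be reachable from it through
  // straight-line control flow: walk single predecessors backwards.
  llvm::BasicBlock *BB = InsertBlock;
  while (BB != SI->getParent()) {
    BB = BB->getSinglePredecessor();
    if (!BB)
      return nullptr; // a merge point: more than one predecessor, give up
  }
  return SI;
}
```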
All of this makes sense. Clang is only a front-end, not a full optimizing compiler, so it isn’t putting in much effort here. Can LLVM do better? Try adding `-O1` to the compiler options for the first example. LLVM successfully eliminates the extra memory!³ 😀
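For the first example, the whole function collapses to something like this (approximate output):

```asm
main:                       ; example 1 at -O1 (approximate)
        xor     eax, eax    ; return 0: no frame, no slot, no stores
        ret
```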
I love these kinds of small technical details, because they are engineering solutions that combine parts of theoretical CS with practical restrictions like x86 calling conventions and the C standard. This slot is out there in thousands of executables produced by Clang, staying in the shadows until duty calls!
1. cfe presumably stands for Clang Front-end. This is because Clang is a “front-end” to the LLVM compiler toolchain. It is capable of converting C/C++/Objective-C to LLVM Intermediate Representation, after which LLVM’s optimization passes and platform-specific backends can translate this to x86 executables. ↩︎
2. Return statement codegen for scalars is handled here. ↩︎
3. In fact, Clang’s `-O1` also runs constant propagation. In example 3, it determines that the `return 1` case will always be executed and eliminates all loads and stores! ↩︎