Why does my stack have an extra 4 bytes? Digging into Clang's return value implementation

I was futzing around with some C code a few days ago and noticed that executables generated by Clang would sometimes have an extra 4 bytes on the stack. This was just for the main function. We can verify this is Compiler Explorer. Try switching to GCC and this doesn’t happen. This was interesting, so I spent a few hours over the holiday digging into why and how this happens. For the remainder of this post we will be using these 3 examples, which you can play with in Compiler Explorer.

  1. Dominating multiple store

  2. Dominating single store

  3. Non dominating store

The terminology will become clear as we go along. This is my first time digging into LLVM’s sources, and my first time reading about SSA form. I’m going to do a lot of hand-waving in pursuit of the goal. To follow along in LLVM IR, you can add -emit-llvm to the “Compiler options…” text field.

So where do we start from? Some preliminary Google searching only revealed this mailing list post.1 Of course, when I started down this rabbit-hole, I only had the dominating multiple store example, since that is the canonical way to write main() in modern C. So the mailing list reply didn’t make complete sense.

Yes. We allocate an implicit local variable to hold the return value; return statements then just initialize the return slot and jump to the epilogue, where the slot is loaded and returned.

Clearly the slot was being allocated (dword ptr [rbp - 4] is the slot, dword ptr [rbp - 8] is int s), but the function epilogue was never loading it into eax where the return value is expected.

If we go to dominating single store next, the slot is gone!

Finally, in the non-dominating store the slot is present and it is actually used in the epilogue, with both return branches setting it to the requisite value. This is fascinating!

Reserving some space

Why does the first example have stack space, but not use it, while the second doesn’t use it at all, and the third uses it as we expect?

Let’s track down the place the slot is actually reserved. This is done while emitting the function prolog, If the function declaration’s HasImplicitReturnZero attribute is set, a store is generated that puts Zero in ReturnValue. This is responsible for the LLVM IR:

store i32 0, i32* %1, align 4

or the corresponding assembly

mov dword ptr [rbp - 4], 0

Enforcing the standards

The HasImplicitReturnZero attribute is set in Clang’s Sema component. It seems to be in charge of enforcing various C/C++ spec semantics. One of the rules is that the main function must have a return value of int, and if the code falls through to the end, this return value must be zero. This is handled here.

Pruning in the epilogue

The prologue generator unconditionally adds the return value slot to main(). Where does it go? Well, the epilogue generation performs some analysis and removes it. This stage explains all the differences we are seeing in our three samples. Essentially, Clang is doing some of its own optimizations before LLVM kicks in. I’m not sure why. Here is code for the cases we care about:

// If there is a dominating store to ReturnValue, we can elide
// the load, zap the store, and usually zap the alloca.
if (llvm::StoreInst *SI =
        findDominatingStoreToReturnValue(*this)) {
  // Reuse the debug location from the store unless there is
  // cleanup code to be emitted between the store and return
  // instruction.
  if (EmitRetDbgLoc && !AutoreleaseResult)
    RetDbgLoc = SI->getDebugLoc();
  // Get the stored value and nuke the now-dead store.
  RV = SI->getValueOperand();
  SI->eraseFromParent();

  // If that was the only use of the return value, nuke it as well now.
  auto returnValueInst = ReturnValue.getPointer();
  if (returnValueInst->use_empty()) {
    if (auto alloca = dyn_cast<llvm::AllocaInst>(returnValueInst)) {
      alloca->eraseFromParent();
      ReturnValue = Address::invalid();
    }
  }

// Otherwise, we have to do a simple load.
} else {
  RV = Builder.CreateLoad(ReturnValue);
}

Hmm… lots of zapping here. Let’s think through how this applies to our 3 examples. First, we attempt to find a “dominating store” for the return value. The implementation here is very basic, and I go into detail further in this post. We want to find a store that will definitely be executed on every run of the function, and will be the last store to modify the return value slot in this function. Clang won’t let us see the intermediate steps, but here is how the LLVM IR would look2 before epilogue generation happens.

Example 1:

%1 = alloca i32, align 4 ; ReturnValue
%2 = alloca i32, align 4 ; int s
store i32 0, i32* %1, align 4
; prolog ends
...
store i32 0, i32* %1, align 4 ; return 0

Example 2:

%1 = alloca i32, align 4
%2 = alloca i32, align 4
store i32 0, i32* %1, align 4
; prolog ends

Example 3:

%1 = alloca i32, align 4
%2 = alloca i32, align 4
store i32 0, i32* %1, align 4
; prolog ends
...
; if (condition) return 1;
store i32 1, i32* %1, align 4
; jmp over the store below
store i32 0, i32* %1, align 4 ; return 0
; epilog

Let’s deal with example 3 which goes to the simpler else branch. Clang can’t find a dominating store instruction because both branches modify the return Clang slot. Instead we load the return value slot into register %8 and return the register.

; epilog
%8 = load i32, i32* %1, align 4,
ret i32 %8, !dbg !21

This translates to assembly as

mov eax, dword ptr [rbp - 4]

In examples 1 and 2, because there are no branches, we find a dominating store instruction; the one that is sequentially latest. We retrieve the value that store was attempting to write to the slot, and set that as the return value instead. In assembly, this will map to loading straight into eax. Then, in both examples, the dominating store is erased from the IR. In example 1 this is the store for the explicit return. In example 2 this is the implicit store from the prolog.

The ReturnValue users are now down to zero in example 2. So we find the initial alloca instruction and nuke that too. That is why our IR looks the way it does. In example 1, the leftover stack allocation and store is completely useless (since we already replaced the real return value with the dominating store), but since the use count was non-zero, Clang didn’t nuke the allocation, and LLVM faithfully preserved it.

That explains why we end up with these 3 variations, but it seems very inelegant with the wasted stack space. Why not track the return statements and directly use the one that matters? What is this “dominating store”? What is this talk of phi in the second part of the cfe-dev reply?

We don’t use a phi because the control flow for getting to the epilogue is not necessarily as simple as a simple branch, due to cleanups in local scopes (like C++ destructors).

SSA form and Phi values

Static Single Assignment form is a style of intermediate representation where every variable can only be assigned once. So compilers create new registers/variables for every time they want to update an existing one. In sequential execution it is easy to see which one of these is the latest value. Of course, as the Wikipedia example shows, this breaks down at conditionals. Phi is a special instruction equivalent to delayed evaluation. It captures both branches, and the generated assembly simply generates labels and branches to capture that in reality.

This seems like what we want. Let the return value be captured by introducing additional registers, with phi values capturing branches, and let the register allocator deal with bringing the register count back to reality!

Making trade-offs

But we don’t have any phis! Apparently Clang front-ends should just use stack memory (obtained by using the alloca instruction) and let LLVM’s mem2reg pass promote stack values to registers and insert phi nodes as necessary. So Clang avoids using phi nodes (although they do get generated for boolean conditionals). The mailing list reply is clarifying that just using a stack memory location particularly simplifies epilogues, where the C++ standard may specify cleanup that could complicate things. I’m not sure exactly how this would happen. If you do, get in touch! Of course, in the default non-optimized builds, LLVM doesn’t try to eliminate this allocation. We are stuck with the return value slot.

To minimize the “damage”, Clang tries to clean things up in the easier cases. I won’t go into the details of LLVM’s blocks and SSA implementation; that is covered reasonably in other places. findDominatingStoreToReturnValue first checks if the return value was referenced by multiple instructions. If it was, Clang picks the last instruction in the current basic block (insertion point) and picks that if and only if it is a store. Otherwise, there is only one instruction referencing the return value. If it is a store and it is possible to reach unconditionally and sequentially from that point to the current insertion point (which is the end of the function), then that store is definitely the last access and is the dominating store. Clang is being quick and dirty here. Instead of trying to do a full Phi reduction in cases where there may be branches, it just gives up and eats the memory cost of the stack store. This falls out from the semantics of getSinglePredecessor() which will return NULL if more than one blocks lead to the current block.

All of this makes sense. Clang is not a full compiler and isn’t putting in effort. Can LLVM do better? Try adding -O1 to the compiler options for the first example. LLVM successfully eliminates the extra memory!3 😀

I love these kind of small technical details, because they are engineering solutions that combine parts of theoretical CS, with engineering restrictions like x86 calling conventions and C standards. This slot is out there in thousands of executables produced by Clang, staying in the shadows until duty calls!


  1. cfe presumably stands for Clang Front-end. This is because Clang is a “front-end” to the LLVM compiler toolchain. It is capable of converting C/C++/Objective-C to LLVM Intermediate Representation, after which LLVM’s optimization passes and platform specific backends can translate this to x86 executables. [return]
  2. Return statement codegen for scalars is handled here. [return]
  3. In fact, Clang’s -O1 also runs constant propagation. In example 3, it determines that the return 1 case will always be executed and eliminates all load and stores! [return]