# Getting to Deterministic Builds on Windows

(Disclaimer: Some of this post discusses projects from my job. All opinions and mistakes here are my own.)

This is a set of notes on getting to deterministic builds in C, C++ and Rust on Windows.

The primary motivation for this is not the lofty goal of a Reproducible Build, but simply improving our Bazel cache hit rates.

## A quick primer on Bazel caching

At Dropbox, much of our build is powered by Bazel, and I was involved in making that a reality. One of the core benefits of Bazel is that once you buy into the model, you get remote caching for free. This means a local developer can benefit from the thousands of hours that CI machines spend cranking on the build, and just pull down those artifacts instead of waiting several minutes for a full local rebuild.

The Bazel cache works at the action level, where an action is usually a unique command run that produces some outputs from some inputs. Bazel calculates checksums of all inputs and outputs and uses this to influence decisions about when to use the cache. For a given action, if all your local input hashes match the hashes in cache, Bazel can re-use the output from the cache instead of rebuilding it.
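To make this concrete, here is a minimal Python sketch of action-level keying. This is not Bazel's actual implementation, just the core idea: the cache key covers the command line and the digest of every input, so matching keys let you reuse a cached output.

```python
import hashlib

def file_hash(content: bytes) -> str:
    # Bazel uses content digests of every input file (simplified here).
    return hashlib.sha256(content).hexdigest()

def action_key(command, input_digests):
    # The key covers the command line *and* every input digest, so a
    # changed flag or a changed input both invalidate the cache entry.
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode())
    for path in sorted(input_digests):
        h.update(path.encode())
        h.update(input_digests[path].encode())
    return h.hexdigest()

inputs = {"foo.c": file_hash(b"int main(void) { return 0; }\n")}

# Two machines with identical flags and inputs compute the same key...
key_ci = action_key(["cl.exe", "/c", "foo.c"], inputs)
key_dev = action_key(["cl.exe", "/c", "foo.c"], inputs)
assert key_ci == key_dev

# ...while changing a flag yields a different key, hence a cache miss.
key_opt = action_key(["cl.exe", "/c", "/O2", "foo.c"], inputs)
assert key_opt != key_ci
```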

In such a model, we still want two things to be true.

### Correctness

If the build is going to re-use outputs when inputs are the same, we want to make sure that our compilers and other tools actually produce the same outputs for the same inputs. What “same” means here lies on a spectrum, because not all tools are designed for this. At a minimum, you want the outputs to be functionally identical. That is, say you have inputs A and B, and a tool represented by a pure function F(inputs...) -> output, where the output is some executable.

```
hash(A): foxtrot
hash(B): tango

# First build
F(A, B) -> C
hash(C): whiskey

# Second build
F(A, B) -> D
hash(D): romeo
```

Imagine we execute C and D and they produce different results! Or, say C is a debug-mode executable and D is one with optimizations, so one runs faster than the other. This is incorrect! Our function F did not produce the same outputs for the same inputs, which means F itself has some implicit configuration or state that is changing its behavior. In the context of build systems, this usually means the build system didn’t treat the command line and compiler flags as “inputs”.

Bazel goes to great lengths to force you to describe all these things very pedantically in the build description, so that it can track them as inputs. A combination of toolchains, well-defined inputs and outputs, and sandboxing is used to enforce this. This usually means we don’t have to deal with such egregious correctness differences, and it becomes acceptable for the same inputs to produce outputs that behave identically but hash differently. This will manifest in two ways. One, Bazel consults the cache, sees the inputs are the same, and instead of producing D (hash: romeo), just gets C (hash: whiskey) from the cache. This is usually OK, but we should try to minimize it.

The other way it can go is that Bazel decides to build locally, gets D, and now everything that depends on D is affected, slowing the local build.

### Speed

In terms of build-system classification, Bazel has a rebuilder that uses constructive traces to track the build. Roughly, this means that for a given action, only its immediate inputs affect caching. So say we had a common two-step process for building a C program:

```
$ gcc -c -o foo.o foo.c
$ gcc -o foo foo.o
```

Which is represented in Bazel roughly as:

```
cc_binary(
    name = "foo",
    srcs = ["foo.c"],
)
```

Bazel will run two actions:

```
compile action:
  inputs:  foo.c, hash: charlie
  outputs: foo.o, hash: delta

link action:
  inputs:  foo.o, hash: delta
  outputs: foo, hash: echo
```

Say we have a hypothetical compiler that produces a different foo.o for the same foo.c. That is, the hashes don’t match even though the objects behave identically.

```
CI machine:

compile action:
  inputs:  foo.c, hash: charlie
  outputs: foo.o, hash: delta

link action:
  inputs:  foo.o, hash: delta
  outputs: foo, hash: echo

shared cache now has keys: {charlie, delta}

Developer machine:

compile action:
  inputs:  foo.c, hash: charlie
  outputs: foo.o, hash: november (!)

link action:
  inputs:  foo.o, hash: november  (cache miss)
  outputs: foo, hash: zulu        (local build)
```

That is a pickle! Say Bazel’s heuristics decide to compile locally instead of hitting the cache. We end up with foo.o hashing to november, which is not in the cache from the CI machine, so Bazel is forced to also run the link step locally. You can see how this can spread. In large builds, you could easily have tens of C files per library, some of which are built locally and some pulled from the cache, and every time a hash mismatch happens, the build system is forced to rebuild everything downstream in the build graph!
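The cascade can be sketched as a toy model (the names, hashes, and cache layout here are illustrative, not real Bazel output):

```python
# Cache seeded by the CI machine: (action, input hash) -> output hash.
cache = {
    ("compile", "charlie"): "delta",  # foo.c -> foo.o
    ("link", "delta"): "echo",        # foo.o -> foo
}

def run_build(compile_output_hash):
    events = []
    # Compile step: suppose heuristics made us build foo.o locally.
    obj_hash = compile_output_hash
    events.append(f"compile -> {obj_hash}")
    # Link step: we can only use the cache if the object hash matches.
    if ("link", obj_hash) in cache:
        events.append("link -> cache hit")
    else:
        events.append("link -> cache miss, building locally")
    return events

# Deterministic compiler: local foo.o matches CI, so the link step hits.
assert run_build("delta")[-1] == "link -> cache hit"
# Nondeterministic compiler: foo.o hashes to november, link must rerun.
assert "miss" in run_build("november")[-1]
```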

Well, that hypothetical compiler exists and it is called the Microsoft Visual Studio C/C++ compiler.

What we really want is for every command we run to produce truly identical outputs, so we get maximum cache hits. The fact that identical hashes also mean truly reproducible builds is a nice side effect.

Other caching tools like ccache use similar algorithms.

Now, everything I’ve talked about so far is well known in certain circles. Let’s talk about how to actually address this on Windows!

## Fix date and time macros

The C standard provides two macros, `__DATE__` and `__TIME__`, that expand to the time of compilation. If a source file actually uses these to set variables or build strings, that is terrible for our purposes, because every compiler invocation produces new values. We are forced to override these. I don’t really know of any libraries that use them to affect behavior.

For MSVC, we can override the macro definitions:

```
cl.exe /D__DATE__=CONSTANT /D__TIME__=CONSTANT …
```
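To find out which sources would even notice the override, a quick scan for the macros can help. This is a hypothetical helper of my own, not part of any tool mentioned here; it also looks for `__TIMESTAMP__`, a common compiler extension with the same problem.

```python
import re

# Match the time-of-compile macros as whole tokens.
TIME_MACROS = re.compile(r"\b(__DATE__|__TIME__|__TIMESTAMP__)\b")

def find_time_macros(source: str):
    # Return the distinct time-of-compile macros referenced in a source
    # string, in sorted order.
    return sorted(set(TIME_MACROS.findall(source)))

sample = 'const char *built = "built " __DATE__ " " __TIME__;\n'
assert find_time_macros(sample) == ["__DATE__", "__TIME__"]
assert find_time_macros("int main(void) { return 0; }") == []
```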

## Fix dates and times in Portable Executables

In addition, the PE format used by Windows for executables and DLLs has a file header with a prominent `TimeDateStamp` field, which is inserted by the linker. There is an undocumented linker flag, `/Brepro`, that causes the linker to put a fixed value in this field.

There are a few more places in a PE that have timestamps. This includes the `IMAGE_EXPORT_DIRECTORY` and `IMAGE_RESOURCE_DIRECTORY` structures.

Fortunately, there is a nice tool called ducible that can be run on the file after linking to rewrite these bits with constant values.
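To spot-check whether the header timestamp was actually pinned, you can read `TimeDateStamp` directly. This sketch uses offsets from the PE/COFF spec (`e_lfanew` at 0x3C in the DOS header; `TimeDateStamp` 8 bytes past the `PE\0\0` signature) and builds a fake header just to exercise the parser:

```python
import struct

def pe_timedatestamp(data: bytes) -> int:
    # Per the PE/COFF spec: the DOS header starts with "MZ" and stores
    # the offset of the PE signature (e_lfanew) at 0x3C. The COFF header
    # follows the 4-byte "PE\0\0" signature: Machine (2 bytes),
    # NumberOfSections (2 bytes), then TimeDateStamp (4 bytes).
    assert data[:2] == b"MZ", "not a PE file"
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    assert data[e_lfanew:e_lfanew + 4] == b"PE\0\0", "bad PE signature"
    (stamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return stamp

# Build a minimal fake header to exercise the parser.
fake = bytearray(0x80)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x40)        # e_lfanew -> 0x40
fake[0x40:0x44] = b"PE\0\0"
struct.pack_into("<I", fake, 0x48, 0x12345678)  # TimeDateStamp
assert pe_timedatestamp(bytes(fake)) == 0x12345678
```

Running this over the same binary before and after ducible (or a `/Brepro` link) shows whether the field is actually constant.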

Integrating ducible with your build process can be a little involved. The nicest way I know of for Bazel is to tweak the toolchain definition. This has the nice property of not requiring the rest of your build to know about ducible; even custom rules that leverage cc_common to create actions will automatically benefit. If you are using a fixed toolchain configuration inspired by this Bazel example, you should replace the linker path with a custom target instead of a direct path to link.exe. This can be a batch script or similar that forwards all the linker options to link.exe, then runs ducible transparently before exiting.

This is not the approach we use internally, since I only thought of it recently. Instead, the few places that require this are aware of ducible and add it to the Bazel actions, or we stick it into tool wrappers. That is, our Rust compiler isn’t a direct call to rustc, but a wrapper script that does a bunch of things, one of which is running ducible.
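A wrapper along those lines might look like the following sketch. The tool paths and the choice of Python over a batch file are my assumptions, and you should check ducible's own usage text for the exact invocation; the idea is just: forward everything to link.exe, then rewrite the outputs in place.

```python
import subprocess
import sys

# Placeholder paths; point these at your actual toolchain layout.
LINK_EXE = r"C:\VS\bin\link.exe"
DUCIBLE = r"C:\tools\ducible.exe"

def output_paths(args):
    # link.exe names its outputs via /OUT: and /PDB: (case-insensitive
    # option names; the path keeps its original case).
    exe = pdb = None
    for arg in args:
        upper = arg.upper()
        if upper.startswith("/OUT:"):
            exe = arg[len("/OUT:"):]
        elif upper.startswith("/PDB:"):
            pdb = arg[len("/PDB:"):]
    return exe, pdb

def main(args):
    # Run the real linker first; only post-process on success.
    rc = subprocess.call([LINK_EXE] + args)
    if rc != 0:
        return rc
    exe, pdb = output_paths(args)
    if exe is not None:
        # ducible rewrites the image (and optionally its PDB) in place.
        rc = subprocess.call([DUCIBLE, exe] + ([pdb] if pdb else []))
    return rc
```

The entry point would be `sys.exit(main(sys.argv[1:]))`, so Bazel can point its linker at this script and the rest of the build never learns ducible exists.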

## Disable incremental builds

MSVC has an incremental build mode, where the linker adds extra information to files so subsequent builds are faster. This causes the hashes to change between runs. Since our build is powered by Bazel, incremental builds don’t really help us anyway. Disable this by passing /INCREMENTAL:NO to the linker.

## Deal with the PDBs

Program Database files (PDBs) are the real bane of this quest. We need them to debug software without shipping debug information to users, so we cannot simply disable producing these files. Everything I’ve discussed above is relatively well known, but this part I had to discover for myself.

PDBs have several problems:

1. A PDB and a PE are linked so debuggers know exactly which PDB to use from a symbol server. PE files have a debug section with a timestamp and a signature that identify the PDB. These change every time. This is fixed by ducible.
2. PE files encode the absolute path to the PDB by default.
3. PDB files contain absolute paths to all the resources involved, such as object files. This means we need to ensure all paths are the same on every machine.
4. Since PEs and PDBs are linked, changes to the PDB always change the PE. This means we need identical PDBs.
5. PDB files have a build identifier, so even minor version updates in MSVC lead to hash differences.

### Fix the paths

To fix #2, we can pass the `/PDBALTPATH:%_PDB%` flag to link.exe. This makes the linker embed just the PDB filename instead of an absolute path.

To fix #3 and #4, we need to control where build trees are located on all machines. The first problem is that developers will naturally have files in different locations, because the typical storage location is somewhere in C:\Users\<user name>, and <user name> is unique. If you are doing in-source builds, this will be a problem. Out-of-source builds are easier to fix: have your build system use a well-known location like C:\build instead of C:\Users\nikhil\path\to\repo\build. This way all absolute paths start with C:\build\... on any machine, fixing #3 to some extent.

Fortunately, Bazel always does out-of-source builds, so we can control where to place the build tree. In addition, external libraries and resources are also something to take care of. To handle both of these, we need to change Bazel’s output user root and output base, both of which are usually calculated from hashes and stored in the user home directory. The full details are on their website. Something like `--output_user_root=C:/bazel` and `--output_base=C:/bazel/base` is a good start.

The other source of changing paths is temporary files. Depending on your specific build steps, it is very common to run tools that build something in a temporary directory. You will need to change this to use a deterministic location. Using the hash of some set of inputs is a good way to go. We often use a hash of the repository-relative location of files. There are some sources of temporary files you cannot easily remove, such as this one I found in rustc.
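One way to sketch the input-hash approach: derive the temporary directory name from the repo-relative input paths, so every machine computes the same location for the same action. The helper here is my own illustration, not a tool we ship.

```python
import hashlib
import os
import tempfile

def deterministic_temp_dir(root, input_paths):
    # Name the directory after a digest of the sorted repo-relative
    # input paths: stable across machines, and order-independent.
    joined = "\0".join(sorted(input_paths)).encode()
    digest = hashlib.sha256(joined).hexdigest()
    path = os.path.join(root, digest[:16])
    os.makedirs(path, exist_ok=True)
    return path

root = tempfile.mkdtemp()
a = deterministic_temp_dir(root, ["lib/foo.c", "lib/foo.h"])
b = deterministic_temp_dir(root, ["lib/foo.h", "lib/foo.c"])
assert a == b  # same inputs, same directory, regardless of order
```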

With all these changes you should be getting deterministic PEs and PDBs as far as input files go.

### Pin the build version

The final bit I noticed was that PDBs have a major and minor build number in the DBI stream header. This can change between different versions of the same major toolchain, like Visual Studio 2017, so it is important to keep CI machines and developer machines on the same update cycle. An automated deployment strategy helps here, so every developer does not have to remember to manually update their installation.

As far as I’m aware this leads to bit-identical PEs and PDBs in my tests so far!

## Useful tools

- pdbdump - Shipped as part of ducible, pdbdump is great for inspecting the metadata in a PDB. This is what helped me spot the absolute paths in the PDB.
- dumpbin - Part of MSVC. `dumpbin /all` run on an EXE or DLL is a good way to see how headers and other metadata differ.
- A binary diff tool such as vbindiff - After you’ve exhausted the metadata approaches, sometimes you just have to jump into the hex and look at individual bytes. vbindiff is small and free. There is also a limited web version of diffoscope, but that tool itself does not work on Windows.