Getting to Deterministic Builds on Windows
(Disclaimer: Some of this post discusses projects from my job. All opinions and mistakes here are my own.)
This is a set of notes on getting to deterministic builds in C, C++ and Rust on Windows.
The primary motivation for this is not the lofty goal of a Reproducible Build, but simply improving our Bazel cache hit rates.
A quick primer on Bazel caching
At Dropbox, much of our build is powered by Bazel, and I was involved in making that a reality. One of the core benefits of Bazel is that once you buy into the model, you get remote caching for free. This means a local developer can benefit from the thousands of hours that CI machines spend cranking on the build, and just pull down those artifacts instead of waiting several minutes for a full local rebuild.
The Bazel cache works at the action level, where an action is usually a unique command run that produces some outputs from some inputs. Bazel calculates checksums of all inputs and outputs and uses this to influence decisions about when to use the cache. For a given action, if all your local input hashes match the hashes in cache, Bazel can re-use the output from the cache instead of rebuilding it.
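To make the action-level model concrete, here is a minimal sketch in Python of a content-addressed action cache. This is not Bazel's actual implementation; the class and function names are made up for illustration. The key point it demonstrates is that the command line itself is hashed alongside the input file contents, so changing a flag changes the cache key.

```python
import hashlib

def digest(data):
    """Content hash of an input file's bytes."""
    return hashlib.sha256(data).hexdigest()

def action_key(command, input_digests):
    """Cache key for one action: hashes the command line AND the inputs.

    Treating the command line as an input is essential -- changing a
    compiler flag must change the key, or we would reuse a stale artifact.
    """
    h = hashlib.sha256()
    for part in command:
        h.update(part.encode("utf-8") + b"\0")
    for d in sorted(input_digests):
        h.update(d.encode("ascii"))
    return h.hexdigest()

class ActionCache:
    def __init__(self):
        self._store = {}  # action key -> output bytes

    def run(self, command, inputs, execute):
        """Return (output, was_cache_hit) for an action."""
        key = action_key(command, [digest(i) for i in inputs])
        if key in self._store:
            return self._store[key], True    # reuse cached work
        output = execute(inputs)             # cache miss: run the tool
        self._store[key] = output
        return output, False
```

Rerunning `cache.run(...)` with an identical command and identical input bytes returns the stored output and reports a hit; changing any flag or any input byte produces a new key and a miss.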
In such a model, we still want two things¹ to be true.
Correctness
If the build is going to re-use outputs when inputs are the same, we want to make sure that our compilers and other tools actually produce the same outputs for the same inputs. What “same” means here lies on a range, because not all tools are designed for this. At the very least you want the outputs to be functionally identical. That is, say you have inputs A and B, and a tool, represented by a pure function F(inputs...) -> output, where output is some executable.
hash(A): foxtrot
hash(B): tango
# First build
F(A, B) -> C
hash(C): whiskey
# Second build
F(A, B) -> D
hash(D): romeo
Imagine we execute C and D and they produce different results! Or, say C is a debug mode executable and D is one with optimizations, so one runs faster than the other. This is incorrect! Our function F did not produce the same outputs for the same inputs, which means F itself has some implicit configuration or state that is changing behavior. In the context of build systems, this usually means the build system didn’t treat the command line and compiler flags used as “inputs”. Bazel goes to great lengths in the build description to force you to very pedantically describe all these things, so that it can track all of them as inputs. A combination of toolchains, well-defined inputs and outputs, and sandboxing is used to enforce this. This usually means we don’t have to deal with such egregious correctness differences. That is, it is acceptable to have the same inputs produce slightly different hashes.
This will manifest in two ways. One, Bazel will use the cache: it sees the inputs are the same, and instead of producing D (hash: romeo), just gets C (hash: whiskey) from the cache. This is usually OK, but we should try to minimize it. The other way it can go is that Bazel decides to build locally, gets D, and now everything that depends on D is affected, slowing the local build.
Speed
In terms of build system classification, Bazel has a rebuilder that uses constructive traces to track the build. Roughly, this means that for a given action, only its immediate inputs affect the caching. So if we had a common two-step process for a C library:
$ gcc -c foo.c -o foo.o
$ gcc -o foo foo.o
Which is represented in Bazel roughly as:
cc_binary(
    name = "foo",
    srcs = ["foo.c"],
)
Bazel will have 2 actions:
compile action:
inputs: foo.c, hash: charlie
outputs: foo.o, hash: delta
link action:
inputs: foo.o, hash: delta
outputs: foo, hash: echo
Say we have a hypothetical compiler that produces a different foo.o for the same foo.c. That is, the hashes don’t match even though the outputs behave identically.
CI machine:
compile action:
inputs: foo.c, hash: charlie
outputs: foo.o, hash: delta
link action:
inputs: foo.o, hash: delta
outputs: foo, hash: echo
shared cache now has keys: {charlie, delta}
Developer machine:
compile action:
inputs: foo.c, hash: charlie
outputs: foo.o, hash: november (!)
link action:
inputs: foo.o, hash: november (cache miss)
outputs: foo, hash: zulu (local build)
That is a pickle! Say Bazel’s heuristics decide to compile locally instead of hitting the cache. We end up with foo.o hashing to november, which is not in the cache from the CI machine, so Bazel is forced to also run the link step locally. You can see how this can spread. In large builds, you could easily have tens of C files per library, some of which are built locally and some pulled from the cache, and every time a hash mismatch happens, the build system is forced to rebuild everything downstream in the build graph!
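The spreading effect can be modeled in a few lines of Python. This is a toy model, not Bazel’s algorithm: treat the build as a linear chain of actions and mark an action as a cache hit only if every hash upstream of it matched what CI produced. A single mismatched hash poisons everything downstream.

```python
def replay(chain, ci_hashes, local_hashes):
    """Mark each action in a linear chain as a cache hit or miss.

    chain: action names in dependency order, each consuming the
    previous action's output.
    ci_hashes / local_hashes: action name -> output hash produced by
    the CI machine and by this machine's toolchain, respectively.
    """
    hits = {}
    upstream_ok = True  # do all hashes so far match what CI saw?
    for action in chain:
        if upstream_ok and local_hashes[action] == ci_hashes[action]:
            hits[action] = True
        else:
            hits[action] = False
            upstream_ok = False  # poison every downstream action
    return hits
```

With the hashes from the example above, `replay(["compile", "link"], {"compile": "delta", "link": "echo"}, {"compile": "november", "link": "zulu"})` reports both actions as misses, even though only the compile step was nondeterministic.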
Well, that hypothetical compiler exists and it is called the Microsoft Visual Studio C/C++ compiler².
What we really want is for every command we run to produce truly identical outputs so we get maximum cache hits. The fact that identical hashes also mean truly reproducible builds is a nice side effect.
Other caching tools like ccache use similar algorithms.
Now, everything I’ve talked about so far is well known in certain circles. Let’s talk about how to actually address this on Windows!
Fix Date and time macros
The C standard provides two macros, __DATE__ and __TIME__, that are set to the time of compilation. If a source file actually uses these to set variables or build strings, that is terrible for our purposes, because every compiler invocation will lead to new values. We are forced to break these. I don’t really know of any libraries that use them to affect behavior.
For MSVC, we can override the macro definitions:
cl.exe /D__DATE__=CONSTANT /D__TIME__=CONSTANT …
Fix dates and times in Portable Executables
In addition, the PE format used by Windows for executables and DLLs has file headers with a prominent TimeDateStamp field. This is inserted by the linker. There is an undocumented flag, /Brepro, that causes the linker to put a fixed value in this field.
link.exe /Brepro …
There are a few more places in a PE that have timestamps. This includes the IMAGE_EXPORT_DIRECTORY and IMAGE_RESOURCE_DIRECTORY structures.
Fortunately, there is a nice tool called ducible that can be run on the file after linking to rewrite these bits with constant values.
Integrating ducible with your build process can be a little involved. The nicest way I know of for Bazel is to tweak the toolchain definition. This has the nice property of not requiring the rest of your build to know about ducible. Even custom rules that leverage cc_common to create actions will automatically benefit. If you are using a fixed toolchain configuration inspired by this Bazel example, you should replace the linker path with a custom target instead of a direct path to link.exe. This can be a batch script or similar that forwards all the linker options to link.exe, then runs ducible transparently before exiting. This is not the approach we use internally, since I only thought of it recently. Instead, we have the few places that require this be aware of ducible and add it to the Bazel actions, or stick it into tool wrappers. That is, our Rust compiler isn’t a direct call to rustc, but a wrapper script that does a bunch of things, one of which is running ducible.
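As a sketch of that wrapper idea (the script name and the assumption that link.exe and ducible.exe are on PATH are mine; as I understand it, ducible takes the image path as its argument), the forwarding logic might look like this. The command construction is kept as a pure function so it is easy to test.

```python
import subprocess
import sys

def linker_commands(argv):
    """Split wrapper arguments into a link.exe command and a ducible command.

    Pure function so the forwarding logic is testable; tool names are
    assumed to be resolvable on PATH.
    """
    out = None
    for arg in argv:
        if arg.upper().startswith("/OUT:"):
            out = arg[len("/OUT:"):]  # the PE that ducible must rewrite
    link_cmd = ["link.exe"] + list(argv)
    ducible_cmd = ["ducible.exe", out] if out else None
    return link_cmd, ducible_cmd

def main(argv):
    link_cmd, ducible_cmd = linker_commands(argv)
    subprocess.check_call(link_cmd)         # the real link
    if ducible_cmd:
        subprocess.check_call(ducible_cmd)  # then scrub timestamps in place

# The toolchain points at this script as the "linker", e.g.:
#   python link_wrapper.py /OUT:foo.exe /Brepro foo.obj
```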
Disable incremental builds
MSVC has an incremental build mode, where the linker adds extra information to files to allow faster subsequent builds. This causes changes in the hash. Since our build is powered by Bazel, incremental builds also don’t really help us, so disable them:
link.exe /INCREMENTAL:NO …
Deal with the PDBs
Program Database files (PDBs) are the real bane of our quest. We need them to debug software without shipping debug information to users, so we cannot simply disable the production of these files. Everything I’ve discussed above is relatively well known, but this I had to discover for myself.
PDBs have several problems:
- A PDB and a PE are linked so debuggers know exactly which PDB to use from a symbol server. PE files have a debug section with a timestamp and a signature identifying the PDB. These change every time. This is fixed by ducible.
- PE files encode the absolute path to the PDB by default.
- PDB files contain absolute paths to all the resources involved, such as object files. This means we need to ensure all paths are the same on every machine.
- Since PEs and PDBs are linked, changes to the PDB always change the PE. This means we need identical PDBs.
- PDB files have a build identifier, so even minor version updates in MSVC lead to hash differences.
Fix the paths
To fix #2, we can pass the /PDBALTPATH:%_PDB% flag to link.exe. This makes the linker encode just the filename instead of an absolute path.
To fix #3 and #4, we need to control where build trees are located on all machines. The first problem is that developers will naturally have files in different locations, because the typical storage location is somewhere in C:\Users\<user name> and <user name> is unique. If you are doing in-source builds, this will be a problem. If you are doing out-of-source builds, this is easier to fix by having your build system use a well-known location like C:\build instead of C:\Users\nikhil\path\to\repo\build. This way all absolute paths start with C:\build\... on any machine, fixing #3 to some extent.
Fortunately, Bazel always does out-of-source builds, so we can control where to place the build tree. In addition, external libraries and resources are also something to take care of. To handle both of these we need to change Bazel’s output user root and output base, both of which are usually calculated based on hashes of things and stored in the user home directory. The full details are on their website. Something like --output_user_root=C:/bazel and --output_base=C:/bazel/base is a good start³.
The other source of changing paths is temporary files. Depending on your specific build steps, it is very common to run tools that build something in a temporary directory. You will need to change this to use a deterministic location. Using the hash of some set of inputs is a good way to go. We often use a hash of the repository-relative location of files. There are some sources of temporary files you cannot easily remove, such as this one I found in rustc.
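A sketch of that hashing scheme in Python (the function name and the "tmp" subdirectory are made up for illustration): derive the scratch directory from the repository-relative path of the thing being built, so every machine computes the same location regardless of username or random temp names.

```python
import hashlib
import os

def scratch_dir(build_root, repo_relative_path):
    """Deterministic scratch directory for one build step.

    tempfile.mkdtemp() yields a random path that can leak into outputs;
    deriving the directory from the repository-relative path of the
    target gives the same location on every machine.
    """
    key = hashlib.sha256(repo_relative_path.encode("utf-8")).hexdigest()[:16]
    return os.path.join(build_root, "tmp", key)
```

Two machines building libs/foo get the same scratch path under C:\build, so any absolute path that leaks into an object file or PDB is identical everywhere.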
With all these changes you should be getting deterministic PEs and PDBs as far as input files go.
Pin the build version
The final bit I noticed was that PDBs have a major and minor build number in the DBI stream header. This can change between different versions of the same major toolchain, like Visual Studio 2017, so it is important to have CI machines and developer machines on the same update cycle. Some automated deployment strategy helps here, so every developer does not have to remember to manually update their installation.
As far as I’m aware this leads to bit-identical PEs and PDBs in my tests so far!
Useful tools
- pdbdump - Shipped as part of ducible, pdbdump is great for seeing the metadata in a PDB. This was what helped me see absolute paths in the PDB.
- dumpbin - Part of MSVC. dumpbin /all run on an EXE or DLL is a good way to see how headers and other metadata differ.
- A binary diff tool such as vbindiff. After you’ve exhausted all the metadata approaches, sometimes you just have to jump into hex and look at individual bytes. This is a small and free tool. There is also a limited, web version of diffoscope, but that tool itself does not work on Windows.
Further reading
- An introduction to deterministic builds with C/C++
- Microsoft is planning to introduce a /d1trimfile flag in the latest MSVC 2019 to address one area of absolute paths in PDBs.
1. Bazel’s tagline of {Fast, Correct} - Choose Two. ↩︎
2. To be clear, I’m not picking on MSVC here. GCC and Clang also come with caveats, but they’ve done a lot more to address these out of the box. ↩︎
3. You cannot put platform specific startup options in .bazelrc yet, so you may need a wrapper script around invoking bazel to add these flags on Windows. ↩︎