Mystery Knowledge and Useful Tools

Posted on Oct 5, 2020

Hillel Wayne has a great newsletter, and one recent post had this observation:

The abstract concept here is knowledge or skills that

  1. You are unlikely to discover on your own, neither through practice and reflection nor by observing others apply it.
  2. Once somebody tells you about it, you can easily learn and apply it.
  3. Once you can use it, it immediately gives you significant benefits, possibly to the point of raising your expertise level.

This might be a studied topic, but if it is I don’t know even what field of knowledge it belongs to, much less what it’s called. In the meantime I call it mystery knowledge.

I see this a lot at my job, where less experienced engineers struggle not because they lack fundamental knowledge or are “dumb”, but because they are simply not aware of the tools out there. To be clear, I am often in the same boat, like when someone pulls out Hopper or does magic in Hive, so I’m not claiming to be a paragon of tool enlightenment.

Debugging is all about eliminating hypotheses and reducing the problem space by obtaining useful information quickly. Given the complexity of our systems today, obtaining that information can get really difficult. Knowing which tools to reach for helps, and most of them are really easy to use once you know they exist and have an idea of what they do.

Consider this a living document of tooling “mystery knowledge” that I’ve built up over the years. Where possible, I’ve included examples of how I use each tool, to give you some sense of what to do with it. The list is calibrated towards desktop applications, as that is what I’ve done for most of my career. I think this knowledge is all the more valuable today, when a majority of developers come from a web/backend background, are primarily concerned with web browsers or Linux, and have less exposure to desktop development.

Tools whose names are in code font are command-line programs already installed on the operating system.

Text processing and code reading

  1. Your internal code search and IDE/editor search - Reading code will almost always prove useful. Read not just your own code, but also the code of the third-party programs you use, when it is available. It can help explain unintuitive behavior, and sometimes you might even find a bug where you don’t expect it. Semantic code search tools make it much easier to understand code.

  2. grep/ripgrep - A side effect of operating on text source files is that you often want to quickly find terms in those files. Both of these tools print lines matching a regular expression in the set of files passed to them. ripgrep is almost always better: it is recursive by default, respects VCS (version control system) ignore files, and colors its output.

  3. tail -f - Follow a file live as it grows. Most often used to watch a server log while you make requests to the server.

  4. awk - A full programming language for text processing, though I mainly use it as a glorified cut. My most common use is to take git status output, filter for modified (M) or deleted (D) files, and extract only the file names (see the sketch after this list).

  5. pbcopy (macOS) / clip (Windows) - Pipe text into these commands to copy it to the system clipboard. Really useful to bridge the command-line <-> GUI gap when you want to send command output to someone via Slack or add it to a doc.
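
To make the ripgrep, awk and pbcopy items concrete, here are hedged examples of the kind of one-liners I mean (the patterns and file names are made up):

```
# Recursively search the tree for a term, respecting .gitignore (ripgrep)
rg 'SOME_FEATURE_FLAG' src/

# List modified (M) or deleted (D) files from git status, keeping only the paths.
# Simplified: assumes --porcelain output and single-letter status codes.
git status --porcelain | awk '$1 ~ /^[MD]$/ { print $2 }'

# Send command output straight to the clipboard for pasting into Slack or a doc
git log --oneline -10 | pbcopy   # macOS; pipe to clip on Windows
```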

Files and file formats

One of the fun parts of being a desktop developer is routinely working with 3 different operating systems and their attendant idiosyncrasies. File formats are a big part of that, as each OS has its own executable and library formats, debug data formats and common compression formats. Knowing that these formats exist, and which tools can inspect them, goes a long way towards tracking down things like linker and compiler misconfigurations.

  1. file (POSIX) - Absolutely the first program to reach for when you don’t know what kind of file something is. It matches bits of the file against all sorts of known patterns and tells you what it thinks the file is. For example, many “custom” formats are just a zip file in disguise: browser extensions, certain document formats, Windows appx packages and so on.
  2. strings (POSIX) - A lot of “binary” files (like executables) still contain human-readable strings, and often those strings are all you care about; hunting down a custom parser for the format would be overkill. strings prints every human-readable string longer than a certain length that it finds. My most common uses are pulling certain information out of crash dumps, getting file paths out of DLLs, and confirming that a C string literal correctly made it into the final executable (particularly when it came from a macro). Note that most text editors can also show you the file and let you search through it; reach for that first. strings is better for pipelines or batch processing (see the examples after this list).
  3. ar (POSIX) - Browse the object files inside a static library. Useful for tracking down gnarly compiler bugs.
  4. nm (POSIX) - Allows you to inspect static and dynamic libraries and see what symbols they define, which ones they export and which ones they rely on other libraries for. Useful for all those Undefined symbol errors or verifying that your linker visibility flags are working.
  5. ldd (Linux) / otool -L (macOS) - Find out which dynamic libraries or frameworks your library or executable depends on. They can help investigate issues like rpath failures. We commonly use them at Dropbox to verify that our final builds are correctly linked against the libraries we want.
  6. dwarfdump (POSIX) - Show debug information in libraries and executables, as well as dSYM bundles. Also useful for matching executables against their dSYMs by extracting LC_UUIDs.
  7. readelf (Linux) / otool -l (macOS) / dumpbin (Windows) - Tells you all sorts of information about the executable formats (ELF, Mach-O and PE respectively) on each of these operating systems. This includes various headers that tell the OS how to load the program and which libraries it depends on. This can be used to verify that a build had ASLR (Address Space Layout Randomization) enabled for example, or to determine debug symbol UUIDs on Windows (similar to dwarfdump).
  8. diffoscope (cross platform) - Like diff, but for all kinds of file comparison, not just text. It can tell you which object file in a static archive is different for example. I’ve used it extensively to track down sources of non-reproducibility. There is a web version for small files, or you can run it locally on Linux.
  9. HexFiend (macOS) - A fast and lightweight hex editor. A hex editor is a program that can display any binary file as a sequence of bytes. It has a useful compare mode for viewing diffs of binary files. I’ve mostly used it to interactively explore crash dumps when writing tools to process them. It has other nifty features like allowing you to select ranges of bytes and then interpret them as numbers. It also reminds me a lot of the old times when desktop apps were fast and tiny and a joy to use.
  10. pdbdump (Windows) - A tool that ships with ducible. It can tell you a bunch of information about PDB files, including which absolute paths they refer to. Also used for tracking down reproducibility issues.
  11. Minidump Explorer (Windows) - Explore Windows/Crashpad crash dumps (minidump files) which are basically a set of nested structs in serialized form.
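
Some hedged examples of the invocations above (the file names are hypothetical, and flags differ slightly between the GNU and BSD versions of these tools):

```
# What is this mystery file? It may just be a zip in disguise.
file extension.crx

# Pull human-readable strings of at least 8 characters out of a binary
strings -n 8 MyApp.dll | grep -i 'c:\\'

# Which symbols does a library define, and which does it expect from elsewhere?
nm -g libfoo.a    # external (visible) symbols
nm -u libfoo.so   # undefined symbols, i.e. what the linker must find elsewhere

# Which dynamic libraries does this binary depend on?
ldd ./myapp           # Linux
otool -L ./MyApp      # macOS

# Match an executable to its dSYM by comparing UUIDs
dwarfdump --uuid MyApp MyApp.dSYM
```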

Observing processes

  1. Process Explorer (Windows) - A better Task Manager. For me, its most powerful feature is showing a process' open files and handles. I used this to track down a very odd issue where Bazel was failing to delete certain files. I first used Process Monitor to determine which directory Bazel was trying to delete, then used Process Explorer to determine which process was holding a file in it open. That turned out to be Python holding the win32api DLL open, which made no sense because the code was not directly importing any win32 libraries. I inserted pdb halts at various import boundaries, using Process Explorer to track which DLLs were loaded after each import. Using some semblance of binary search, I was able to track down that pkg_resources will import win32api to get some information if it is available in the path! A similar non-GUI tool on POSIX is lsof (see the examples after this list).

  2. Process Monitor (Windows) - Really, one should just become familiar with all the SysInternals tools. This tool tracks all file and registry operations. Where Process Explorer shows a live view, Process Monitor collects a log, and allows you to filter the output by all sorts of selectors. It would have been extremely difficult to diagnose and reproduce Bazel bug #12033 without it. I’ve written up how to observe the bug, including screenshots of using Process Monitor.

  3. pmap (Linux), vmmap (macOS), VMMap (Windows) - Get a bird’s eye view of a process' memory usage. Unlike valgrind and similar tools, which only instrument heap-allocated memory, these show the entire virtual memory usage, split by heap, stack, memory-mapped files, libraries and so on. Of course, they only show a current snapshot and won’t let you track the cause of leaks, but they can identify a leak's presence in a lightweight way, with no custom instrumentation or code changes. I’ve used this to understand application memory usage and identify the biggest areas for improvement. They can also help with allocators like pymalloc, which explicitly uses virtual memory directly for small object allocations and is thus invisible to heap allocation trackers.

  4. sample (macOS) - Can be run via Activity Monitor or the command line. It observes a process for a few seconds and collects stack traces of every thread, aggregated by count. This gives you a general sense of where the threads are spending time during that period. Unfortunately its output is not very easy to read, but flamegraph tools can render it as an image.

  5. valgrind + kcachegrind (Linux) - Valgrind is a suite of tools, and possibly the most sophisticated open-source instrumentation framework for software developers (as opposed to security professionals). It essentially runs programs while sitting between them and the CPU and OS, instrumenting all sorts of things, from CPU usage to memory allocations. I’ve never had to use it beyond the very basics, so I’m not qualified to talk about it at length, but you can hear straight from one of the horses' mouths. I’ve only used its call profiler (and visualized the output with kcachegrind) to understand the expensive parts of ninjars.

  6. memory allocator tooling - Every platform’s memory allocation mechanisms (which implement malloc(3) and free(3)) provide useful environment variables and tools to understand how your program is using memory. These lie somewhere between vmmap and valgrind in that they provide more granular info without affecting performance as badly. Read the documentation.

  7. dtrace (macOS) - DTrace is another one of those ridiculously powerful ideas that arose in Solaris and was largely ignored by everyone else for a long time; Event Tracing for Windows was probably the earliest comparable facility in a widely used OS. macOS adopted DTrace in Leopard (10.5), while Linux is only now getting eBPF to allow similar things (strace has existed forever, but it is a shadow of what DTrace can do). DTrace is an operating-system-level instrumentation facility with very low overhead and very powerful introspection capabilities into the OS and processes. There are thousands of available probes (things you can instrument) on macOS, and Python ships with its own probes since 3.6. I’ve used dtrace to track down a syscall misunderstanding bug. One of the authors is a prolific source of knowledge. Note that using dtrace on macOS requires disabling System Integrity Protection.

  8. UIforETW and Windows Performance Analyzer (Windows) - Windows has a system instrumentation framework similar to dtrace called Event Tracing for Windows (ETW). While powerful and with good visualization tooling, it is difficult to use. Fortunately Chrome and Windows guru Bruce Lawson has written a nice tool - UIforETW - that makes it significantly easier. Press a few buttons, run your problem program, then inspect the trace at your leisure. He has amusing posts about using it to find all sorts of bugs. I’ve mainly used it to track heap allocations. Start here.

  9. Instruments.app (macOS) - A front-end to a bunch of macOS performance tooling; you will most commonly use it for CPU and memory profiling. While Instruments can be janky at times, it is powerful, and I’ve used it to track memory allocations on more than one occasion. It also provides a vmmap snapshot viewer.

  10. Chrome Trace Viewer (Chrome/Chromium) - A part of Chrome’s developer tools, the Trace Viewer’s original purpose is to display timing traces from the browser itself, to track down slow parts of pages and scripts. Because the trace format is simple JSON, it has become a de facto viewer for a lot of other tools: everything from pprof to Bazel’s internal profiler can output execution traces in a format the Trace Viewer can load. Go to chrome://tracing and hit Load to load a profile. It is useful to know this exists and to play around with it a little.

  11. DebugView (Windows) - See debug output (OutputDebugString messages) from GUI apps that don’t write to stdout/stderr.
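
For the command-line tools in this section, these are the invocations I reach for most (PIDs and paths are placeholders):

```
# Which process has this file open? (the POSIX analogue of Process Explorer's handle search)
lsof /path/to/locked/file

# Bird's-eye view of a process' virtual memory, split by region type
pmap -x <pid>    # Linux
vmmap <pid>      # macOS

# Sample every thread's stack in a process for 10 seconds (macOS)
sample <pid> 10

# Count syscalls made by one process (macOS; requires SIP to be disabled)
sudo dtrace -n 'syscall:::entry /pid == $target/ { @[probefunc] = count(); }' -p <pid>
```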

Debuggers

It is really important to have at least a basic understanding of platform debuggers.

  1. lldb (cross platform, installed on macOS with Xcode) - I’m not a huge fan of command-line debuggers due to poor discoverability and loss of visualization. That said, lldb (and gdb) has some redeeming points, like Python scriptability (although the API documentation could use some love and guides). The proof of concept showing that we could extract Python stack frames from native crashes was an lldb Python script that I would launch after setting a breakpoint at PyEval_EvalFrameEx (see the sketch after this list).

  2. Visual Studio (Windows) - A much better debugger. It can show you addresses, local variables and data structures in reasonably intuitive ways. In addition, with symbol server integration (which you absolutely should be using for any of your code), you can debug any build of your program without needing to build the symbols for it locally.

  3. Browser devtools - These let you do a bunch of powerful things. I don’t directly use them for development, but I use them all the time for figuring out why a website won’t load.
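
As an illustration of the lldb item, a minimal session along the lines of the proof of concept described above (the breakpoint target and script name are just examples):

```
$ lldb -p <pid>                              # attach to a running process
(lldb) breakpoint set --name PyEval_EvalFrameEx
(lldb) continue
# ...once the breakpoint hits:
(lldb) bt                                    # backtrace of the stopped thread
(lldb) command script import my_script.py    # run your own Python inside the debugger
```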

Observing the environment

  1. ls and its variants - What? ls? Isn’t that the simplest thing around? Why mention it? Because roughly 10% of problems can be identified by simply running ls -l before, during or after a command, or within a program (e.g. os.listdir() in strategic places). Seriously, aren’t half of all day-to-day annoyances files not being found? Drop in an ls and see whether your assumptions are correct.

  2. DaisyDisk (macOS), WinDirStat (Windows) - Tools to visualize how much space each directory in a directory tree occupies. WinDirStat in particular has a really cool TreeMap visualization that goes all the way down to the leaf directories and visualizes them. Very useful for deciding where to get the biggest savings in disk space.

Observing the network

  1. Wireshark - A packet sniffer and Swiss Army knife of network protocols, Wireshark can help you understand all sorts of traffic going over the network. Even in this day of HTTPS everywhere, it can still be useful to determine if a program is even making a connection.

  2. Network Link Conditioner (macOS) - A macOS preference panel that can be installed from Xcode, it allows you to simulate slow DNS, slow networks, packet drops and other interesting misbehaviors. Useful for testing how your jazzy webview wrapping “desktop app” is going to function on an unreliable network. I used it extensively to test various code paths in the Dropbox desktop client Happy Eyeballs implementation. I’m sure similar tools exist for other platforms.

Tools to write tools

  1. Python’s struct module - Part of the standard library, this is extremely useful for serializing and deserializing simple binary data. I often use it for ad-hoc analysis of binary files or network bytes. For example, I’ve written some code to extract very specific bits of minidump files; a sketch of that kind of parsing follows this list.

  2. Your language’s I/O and environment libraries - Do you know how many problems can be figured out by dumping all the environment variables that are set at a specific point in your program? A lot. Are you in a tricky situation where you can’t write to stdout or stderr because it is too noisy? Change your relevant code to simply print things out to a particular file. In Python this is super easy. You just add file=<file object> as a keyword argument to any print()s.
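
As a sketch of what struct-based spelunking looks like, here is a minimal reader for the start of a minidump file; the field layout follows the published MINIDUMP_HEADER structure, and the file name is made up:

```
import struct

# A minidump begins with a 4-byte "MDMP" signature, a 4-byte version,
# and a 4-byte count of streams, all little-endian.
with open("crash.dmp", "rb") as f:
    header = f.read(12)

signature, version, num_streams = struct.unpack("<4sII", header)
assert signature == b"MDMP"
print(hex(version), num_streams)
```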

Sleep

I don’t mean sleeping on it (though that helps), but rather putting the thread or process to sleep (std::thread::sleep() in Rust, time.sleep() in Python and so on). Half of all debugging challenges arise because computers are too fast, or because they are not deterministic. If you suspect a race condition and have a hypothesis about how it happens, you don’t need to do a million runs: insert some strategic sleeps in threads and processes to get them to sequence exactly how you want. Sleeps also help when you are not sure exactly when to attach a debugger, or when the condition you care about is too hard to express as a breakpoint condition. Another time I use them is when I need to run some monitoring tool, but starting the tool at the beginning of the program run would force me to do a lot of filtering and processing to get down to the useful bits. Instead:

  1. Make both processes/threads sleep for several minutes right before the relevant code begins. Or even better, have them block on stdin and resume when you press enter.
  2. Run them and let them hit this block.
  3. Now you have time to start your debugger or other tools and begin observing things.
  4. When you’re done, let the processes continue execution. (A sketch of the pause step follows.)
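
A minimal Python sketch of the pause trick; the helper name is mine, not a standard API:

```
import os
import sys
import time

def pause_for_debugger(label):
    # Print enough information to attach a debugger or monitoring tool,
    # then block until someone presses enter.
    print(f"[{label}] pid={os.getpid()} paused, press enter to continue",
          file=sys.stderr)
    sys.stdin.readline()

# In the suspect code path:
pause_for_debugger("before racy section")

# Or force a suspected ordering between threads with a strategic sleep:
time.sleep(5)
```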

Specific tools

This is knowledge relevant to very specific tools used in my jobs or domains, and it won’t apply outside them.

Bazel

  1. Execution logs - Bazel can write detailed logs about which actions were executed and which ones were retrieved from cache. This can be very useful when investigating reproducibility and figuring out why caches aren’t being hit.

  2. Profiles - Bazel is one of the tools that can generate Chrome Trace Viewer-compatible timing profiles of both its own internals and of command execution time. This is very useful for tracking down the slowest parts of the build (example invocations after this list).
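
Something along these lines; flag names have shifted between Bazel versions, so check bazel help build for yours:

```
# Log which actions executed and which were cache hits
bazel build //... --execution_log_json_file=/tmp/exec.json

# Write a timing profile, then load it via chrome://tracing
bazel build //... --profile=/tmp/build_profile.gz
```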

Rust

rustc has a very good self-profiling system that can be used to track down which parts of your build are slow and why. We used it to discover that the Windows linker (link.exe) is excruciatingly slow compared to LLVM’s lld, and switching got us a great speed bump.
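
At the time of writing this is nightly-only. A sketch of one way to invoke it; the summarize tool lives in the rust-lang/measureme repository, and the output file name will vary:

```
# Produce per-crate .mm_profdata files while building
RUSTFLAGS="-Z self-profile" cargo +nightly build

# Summarize where rustc spent its time for one crate
summarize summarize mycrate-12345.mm_profdata
```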

Python

  1. sys.path - Half of Python import failures can be solved by simply printing sys.path at appropriate places and observing where Python is trying to load modules from. Did you know that running Python with -vv makes it print every path searched while looking for a module, and where the module was eventually found?
  2. pdb - The Python debugger, not to be confused with Windows PDB symbol files. You can insert pdb.set_trace() or breakpoint() (since Python 3.7) to get dropped into an interactive debugger (small example after this list).
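
A small, purely illustrative sketch of both items:

```
import sys

# Where is Python actually looking for modules? Print it at the point of failure.
print("\n".join(sys.path))

# Drop into an interactive debugger right here
breakpoint()                     # Python 3.7+
# import pdb; pdb.set_trace()    # older versions
```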

Bisection

When all else fails, turn to bisect in your VCS. Sometimes, depending on the nature of the bug, you may want to use this as your very first defense.

Bisect is immensely useful for action-at-a-distance bugs where you can’t just read code or commit messages and know exactly where the problem lies. As an example, Crashpad became unable to obtain certain information from crash reports, even though none of our crash reporting code had changed. It would only manifest in release builds, not in development builds. Now, the canonical bisect workflow will tell you to run the test at each step, and you may be inclined to start the bisection, create a build, test and repeat. Except creating full release builds can take a while.

Nothing says you have to run the actual test during the bisect; the bisect only exists for you to tell the VCS whether a commit was good or bad. Instead, I narrowed the problem down to a span of releases (much faster, since we knew from our crash reporting service which release changed). Then I queued up a release build job for every commit in the range on our CI system. That sounds inefficient! It is, but computers are good at this stuff, and they are fast and they are cheap. You don’t want to be kicking off a hundred builds every day, but you can kick them off every once in a while and nobody will notice.

Once all of them were ready, I started the actual bisect, where I could do the relatively manual job of installing that version, making it crash and checking if the bug happened. Then I’d tell git if it happened or not, and it would tell me which of my builds I should go test next. The problem turned out to be a linker flag change.
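
For reference, the manual session looks roughly like this (the revisions are placeholders):

```
# Mark the two ends of the suspect range
git bisect start
git bisect bad  v1.42.0
git bisect good v1.38.0

# git checks out a midpoint commit; install the prebuilt build for that
# commit, try to reproduce the bug, then report what you saw:
git bisect good    # or: git bisect bad

# Repeat until git names the first bad commit, then clean up
git bisect reset
```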