Sampling profilers are useful tools for performance analysis of programs. I’ve spent a lot of time over the past several months digging into various implementations of sampling profilers and reading a lot about symbolication, threads, stack unwinding and other gory operating system details. This series of posts attempts to summarize my understanding.
- Suspending threads
- Stack unwinding
- Presenting profile output
- Extending the profiler to managed languages
High CPU usage is a problem that comes up often in widely used software, and a profiler of some sort is indispensable for investigating the cause. Yet profilers remain a fairly arcane corner of programming knowledge: they require detailed understanding of a wide variety of low-level operating system and program execution details. You must know exactly what an instruction pointer does, how symbols are stored, how each OS provides thread information and so on. I wasn’t able to find any comprehensive take on all of those concepts; my understanding was gleaned from reading actual profiler source code and from posts covering bits and pieces of the mechanisms. I hope this series fixes that.
This series requires basic understanding of:
- Program layout - You should understand how a program is laid out in memory, the general format of executables and the concept of shared libraries and symbol tables.
- Threads - You should know what threads are, how processes have multiple threads and a basic understanding of thread creation and deletion.
- Program execution - You should know what the stack is and how function execution proceeds using a combination of the stack and certain CPU registers, in particular the instruction pointer, frame pointer and stack pointer.
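As a refresher on that last point, here is an illustration in Python standing in for the machine-level mechanism: every active call gets a frame, and walking the chain of live frames recovers the call stack, which is exactly what a profiler's stack capture produces. (Python frames are an analogy here; a native profiler walks real stack memory using the registers above.)

```python
import inspect

def grandchild():
    # Walk the live frames, innermost first -- analogous to unwinding
    # the machine stack starting from the current instruction pointer.
    return [f.function for f in inspect.stack()]

def child():
    return grandchild()

def parent():
    return child()

print(parent()[:3])  # ['grandchild', 'child', 'parent']
```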
You don’t need to be a master, so don’t worry about having to read 5 new things before continuing.
We will use simple assembly to illustrate certain points, but it shouldn’t be hard to follow even if you have never written assembly yourself.
I will be using Linux for the sample code, as that is what my personal machine runs, but I will also point out how to achieve the same on Windows and macOS.
vignette is an implementation of everything we will talk about in this series. It is a sampling profiler for native code. It is written in Rust and has been tested on Linux. I’ll add support for other OSes later[1]. Patches are certainly welcome!
Full instructions to use it are in the README, and it contains several examples that exercise the parts we are going to look at.
A truly production-ready profiler has to deal with a lot of things, some of which are very specific to particular platforms or circumstances. I won’t be covering all the bases here. In addition, such profilers capture a lot more than just the samples, such as durations and special markers added by the program, and they certainly try to be very efficient. vignette is currently written to be easy to understand and to focus on the parts important to this series, so it will not deal with all these edge cases.
What is a sampling profiler?
A sampling profiler runs every N time units and captures the stack running in each thread at that point. The resulting output is usually some combination of the counts and times of the observed stacks. With frequent samples over a long enough time, the output will reflect how the program itself is spending time.
The other kind of profiler is the tracing profiler, which uses either manual hooks or language facilities to instrument (usually) every function call entry and exit. While tracing profilers give extremely accurate results due to this granularity, they add significant performance overhead, so they may not reflect the true program execution times, or may be too heavyweight to run in all scenarios.
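To make the contrast concrete, here is a sketch of a tiny tracing profiler using Python's `sys.setprofile` hook, which the interpreter invokes on every function entry and exit. The per-call hook is exactly where tracing's overhead comes from; `fib` here is just a stand-in workload.

```python
import sys
from collections import Counter

call_counts = Counter()

def tracer(frame, event, arg):
    # Invoked by the interpreter for every call/return event --
    # this per-call hook is the overhead that tracing adds.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(tracer)
fib(10)
sys.setprofile(None)

print(call_counts["fib"])  # 177 -- every single call was observed
```

A sampling profiler, by contrast, would only see `fib` on the stack a handful of times, but at near-zero cost to the program.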
The steps to a sampling profiler
At heart, a sampling profiler is simple:
```
samples = SomeDataStructure()
# Usually run for a few seconds.
while running:
    1. Sleep for a few milliseconds
    2. For every thread in the process:
        3. Suspend thread
        4. Capture current thread stack in samples
        5. Resume thread
6. Resolve symbols
7. Write output in a suitable format for consumption
```
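The loop above can be sketched in pure Python, where `sys._current_frames()` stands in for the suspend/capture/resume steps (the interpreter hands us a consistent snapshot; a native profiler has to suspend each thread and unwind its stack itself, which is what the rest of this series covers):

```python
import sys
import threading
import time
from collections import Counter

samples = Counter()

def sampler(stop, interval=0.005):
    while not stop.is_set():
        time.sleep(interval)  # 1. sleep a few milliseconds
        # 2-5. capture every thread's stack in one snapshot
        for tid, frame in sys._current_frames().items():
            stack = []
            while frame is not None:
                # In Python, "symbolication" (step 6) is free: frames
                # already carry function names.
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples[tuple(reversed(stack))] += 1

def busy():
    # A stand-in workload to be observed by the sampler.
    deadline = time.monotonic() + 0.2
    while time.monotonic() < deadline:
        sum(range(1000))

stop = threading.Event()
t = threading.Thread(target=sampler, args=(stop,))
t.start()
busy()
stop.set()
t.join()

for stack, count in samples.most_common(3):
    print(" -> ".join(stack), count)
```

The most common stacks printed at the end should have `busy` as their innermost frame, which is the essence of the technique: frequent snapshots approximate where time is spent.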
Retrieving the thread stack gives us a list of instruction locations that were being executed by the thread. The Resolve symbols step translates these locations to function names so we can get a human-readable understanding of where the program is spending time. I will often use the term symbolication for this step as it flows better when used as a verb. It is not a real word.
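The core of symbolication is a range lookup: given a sampled instruction location, find the function whose address range contains it. Here is a minimal sketch against a toy symbol table; the addresses and names are made up for illustration, and real tables come from the binary's symbol and debug sections, which we will cover later.

```python
import bisect

# Toy symbol table: (start_address, name), sorted by start address.
# These entries are invented purely for illustration.
symbols = [
    (0x1000, "main"),
    (0x1200, "parse_args"),
    (0x1480, "run_loop"),
    (0x2000, "render_frame"),
]
starts = [addr for addr, _ in symbols]

def symbolicate(address):
    # Find the last symbol whose start address is <= the sampled
    # instruction location.
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return "<unknown>"
    return symbols[i][1]

print(symbolicate(0x14F0))  # run_loop
print(symbolicate(0x0500))  # <unknown>
```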
Finally, we will be capturing reams of data during a profiling run. This needs to be translated to nice looking call trees or graphs. Fortunately there are already a lot of tools out there for viewing profiles or creating nice graphs out of them. We will spend some time looking at some of them and getting our profiler to produce such output.
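One simple interchange format those tools accept is the "folded" stack format used by Brendan Gregg's FlameGraph scripts: one line per unique stack, frames joined by semicolons, followed by a sample count. Producing it from aggregated samples is a few lines (the sample data here is invented for illustration):

```python
from collections import Counter

# Aggregated samples: each key is a stack, outermost frame first.
# Counts are made up for this example.
samples = Counter({
    ("main", "parse_args"): 1,
    ("main", "run_loop"): 3,
    ("main", "run_loop", "render_frame"): 12,
})

def to_folded(counts):
    # One line per unique stack: frames joined by ';', then the count.
    return "\n".join(
        f"{';'.join(stack)} {n}" for stack, n in sorted(counts.items())
    )

print(to_folded(samples))
# main;parse_args 1
# main;run_loop 3
# main;run_loop;render_frame 12
```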
In-process and out-of-process profilers
Out-of-process (or remote) profilers use OS facilities to attach to the program to be profiled, similar to debuggers. They have to remotely interact with the process and read the profiled process’s memory. These are usually what developers will use locally as they are sold independently by vendors or come with the operating system. All 3 common consumer OSes have high quality sampling profilers built-in.
- Windows has the Event Tracing for Windows framework.
- macOS has the Instruments GUI app and the command-line program `sample`, which can attach remotely.
- Linux has accumulated a whole set of tools over the years. Common ones include callgrind, gprof and perf.
In-process (or local) profilers are libraries that are statically or dynamically linked to the program that will be profiled, and usually controlled through some API. They have direct access to full process address space and are generally easier to implement. Examples of these are the profilers built into Chromium and Firefox (Gecko).
These profilers are really useful for widely used desktop software as asking users to attach external profilers is unreasonable. In addition, built-in profilers allow measuring granular actions. You may only want to profile startup, or only profile a fibonacci calculation done very rarely.
Desktop software often also ships in a release configuration, which does not have debugging symbols in the executable. This saves space and download costs. We will go deep into symbolication in a later post, but this means that such profilers need some way to access those symbols externally.
vignette falls into the latter camp. It does not perform symbolication at profile gathering time, and it does not require a binary to have debug symbols. The initial profile output from vignette is useless for a human.
There are other advanced tools that use other mechanisms than sampling, such as CPU performance counters, dynamic instrumentation and so on that I will not cover.
The first task a profiler needs to perform is to enumerate the threads of the process, and suspend and resume them.
- Introducing JSC’s new sampling profiler
- Sampling vs Tracing
- How debuggers are implemented - a sampling profiler is mostly a really fast, very inflexible, debugger.
1. Mostly a function of not having immediate access to other platforms, rather than a lack of knowledge.