Sampling Profiler Internals: Suspending Threads

This is part 2 of the Sampling Profiler Internals series.

  1. Introduction
  2. Suspending threads
  3. Stack unwinding
  4. Symbolication
  5. Presenting profile output
  6. Extending the profiler to managed languages

As described in the introduction, a sampling profiler captures the stack of each thread every few milliseconds. To do this, it is preferable to suspend the thread. We don’t want the stack changing under us while we capture it. Even worse, a thread could terminate while we are attempting to profile it and invalidate all the memory we want to read. It is safer to suspend it and resume it once the capture is done.

Pitfalls of suspending threads

This is probably the most important lesson of building an in-process profiler. You can do most other things wrong and at least get some output, but if you ignore the advice here, you may not get output at all and will likely make the program misbehave. There are three things your profiler CANNOT do while it has a thread suspended:

  1. You MUST NOT forget to resume the suspended thread - failing to do this would prevent programs from making progress.
  2. You MUST NOT suspend the sampling thread itself - doing this will prevent the profiler from making progress.
  3. You MUST NOT allocate memory or acquire locks on the sampler thread that other threads have access to. When a thread is suspended, it may be holding locks, making assumptions about certain memory locations and so on. Many locks are created per-process by various platform APIs and operated on behind the scenes, including in things like printf and malloc! Let’s say the thread acquired the allocator lock and then was suspended. If your sampling thread attempts to allocate memory, it will block on the allocator lock and your program will deadlock! This means any memory we need for storing data about the thread stack (which will be covered in Part 3) must be pre-allocated before we suspend the thread. You cannot dynamically resize it while any thread is suspended.
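
To make the third rule concrete, here is a minimal sketch of a sample buffer that is reserved up front, so that recording frames never calls malloc or takes a lock. The names (stack_sample, sample_push, MAX_FRAMES) and the 256-frame limit are assumptions made for illustration, not part of any particular profiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the stack depth we are willing to record. */
#define MAX_FRAMES 256

typedef struct {
    uintptr_t frames[MAX_FRAMES]; /* program counters, innermost frame first */
    size_t    depth;              /* number of valid entries in frames */
    int       truncated;          /* set when the stack was deeper than MAX_FRAMES */
} stack_sample;

/* Static storage: reserved at program start, never resized. */
static stack_sample g_sample;

/* Call before suspending a thread. */
static void sample_reset(stack_sample *s) {
    s->depth = 0;
    s->truncated = 0;
}

/* Safe to call while a thread is suspended: no allocation, no locks.
   Returns 1 if the frame was stored, 0 if the buffer is full. */
static int sample_push(stack_sample *s, uintptr_t pc) {
    if (s->depth == MAX_FRAMES) {
        s->truncated = 1; /* drop the frame instead of growing the buffer */
        return 0;
    }
    s->frames[s->depth++] = pc;
    return 1;
}
```

Dropping frames once the buffer is full loses information on very deep stacks, but it keeps the sampler from ever touching the allocator while another thread is stopped.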

Finally, suspending a thread does have a performance penalty. First, it slows down the program. We try to minimize this by keeping our stack collection as fast as possible. Second, pausing the thread may force the OS to context switch to get another thread going. There isn’t much we can do about this.

Suspending and resuming threads is one of the easier parts of profiling. All three OSes covered here (Windows, macOS and Linux) provide a way to suspend threads of the current process.

Sampling all threads or just registered threads

A sampling profiler can choose to sample every thread in the application, or only certain threads. The latter solution is good if you want to offer selective profiling, like browsers often do for web pages. It can also improve the performance of the profiler.

Selective profiling is achieved by having interested threads call some function to register themselves with the profiler.
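
A registration function of this kind could look roughly like the sketch below. It uses POSIX threads for brevity, and every name in it (profiler_register_current_thread, profiler_snapshot_threads, MAX_REGISTERED) is hypothetical. The registry lock is exactly the sort of lock the sampler must not touch while a thread is suspended, so the sketch assumes the sampler copies the list before it suspends anything.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical fixed capacity, so registering never needs to allocate. */
#define MAX_REGISTERED 64

static pthread_t       g_threads[MAX_REGISTERED];
static size_t          g_thread_count;
static pthread_mutex_t g_registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by a thread that wants to be profiled.
   Returns 0 on success, -1 if the registry is full.
   This sketch does not check for duplicate registrations. */
int profiler_register_current_thread(void) {
    int rc = -1;
    pthread_mutex_lock(&g_registry_lock);
    if (g_thread_count < MAX_REGISTERED) {
        g_threads[g_thread_count++] = pthread_self();
        rc = 0;
    }
    pthread_mutex_unlock(&g_registry_lock);
    return rc;
}

/* Called by the sampler BEFORE it suspends any thread.
   Copies up to cap registered threads into out and returns the count. */
size_t profiler_snapshot_threads(pthread_t *out, size_t cap) {
    pthread_mutex_lock(&g_registry_lock);
    size_t n = g_thread_count < cap ? g_thread_count : cap;
    for (size_t i = 0; i < n; i++) {
        out[i] = g_threads[i];
    }
    pthread_mutex_unlock(&g_registry_lock);
    return n;
}
```

A real profiler would also need a matching unregister call for threads that exit; that is left out here.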

Here we will stick to sampling every thread to keep the profiler logic simpler.

Windows

Windows is probably the simplest to suspend and resume, but also the most annoying to iterate over, because there isn’t an API to only iterate the threads of a given process. This means you end up iterating over every thread in the system and discarding the ones you don’t care about, which is inefficient. A profiler that only cares about registered threads and stores their HANDLEs in a list will do better.

First, one uses the CreateToolhelp32Snapshot function to obtain a snapshot of all running threads. Then, the Thread32First and Thread32Next functions can iterate over this snapshot and obtain thread information. MSDN has a code sample about using the thread iteration APIs, so it should be clear. We can compare the thread’s th32OwnerProcessID with GetCurrentProcessId() to restrict to threads from our process.

Once we have a thread ID, we obtain a handle to it using the OpenThread() function. Then we call SuspendThread(), walk the stack and call ResumeThread().

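HANDLE thread_handle = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT, FALSE, te32.th32ThreadID);
if (thread_handle == NULL) {
  // the thread may have exited since the snapshot was taken
  return;
}

// SuspendThread and ResumeThread return (DWORD)-1 on failure
if (SuspendThread(thread_handle) == (DWORD)-1) {
  CloseHandle(thread_handle);
  return;
}

// walk the stack

if (ResumeThread(thread_handle) == (DWORD)-1) {
  // at this point we probably want to crash the program as this is a bad state to be in!
  abort();
}
CloseHandle(thread_handle);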

macOS

(function names link to documentation)

On Mac, thread suspension requires using the Mach subsystem. Mach calls within the same process are generally unrestricted, so no security measures need to be disabled.

Start by getting a handle to the process itself using mach_task_self(); we can then obtain a list of its threads using the task_threads function.

The Mach structures and APIs are not always well documented, and often fiddly, so it is best to see working code from other projects. psutil has an example.

task_threads will return an array of thread_act_t ports, which can be passed to the suspend and resume functions directly.

The thread_suspend() and thread_resume() functions do what they say. Gecko and Chromium both use these functions. Incidentally, Mach also offers a thread_sample() function which will sample and write out PC values to a port (a queue). This is pretty cool, but I’ve not seen it used in practice.

Both Windows and macOS keep a suspend count per thread: every suspend call increments it, and the thread only runs again once the count drops back to zero. It is important to call resume as many times as suspend is called!

Linux

To obtain a list of threads for a process, it is easiest to use the proc filesystem. The /proc/self/task directory has subdirectories for every thread, identified by the kernel task ID. We can just iterate over these.
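
Walking that directory could look like the sketch below. The function name list_thread_ids and the caller-supplied array are assumptions made for illustration. Note that opendir may allocate memory, so this belongs between samples, not at a point where another thread is being held.

```c
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>

/* Fills tids with up to cap kernel task IDs of the current process.
   Returns the number stored, or -1 if /proc/self/task cannot be opened. */
int list_thread_ids(pid_t *tids, int cap) {
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL) {
        return -1;
    }

    int n = 0;
    struct dirent *entry;
    while (n < cap && (entry = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(entry->d_name, &end, 10);
        /* Skip "." and ".." and any other name that is not purely numeric. */
        if (end == entry->d_name || *end != '\0') {
            continue;
        }
        tids[n++] = (pid_t)tid;
    }

    closedir(dir);
    return n;
}
```

The list is only a snapshot: a thread can exit right after it was read, so sending it a signal later has to tolerate failure.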

Suspending and resuming threads is really involved on Linux. One has to use a complicated set of synchronization primitives combined with signals. I’ve linked to the vignette implementation throughout. I’ll use the term “sampler thread” for the thread the profiler is running on, and “sampled thread” for the thread we are interested in profiling.

  1. Set up a process-wide signal handler for the SIGPROF signal. We have to pass the SA_SIGINFO flag to use the 3-argument handler. This lets us access the ucontext_t param required for unwinding later.
  2. Set up a series of semaphores for synchronization with the sampled thread.
  3. Send a SIGPROF to the sampled thread. Unlike the kill(2) function, which delivers the signal to an arbitrary thread of the process, tgkill(2) lets us target a specific thread, so that is what we use.
  4. Great! When the sampled thread re-enters userspace, it will receive the signal, and the signal handler will be invoked. The sampler thread uses the first semaphore (msg2 in vignette) to block until the signal handler acknowledges it.

When the signal handler runs in the context of the sampled thread, the original operation of the thread is now suspended. When the signal handler exits, it will be resumed. We use a combination of semaphores to communicate with the sampler and only resume after we have the information we need.

  1. Within the handler, we first copy the context that we will need later. We use msg2 to notify the sampling thread that we have a context. The sampled thread waits on msg3 and is effectively suspended.
  2. On the sampler thread, we use the context to walk the stack. Then we notify msg3 so the sampled thread can resume itself. We wait on msg4 to be absolutely sure the sampled thread is resumed. This is required because we have shared state and shared semaphores, so we cannot move on to the next thread until both ends have finished! If we were to concurrently send a signal to another thread, its handler could run in its entirety and leave all our semaphores in states we cannot predict.
  3. The sampled thread simply notifies msg4 since it doesn’t have anything to do.

As a diagram:

  Sampler thread              Sampled thread
  --------------              --------------
  send SIGPROF    -------->
  wait on msg2
                              SIGPROF received
                  <--------   notify msg2
                              wait on msg3
  walk the stack
  notify msg3     -------->
  wait on msg4    <--------   notify msg4
  great! sampled thread successfully resumed
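
The whole exchange can be sketched in C as follows. This is not vignette's actual code: the semaphore names follow the text, but sampler_init, sample_thread, wait_on, the walk callback and the small demo at the end are hypothetical names introduced here. One caveat is that POSIX lists sem_post as async-signal-safe but not sem_wait; the sketch waits inside the handler anyway because that is what the scheme described above relies on.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <ucontext.h>
#include <unistd.h>

static sem_t msg2, msg3, msg4;   /* names follow the text above */
static ucontext_t g_context;     /* shallow copy of the sampled thread's context */

/* sem_wait can return early with EINTR, so retry until it succeeds. */
static void wait_on(sem_t *s) { while (sem_wait(s) == -1) { } }

/* Runs on the sampled thread, on top of whatever it was doing. */
static void sigprof_handler(int sig, siginfo_t *info, void *uc) {
    (void)sig; (void)info;
    memcpy(&g_context, uc, sizeof g_context);
    sem_post(&msg2);   /* the context is ready for the sampler */
    wait_on(&msg3);    /* parked here: this is the "suspension" */
    sem_post(&msg4);   /* tell the sampler we are on our way out */
}

/* One-time setup: 3-argument handler, all semaphores starting at 0. */
static int sampler_init(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sem_init(&msg2, 0, 0) || sem_init(&msg3, 0, 0) || sem_init(&msg4, 0, 0))
        return -1;
    return sigaction(SIGPROF, &sa, NULL);
}

/* Runs on the sampler thread. walk() stands in for the unwinder of Part 3. */
static int sample_thread(pid_t tid, void (*walk)(const ucontext_t *)) {
    if (syscall(SYS_tgkill, getpid(), tid, SIGPROF) != 0)
        return -1;      /* the thread is gone; nothing was suspended */
    wait_on(&msg2);     /* block until the handler has copied the context */
    walk(&g_context);   /* the sampled thread is parked while we walk */
    sem_post(&msg3);    /* let it resume */
    wait_on(&msg4);     /* do not move on until it has left the handshake */
    return 0;
}

/* Tiny demonstration: sample one busy worker thread once. */
static sem_t ready;
static volatile pid_t worker_tid;
static volatile sig_atomic_t stop_worker, walks;

static void *worker(void *arg) {
    (void)arg;
    worker_tid = (pid_t)syscall(SYS_gettid);
    sem_post(&ready);
    while (!stop_worker) { }   /* busy loop standing in for real work */
    return NULL;
}

static void count_walk(const ucontext_t *uc) { (void)uc; walks++; }

/* Returns the number of walks performed (1 on success), or -1 on error. */
int demo_sample_worker(void) {
    pthread_t t;
    if (sampler_init() != 0 || sem_init(&ready, 0, 0) != 0) return -1;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;
    wait_on(&ready);
    int rc = sample_thread(worker_tid, count_walk);
    stop_worker = 1;
    pthread_join(t, NULL);
    return rc == 0 ? (int)walks : -1;
}
```

Because the semaphores and g_context are shared, sample_thread must only ever be called from the single sampler thread, one target at a time, and never with the sampler's own thread ID.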

Out-of-process profilers

The Windows and Mac suspension mechanisms remain similar. You do need permission to perform those actions on another process.

On a Mac, you’d use the Mach APIs to retrieve a task port from a BSD process PID.

On Linux, since the app being profiled is not aware of the profiler, setting a signal handler is difficult. The correct way is to use ptrace(2), which operates on a per-thread (task) level. After attaching to a thread, the registers can be read for unwinding. This approach comes with several edge cases of its own.
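
As a rough sketch of the ptrace route, the function below attaches to one thread, reads its general-purpose registers and detaches again. The names read_thread_registers and demo_read_child_registers are hypothetical, and the demo traces a forked child only because that is the easiest target a process is normally allowed to trace. Whether the attach succeeds depends on system policy (for example the Yama ptrace_scope setting), and a real profiler would have to deal with signals that arrive while the thread is stopped.

```c
#include <elf.h>          /* NT_PRSTATUS */
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>     /* struct user_regs_struct */
#include <sys/wait.h>
#include <unistd.h>

/* Stops thread tid, copies its general-purpose registers into regs,
   then lets it run again. Returns 0 on success, -1 on failure. */
int read_thread_registers(pid_t tid, struct user_regs_struct *regs) {
    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == -1) {
        return -1;   /* for example EPERM: we are not allowed to trace it */
    }

    /* Attaching sends a stop; wait until the thread has actually stopped.
       __WALL is needed for threads that are not the thread group leader. */
    int status;
    if (waitpid(tid, &status, __WALL) == -1 || !WIFSTOPPED(status)) {
        ptrace(PTRACE_DETACH, tid, NULL, NULL);
        return -1;
    }

    struct iovec iov = { .iov_base = regs, .iov_len = sizeof *regs };
    long rc = ptrace(PTRACE_GETREGSET, tid, (void *)(long)NT_PRSTATUS, &iov);

    /* Always detach, otherwise the thread stays stopped (pitfall 1). */
    ptrace(PTRACE_DETACH, tid, NULL, NULL);
    return rc == -1 ? -1 : 0;
}

/* Demonstration: read the registers of a forked child that only sleeps. */
int demo_read_child_registers(void) {
    pid_t child = fork();
    if (child == -1) {
        return -1;
    }
    if (child == 0) {
        for (;;) pause();   /* child: wait for signals until it is killed */
    }

    struct user_regs_struct regs;
    int rc = read_thread_registers(child, &regs);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return rc;
}
```

The registers returned here play the same role as the ucontext_t in the in-process case: they are the starting point for unwinding.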