Sampling Profiler Internals: Suspending Threads
This is part 2 of the Sampling Profiler Internals series.
As described in the introduction, a sampling profiler captures the stack of each thread every few milliseconds. To do this, it is preferable to suspend the thread. We don’t want the stack changing under us while we are capturing it. Even worse, a thread could terminate while we are attempting to profile it, invalidating all the memory we want to read. It is safer to suspend it and resume once the capture is done.
Pitfalls of suspending threads
This is probably the most important lesson of building an in-process profiler. You can do most other things wrong and at least get some output, but if you ignore the advice here, you may not get output at all and will likely make the program misbehave. There are three things your profiler CANNOT do while it has a thread suspended:
- You MUST NOT forget to resume the suspended thread - failing to do this would prevent programs from making progress.
- You MUST NOT suspend the sampling thread itself - doing this will prevent the profiler from making progress.
- You MUST NOT allocate memory or acquire locks on the sampler thread that other threads have access to. When a thread is suspended, it may be holding locks, or be partway through updating memory that other code assumes is consistent. Many platform APIs take per-process locks behind the scenes, including things like printf and malloc! Let’s say the thread acquired the allocator lock and then was suspended. If your sampling thread attempts to allocate memory, it will block on the allocator lock and your program will deadlock! This means any memory we need for storing data about the thread stack (which will be covered in Part 3) must be pre-allocated before we suspend the thread. You cannot dynamically resize this while any thread is suspended.
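To make the pre-allocation rule concrete, here is a minimal sketch in C of a sample buffer that is sized once up front and never grows. The names (sample_t, sample_buffer_t, MAX_FRAMES and the functions) are hypothetical and are not taken from any particular profiler; the actual storage format is the subject of Part 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_FRAMES 128  /* hypothetical cap on captured stack depth */

/* One captured stack: a fixed-size array, so recording never allocates. */
typedef struct {
    uintptr_t frames[MAX_FRAMES];
    size_t depth;
} sample_t;

typedef struct {
    sample_t *samples;  /* allocated once, up front */
    size_t capacity;
    size_t count;
} sample_buffer_t;

/* Call while NO thread is suspended: this is the only place that allocates. */
int sample_buffer_init(sample_buffer_t *buf, size_t capacity) {
    assert(capacity > 0);
    buf->samples = calloc(capacity, sizeof(sample_t));
    buf->capacity = buf->samples != NULL ? capacity : 0;
    buf->count = 0;
    return buf->samples != NULL ? 0 : -1;
}

/* Safe while a thread is suspended: no allocation, no locks.
   Returns NULL when full; the caller drops the sample instead of resizing. */
sample_t *sample_buffer_next(sample_buffer_t *buf) {
    if (buf->count == buf->capacity) return NULL;
    sample_t *s = &buf->samples[buf->count++];
    s->depth = 0;
    return s;
}

/* Append one program counter; frames beyond MAX_FRAMES are truncated. */
void sample_push_frame(sample_t *s, uintptr_t pc) {
    if (s->depth < MAX_FRAMES) s->frames[s->depth++] = pc;
}

/* Call while NO thread is suspended: free() takes the allocator lock too. */
void sample_buffer_free(sample_buffer_t *buf) {
    free(buf->samples);
    buf->samples = NULL;
    buf->capacity = 0;
    buf->count = 0;
}
```

When the buffer fills up, this sketch drops samples rather than growing. Growing would be fine too, as long as it happens between captures, when every thread is running.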
Finally, suspending a thread does have a performance penalty. First, it slows down the program. We try to minimize this by keeping our stack collection as fast as possible. Second, pausing the thread may force the OS to context switch to get another thread going. There isn’t much we can do about this.
Suspending and resuming threads is one of the easier parts of profiling. All three OSes provide a way to suspend threads of the current process, although the Linux mechanism is considerably more involved than the other two.
Sampling all threads or just registered threads
A sampling profiler can choose to sample every thread in the application, or only certain threads. The latter solution is good if you want to offer selective profiling, like browsers often do for web pages. It can also improve the performance of the profiler.
Selective profiling is achieved by having interested threads call some function to register themselves with the profiler.
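As an illustration, registration could be as small as a fixed table of thread IDs. The sketch below is hypothetical (none of these names come from a real profiler) and assumes thread IDs fit in 64 bits and are never zero. It uses C11 atomics rather than a mutex on purpose: a mutex here would be a lock shared between the sampler and the threads it suspends, which is exactly what the pitfalls section warns against.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGISTERED_THREADS 64   /* hypothetical fixed capacity */

/* 0 marks an empty slot, so thread IDs stored here must be non-zero. */
static _Atomic uint64_t g_slots[MAX_REGISTERED_THREADS];

/* Called by a thread that wants to be profiled.
   Returns its slot index, or -1 if the table is full. */
int profiler_register_thread(uint64_t thread_id) {
    for (int i = 0; i < MAX_REGISTERED_THREADS; i++) {
        uint64_t expected = 0;
        /* Claim the first empty slot without taking a lock. */
        if (atomic_compare_exchange_strong(&g_slots[i], &expected, thread_id))
            return i;
    }
    return -1;
}

/* Called by the same thread before it exits. */
void profiler_unregister_thread(int slot) {
    if (slot >= 0 && slot < MAX_REGISTERED_THREADS)
        atomic_store(&g_slots[slot], 0);
}

/* Called by the sampler: copies up to `max` registered IDs into `out`. */
size_t profiler_registered_threads(uint64_t *out, size_t max) {
    size_t n = 0;
    for (int i = 0; i < MAX_REGISTERED_THREADS && n < max; i++) {
        uint64_t id = atomic_load(&g_slots[i]);
        if (id != 0) out[n++] = id;
    }
    return n;
}
```

A thread would call profiler_register_thread with its platform thread ID when it starts and profiler_unregister_thread just before it exits. The sampler then visits only the IDs returned by profiler_registered_threads.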
Here we will stick to sampling every thread to keep the profiler logic simpler.
Windows
Windows is probably the simplest platform on which to suspend and resume threads, but also the most annoying to iterate over, because there isn’t an API to iterate only the threads of a given process. This means you end up iterating over every thread in the system and discarding the ones you don’t care about, which is inefficient. A profiler that only cares about registered threads and stores their HANDLEs in a list will do better.
First, one uses the CreateToolhelp32Snapshot function to obtain a snapshot of all running threads. Then, the Thread32First and Thread32Next functions can iterate over this snapshot and obtain thread information. MSDN has a code sample showing how to use the thread iteration APIs. We can compare the thread’s th32OwnerProcessID with GetCurrentProcessId() to restrict to threads from our process.
Once we have a thread ID, we obtain a handle to it using the OpenThread() function. Then we call SuspendThread(), walk the stack, and call ResumeThread().
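// te32 is the THREADENTRY32 from the snapshot loop; skip the sampler's own thread
// (te32.th32ThreadID == GetCurrentThreadId()) before getting here.
HANDLE thread_handle = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT, FALSE, te32.th32ThreadID);
if (thread_handle == NULL) {
    // the thread may have exited since the snapshot was taken
    return;
}
if (SuspendThread(thread_handle) == (DWORD)-1) {
    // handle error
    CloseHandle(thread_handle);
    return;
}
// walk the stack (no allocation or locking while the thread is suspended)
if (ResumeThread(thread_handle) == (DWORD)-1) {
    // at this point we probably want to crash the program as this is a bad state to be in!
    abort();
}
CloseHandle(thread_handle);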
macOS
On Mac, thread suspension requires using the Mach subsystem. Mach calls within the same process are generally unrestricted, so no security measures need to be disabled.
Start by getting a port for the process itself using mach_task_self(), and then obtain a list of threads using the task_threads function.
The Mach structures and APIs are not always well documented, and often fiddly, so it is best to see working code from other projects. psutil has an example.
task_threads will return an array of thread_act_t ports, which can be passed to the suspend and resume functions directly. The thread_suspend() and thread_resume() functions do what they say.
Gecko and Chromium both use these functions. Incidentally, Mach also offers a thread_sample() function which will sample and write out PC values to a port (a queue). This is pretty cool, but I’ve not seen it used in practice.
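Both Windows and macOS keep a suspend count for each thread: every suspend call increments it, every resume call decrements it, and the thread only runs again once the count is back to zero. It is important to call resume as many times as suspend is called!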
Linux
To obtain a list of threads for a process, it is easiest to use the proc filesystem. The /proc/self/task directory has subdirectories for every thread, identified by the kernel task ID. We can just iterate over these.
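A minimal sketch of that iteration in C follows. The function name list_thread_ids is hypothetical. Note that opendir() may allocate, so under the rules from the pitfalls section this has to run while no thread is suspended, with the IDs written into a caller-provided array.

```c
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>

/* Fills `tids` with up to `max` kernel task IDs of the current process.
   Returns the number written, or -1 if /proc/self/task cannot be opened. */
int list_thread_ids(pid_t *tids, int max) {
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL) return -1;

    int n = 0;
    struct dirent *entry;
    while (n < max && (entry = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(entry->d_name, &end, 10);
        /* Skip "." and ".." (and anything else that is not a number). */
        if (end == entry->d_name || *end != '\0') continue;
        tids[n++] = (pid_t)tid;
    }
    closedir(dir);
    return n;
}
```

The list is only a snapshot: a thread can exit between the directory read and the moment we signal it, so later steps still have to cope with a task ID that no longer exists.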
Suspending and resuming threads is really involved on Linux. One has to use a complicated set of synchronization primitives combined with signals. I’ve linked to the vignette implementation throughout. I’ll use the term “sampler thread” for the thread the profiler is running on, and “sampled thread” for the thread we are interested in profiling.
- Set up a process-wide signal handler for the SIGPROF signal. We have to pass the SA_SIGINFO flag to use the 3-argument handler. This lets us access the ucontext_t param required for unwinding later.
- Set up a series of semaphores for synchronization with the sampled thread.
- Send a SIGPROF to the sampled thread. Unlike the kill(2) function, which sends a signal to an arbitrary thread in the process, tgkill(2) lets us target one specific thread.
- Great! When the sampled thread re-enters userspace, it will receive the signal, and the signal handler will be invoked. The sampler thread uses the first semaphore (msg2 in vignette) to block until the signal handler acknowledges it.
When the signal handler runs in the context of the sampled thread, the original operation of the thread is now suspended. When the signal handler exits, it will be resumed. We use a combination of semaphores to communicate with the sampler and only resume after we have the information we need.
- Within the handler, we first copy the context that we will need later. We use msg2 to notify the sampling thread that we have a context. The sampled thread waits on msg3 and is effectively suspended.
- On the sampler thread, we use the context to walk the stack. Then we notify msg3 so the sampled thread can resume itself. We wait on msg4 to be absolutely sure the sampled thread is resumed. This is required because we have shared state and shared semaphores, so we cannot move on to the next thread until both ends have finished! If we were to concurrently send a signal to another thread, its handler could run in its entirety and leave our semaphores in states we cannot predict.
- The sampled thread simply notifies msg4 since it doesn’t have anything else to do.
As a diagram:
Sampler thread                      Sampled thread
--------------                      --------------
send SIGPROF       -------->
wait on msg2
                                    SIGPROF received
                   <--------        notify msg2
                                    wait on msg3
walk the stack
notify msg3        -------->
wait on msg4       <--------        notify msg4
great! sampled thread successfully resumed
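The handshake can be sketched in C roughly as follows. This is not vignette's code: only the semaphore names msg2, msg3 and msg4 come from the text, and everything else (sampler_init, sample_thread, the walk callback and the demo plumbing) is hypothetical. One assumption is worth stating loudly: POSIX lists sem_post as async-signal-safe but not sem_wait, so waiting inside the handler relies on how Linux semaphores behave in practice.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <ucontext.h>
#include <unistd.h>

static sem_t msg2, msg3, msg4;   /* semaphore names follow the text */
static ucontext_t g_context;     /* pre-allocated; filled in by the handler */

/* sem_wait can return early with EINTR, so retry until it succeeds. */
static void wait_on(sem_t *sem) {
    while (sem_wait(sem) == -1 && errno == EINTR) {}
}

/* Runs on the sampled thread, on top of whatever it was doing. */
static void sigprof_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info;
    int saved_errno = errno;
    g_context = *(const ucontext_t *)ctx;  /* copy the context for the sampler */
    sem_post(&msg2);                       /* "I have a context for you" */
    wait_on(&msg3);                        /* effectively suspended here */
    sem_post(&msg4);                       /* "I am about to resume" */
    errno = saved_errno;
}

/* Call once, before sampling starts. Returns 0 on success. */
int sampler_init(void) {
    struct sigaction sa;
    if (sem_init(&msg2, 0, 0) || sem_init(&msg3, 0, 0) || sem_init(&msg4, 0, 0))
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;  /* SA_SIGINFO: 3-argument handler */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPROF, &sa, NULL);
}

/* Runs on the sampler thread; `tid` must be some OTHER thread of this process.
   `walk` stands in for the stack walker and must not allocate or lock. */
int sample_thread(pid_t tid, void (*walk)(const ucontext_t *, void *), void *arg) {
    if (syscall(SYS_tgkill, getpid(), tid, SIGPROF) != 0) return -1;
    wait_on(&msg2);          /* handler has copied the context */
    walk(&g_context, arg);
    sem_post(&msg3);         /* let the sampled thread leave the handler */
    wait_on(&msg4);          /* both sides done; semaphores are back at zero */
    return 0;
}

/* --- demo plumbing (hypothetical, only to exercise the handshake) --- */
static atomic_int g_worker_tid, g_stop;

static void *spin_worker(void *unused) {
    (void)unused;
    atomic_store(&g_worker_tid, (int)syscall(SYS_gettid));
    while (!atomic_load(&g_stop)) sched_yield();
    return NULL;
}

static void count_walk(const ucontext_t *ctx, void *arg) {
    (void)ctx;
    ++*(int *)arg;           /* a real walker would unwind from ctx */
}

/* Spawns a worker, samples it once, and returns 0 if the walk ran. */
int demo_sample_once(void) {
    pthread_t worker;
    int walks = 0, rc;
    atomic_store(&g_stop, 0);
    atomic_store(&g_worker_tid, 0);
    if (sampler_init() != 0) return -1;
    if (pthread_create(&worker, NULL, spin_worker, NULL) != 0) return -1;
    while (atomic_load(&g_worker_tid) == 0) sched_yield();
    rc = sample_thread((pid_t)atomic_load(&g_worker_tid), count_walk, &walks);
    atomic_store(&g_stop, 1);
    pthread_join(worker, NULL);
    return (rc == 0 && walks == 1) ? 0 : -1;
}
```

The sketch uses the tgkill system call through syscall(2) so that it does not depend on a libc wrapper being available. A single global context and one set of semaphores is enough only because the sampler handles one thread at a time, which is the reason for the final wait on msg4.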
Out-of-process profilers
The Windows and Mac suspension mechanisms remain similar. You do need permission to perform those actions on another process.
On a Mac, you’d use the Mach APIs to retrieve a task port from a BSD process PID.
On Linux, since the app being profiled is not aware of the profiler, setting a signal handler is difficult. The usual way is to use ptrace(2), which operates on a per-thread (task) level. After attaching to a thread, the registers can be read for unwinding. This approach is fraught with edge cases.