Sampling Profiler Internals: Suspending Threads

This is part 2 of the Sampling Profiler Internals series: Introduction, Suspending threads, Stack unwinding, Symbolication, Presenting profile output, and Extending the profiler to managed languages. As described in the introduction, a sampling profiler captures the stack of each thread every few milliseconds. To do this, it is preferable to suspend the thread. We don’t want the stack changing under us as we are profiling.

Sampling Profiler Internals: Introduction

Sampling profilers are useful tools for performance analysis of programs. I’ve spent a lot of time over the past several months digging into various implementations of sampling profilers and reading about symbolication, threads, stack unwinding and other gory operating system details. This series of posts attempts to summarize my understanding: Introduction, Suspending threads, Stack unwinding, Symbolication, Presenting profile output, and Extending the profiler to managed languages. Background: High CPU usage is a problem that comes up often in widely used software.

Diving into the Python call stack (PyGotham 2018)

I gave a talk at PyGotham 2018 about how Python implements stack frames and how Dropbox leveraged that to improve crash reporting using Crashpad. I also contributed to the Dropbox Tech Blog post that goes into great detail on the crash reporting pipeline. The talk was less about Crashpad and more about Python internals. There is a video and slides. As part of preparing for the talk, I wrote the following post.

The main thread name and process name are the same thing on Linux!

On typical days, we software engineers are usually stuck due to head-scratching bugs, instead of actually writing interesting software. I had made some small changes and suddenly our end-to-end tests were timing out, with no useful clues. The harness emits a crazy number of log lines, so digging through them was difficult. Fortunately, the problem was easily reproducible on a local VM, and with some piecemeal commenting I was able to isolate it to changing the name of the main thread, which I had done to add some debugging information.
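The claim in the title can be observed directly through procfs. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function name `read_comm_names` is hypothetical and not from the post. It relies on the fact that the main thread's thread ID equals the process ID, so the main thread's name lives at /proc/<pid>/task/<pid>/comm.

```python
import os


def read_comm_names():
    """Return (process_name, main_thread_name) as reported by Linux procfs.

    Assumes Linux with /proc mounted. The main thread's TID is the same
    number as the PID, so its per-thread entry is task/<pid>.
    """
    pid = os.getpid()
    with open(f"/proc/{pid}/comm") as f:
        process_name = f.read().strip()
    with open(f"/proc/{pid}/task/{pid}/comm") as f:
        main_thread_name = f.read().strip()
    return process_name, main_thread_name


if __name__ == "__main__":
    process_name, main_thread_name = read_comm_names()
    print("process name:    ", process_name)
    print("main thread name:", main_thread_name)
```

Both values come from the same kernel field, which is why renaming the main thread also changes what tools that look up processes by name will see.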

Why does my stack have an extra 4 bytes? Digging into Clang's return value implementation

I was futzing around with some C code a few days ago and noticed that executables generated by Clang would sometimes have an extra 4 bytes on the stack. This was just for the main function. We can verify this in Compiler Explorer. Try switching to GCC and this doesn’t happen. This was interesting, so I spent a few hours over the holiday digging into why and how this happens.

CppCon 2017 Talks I enjoyed

I spent a recent holiday listening to several CppCon talks. I’m hooked! I was impressed by the generally high quality of the talks. The lack of “use my/my company’s framework which is so awesome!” talks was refreshing. In addition, the use cases where C++ trumps most competition are often either performance sensitive, or correctness sensitive under strong constraints. Of course, this means contending with the syntactic and semantic complexities C++ throws at you.

Using Windows Job Objects for Process Tree Management

Using child processes to perform various tasks is a standard construct in larger programs. The simplest reason is that this gets you memory isolation and resource management for free, with the OS managing scheduling, file descriptors and other resources. A common requirement when using multiple processes is the ability to wait on or kill one or more of these children. It is not always possible to record process IDs at fork(), since the fork may happen in a library that does not give you such access.

Debugging MacOS file locking with DTrace

A few days ago, I was stymied at work by a set of tests that had intermittent failures on OSX but not Windows. There was a process which would try to obtain an exclusive lock on a file, using the lock-on-open provided by the BSD/MacOS O_EXLOCK flag to open(2). It also used O_NONBLOCK; if the file was locked by another process, it could be skipped. The process would hold the lock and remove (unlink(2)) the file, before close(2)-ing the descriptor.

Canceling socket operations using I/O multiplexing.

I spent the last couple of months diving into networking code at work. This led to some interesting discoveries, including how to allow an in-progress network operation to be canceled on demand. This can be an attempt to establish a connection, or a socket read or write. The standard use case is to allow the user to cancel an operation or to allow a clean shutdown when multiple threads are operating on sockets.
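One common way to make a blocking socket read cancelable with I/O multiplexing is to wait on the data socket and a second "cancel" descriptor at the same time. The sketch below is an illustration of that general idea, not necessarily the approach the post takes; the names `recv_or_cancel` and `Cancelled` are hypothetical, and a `socket.socketpair()` stands in for the cancel channel.

```python
import select
import socket


class Cancelled(Exception):
    """Raised when the cancel channel becomes readable before data arrives."""


def recv_or_cancel(sock, cancel_sock, nbytes, timeout=None):
    """Wait for data on sock, but give up early if cancel_sock is readable."""
    readable, _, _ = select.select([sock, cancel_sock], [], [], timeout)
    # Check the cancel channel first so cancellation wins if both are ready.
    if cancel_sock in readable:
        raise Cancelled("operation canceled")
    if sock in readable:
        return sock.recv(nbytes)
    raise TimeoutError("no data before timeout")


if __name__ == "__main__":
    data_a, data_b = socket.socketpair()
    cancel_r, cancel_w = socket.socketpair()

    data_b.sendall(b"hello")
    print(recv_or_cancel(data_a, cancel_r, 5, timeout=1))

    # Another thread (or a signal handler) would normally do this write.
    cancel_w.send(b"x")
    try:
        recv_or_cancel(data_a, cancel_r, 5, timeout=1)
    except Cancelled:
        print("canceled")
```

The thread that wants to cancel only has to write a byte to its end of the cancel channel; the waiting thread wakes up from `select` without the data socket having to be closed underneath it.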

Systems We ❤

I had a really great time at Systems We Love 2 weeks ago. It was refreshing to attend a conference that had genuine talks about complicated systems that keep or kept the “world” running. No sponsor-driven drivel and no “here is how to do X” talks that could’ve been summarized by documentation. Ozan Onay has already summarized all the talks, including links to videos, so I don’t have to. A few talks stood out in particular.