Diving into the Python call stack (PyGotham 2018)
I gave a talk at PyGotham 2018 about how Python implements stack frames and how Dropbox leveraged that to improve crash reporting using Crashpad. I also contributed to the Dropbox Tech Blog post that goes into great detail on the crash reporting pipeline.
The talk was less about Crashpad and more about Python internals. There is a video and slides. As part of preparing for the talk, I wrote the following post. It started off to get my thoughts in order and not be distracted by making slides, so the tone is more suitable to speaking than writing. The finishing touches were made in speaker notes and migrated back to here.
Disclaimer: Words are my own and do not represent the views of my employer.
While I have used Python for small scripts for over a decade, it was only since starting at Dropbox in 2017 that it has been my primary language. At Dropbox I work on the desktop client, a majority of which is written in Python. Bugs are inevitable, even in a language as nice as Python. Since we interoperate with the operating system so much, a lot of our Python code calls into native code. This code can dereference null pointers, libraries can mis-use types or there may be weird thread interactions, all of which can lead to crashes.
Our crash reporting solution for several years was simply a signal handler for things like segmentation faults. This python signal handler would attempt to extract python stacks for all threads and upload them. This doesn’t always work:
- If our application failed before the signal handler was installed, we didn’t get reports.
- In-process handlers are restricted due to potentially corrupted state of the program and limited resources (signal safety). In addition it is really easy to end up in deadlock or violate other invariants if one isn’t careful since signals indiscriminately interrupt normal execution.
In late 2016 we started to migrate to Crashpad, an solution developed and used by the Google Chrome team. Crashpad is a crash reporting solution that runs as a separate process. We can start it really really early in program execution, before Python has even been initialized. Once started, it just sits in the background, waiting for Dropbox to crash. Since it runs out of process it can reliably observe all crashes. When it detects a crash, it captures a lot of useful information like machine details, various parts of the program state and some custom attributes we add indicating the Dropbox version and so on. This is uploaded to our servers.
The server side tooling for processing its crashes is also mature. This would significantly improved our crash reporting rate and quality.
Crashpad is designed for native applications, so it only captures the C stack. Suddenly, while engineers had more reliable crash reporting, they also had pretty useless reporting.
What I mean is, in a language like Python, when an exception is thrown, you get a nice traceback pointing exactly where things went wrong.
Traceback (most recent call last):
File "run.py", line 16, in <module>
main()
File "run.py", line 14, in main
foo()
File "run.py", line 11, in foo
bar()
File "run.py", line 8, in bar
baz()
File "run.py", line 4, in baz
raise Exception("Hi there")
Exception: Hi there
Instead, since Crashpad does not understand Python, we would get something like this. This is a native or C traceback.
#0 0x00007f9026e263a3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x000057da6483182e in pysleep (secs=<optimized out>) at ../Modules/timemodule.c:1408
#2 time_sleep () at ../Modules/timemodule.c:231
…
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffdb1512450) at ../Python/ceval.c:4745
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#14 0x000057da6478193f in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffdb1512580,
func=<optimized out>) at ../Python/ceval.c:4818
#15 call_function (oparg=<optimized out>, pp_stack=0x7ffdb1512580) at ../Python/ceval.c:4745
#16 PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#17 0x000057da64786286 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4033
#18 0x000057da64786f9f in PyEval_EvalCodeEx () at ../Python/ceval.c:4054
#19 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:777
#20 0x000057da648548f2 in run_mod () at ../Python/pythonrun.c:976
#21 0x000057da64856e1d in PyRun_FileExFlags () at ../Python/pythonrun.c:929
#22 0x000057da648575be in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:396
#23 0x000057da648854d7 in run_file (p_cf=0x7ffdb15127f0, filename=0x57da65cca260 L"dummy.py", fp=0x57da65d30780)
at ../Modules/main.c:318
#24 Py_Main () at ../Modules/main.c:768
#25 0x000057da64715c01 in main () at ../Programs/python.c:65
#26 0x00007f9026d652e1 in __libc_start_main (main=0x57da64715b20 <main>, argc=2, argv=0x7ffdb1512a08, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdb15129f8) at ../csu/libc-start.c:291
#27 0x000057da6481f1ba in _start ()
Some native code has crashed, but that native code was triggered by Python code, and it is impossible to see what that Python code was.
We would have to teach Crashpad about Python if we wanted to use this solution. Could we get to real python function names and lines of code from here? I’ll spare you the suspense; yes we can!
The way crashpad works with the OS, the crashing program is suspended, with all it’s memory still valid, and we can introspect it. The program is potentially in a bad state, so we can’t execute any code within it. Instead we need to:
- Figure out how Python data structures are laid out in memory.
- Walk the right data structures to figure out what code is running when the program crashed.
- Store this useful information and upload it to our servers.
This is where my journey started. Fortunately, The Python interpreter is one of the easier interpreters out there as it doesn’t have a JIT (Just In Time interpreter) and the GIL (Global Interpreter Lock) takes care of a lot of locking, so most interpreter code runs in one thread and is easy to reason about.
Walking through execution
Here is the code that created the stack we saw before
def baz():
time.sleep(100)
def bar():
baz()
def foo():
bar()
def main():
foo()
main()
and here is the cleaned up stack.
#0 __select_nocancel ()
#1 pysleep
#2 time_sleep
#3 call_function
#4 PyEval_EvalFrameEx
#5 fast_function
#6 call_function
#7 PyEval_EvalFrameEx
#8 fast_function
#9 call_function
#10 PyEval_EvalFrameEx
#11 fast_function
#12 call_function
#13 PyEval_EvalFrameEx
#14 fast_function
#15 call_function
#16 PyEval_EvalFrameEx
#17 _PyEval_EvalCodeWithName
#18 PyEval_EvalCodeEx
#19 PyEval_EvalCode
Does any one notice any parallels?
There is a set of C functions per Python function call! We can infer that for every Python frame that
needs to be executed, PyEval_EvalFrameEx
is eventually called.
(To clarify, I am assuming Python 3.5 in this talk.)
I’m going to dig into some Python bytecode in the rest of this talk, which is the internal representation that is evaluated by the Python virtual machine. If you are not aware of this, I recommend listening to James Bennett’s talk - A Bit about Bytes, from PyCon.
PyEval_EvalFrameEx
is the workhorse of the Python interpreter.
def PyEval_EvalFrameEx(frame):
code = f.f_code
while True:
op = next_instruction()
if op == LOAD_FAST:
...
elif op == BINARY_ADD:
...
elif op == CALL_FUNCTION:
...
To be very clear, this function is actually written in C, and is about 3000 lines long. In the next couple of slides, I’m going to use pretend Python versions to make it easier to understand.
Ok, so given some function code, this function is responsible for actually matching against every bytecode operation in it and executing it.
It actually operates on a frame, which is a specific structure we are going to look at in just a bit.
There are various call sites that end up executing a frame, which are responsible for
creating a frame and then calling PyEval_EvalFrameEx
.
For our example, the top level evaluation encounters a CALL_FUNCTION
opcode, which leads to the
call_function
and fast_function
calls in the native stack. Here fast_function
is responsible
for creating a new frame and requesting evaluation.
def call_function(...):
fast_function()
def fast_function():
t = PyThreadState_GET()
f = PyFrameObject()
f.f_back = t.frame
t.frame = f
...
PyEval_EvalFrameEx(f)
del f
That is how we end up with the stack trace we saw before.
Let’s now look at what these frames are and how things are set up in the interpreter.
Linked lists all the way down
Each Python interpreter has a top level, PyInterpreterState
structure.
typedef struct _is {
...
struct _ts *tstate_head;
} PyInterpreterState;
In typical applications there is only one of these. This interpreter stores things common to the
whole program. Each Python thread then has a PyThreadState
that contains thread specific
information. Python threads map 1:1 to native threads, so each thread has its own stack, both in
native code and in Python. A PyThreadState
looks like:
typedef struct _ts {
/* See Python/ceval.c for comments explaining most fields */
struct _ts *prev;
struct _ts *next;
PyInterpreterState *interp;
struct _frame *frame;
...
long thread_id; /* Thread id where this tstate was created */
...
} PyThreadState;
Here we see that the thread state has a back reference to the interpreter, a reference to a frame, and the thread identifier. In addition it has these prev and next pointers to other thread states.
C programmers will immediately recognize this as a common pattern called intrusive linked lists. Python structures are linked lists all the way down. The collection of PyThreadStates is maintained as a doubly linked list with each thread being added to the beginning of the list.
When a new thread is created, it is set up with the right references, and inserted to the front of the interpreter’s thread list.
Each thread needs to maintain information about the code running in it. This is done at the level of
function calls, and each function execution is represented by a frame. As you make nested function
calls, the thread state maintains them as a stack of these PyFrameObjects. The top most frame at any
given time is pointed to by the thread state using the frame
pointer.
Each frame points to the frame that precedes it, using the f_back
pointer, creating a stack with
the head of the linked list being the top of the stack. Storing everything as linked lists allows
for operational ease within the interpreter.
CPython also exposes frames to Python code. The PyObject_VAR_HEAD
prefix indicates
that PyFrameObjects are PyObjects like any others, allowing attribute access via Python.
typedef struct _frame {
PyObject_VAR_HEAD
struct _frame *f_back;
PyCodeObject *f_code;
PyObject *f_builtins;
PyObject *f_globals;
PyObject *f_locals;
PyObject **f_valuestack;
PyObject **f_stacktop;
PyObject *f_trace;
PyObject *f_exc_type, *f_exc_value, *f_exc_traceback;
PyObject *f_gen;
int f_lasti;
int f_lineno;
int f_iblock;
char f_executing;
PyTryBlock f_blockstack[CO_MAXBLOCKS];
PyObject *f_localsplus[1];
} PyFrameObject;
Being the basic unit of execution, frames are pretty complex! This is a large structure, and allocating and initializing it can get expensive. The Python interpreter does several optimizations here, including a dramatically named zombie frame, that I encourage you to look into.
Fortunately some of it is bookkeeping information for loops, exceptions, generators and tracing so let’s remove all that and focus on the simplest case.
Frame internals
To execute the simplest function, the frame needs the following.
typedef struct _frame {
PyObject_VAR_HEAD
/* lots of properties elided */
struct _frame *f_back;
PyCodeObject *f_code;
PyObject **f_valuestack;
int f_lasti;
PyObject *f_localsplus[1];
} PyFrameObject;
f_code
is the actual code of the function that this frame is evaluating.
The localsplus array is the actual operating area of a frame. Its real size depends on the code being executed, which is why it has to be the last member in the struct. Space is allocated for any local variables and then stack space is assigned based on analysis during compilation. For this talk, we are going to assume the stack size used is correct. f_valuestack is offset into f_localsplus so stack operations can ignore the locals and have a well defined bottom of the stack.
If you are familiar with native stack frames, you can think of f_valuestack as the frame pointer, with arguments below the frame pointer and stack variables above it.
Finally, f_lasti refers to the last executed instruction. I will cover it later when we look at reversing stacks.
f_code
maintains information about the actual code running in this frame.
The valuestack, stacktop and f_localsplus fields actually manage the frames “scratch space” - the
set of locals, stack allocated variables and other values unique to the frame. Relating this to the
native execution model, this is the real “stack” for the frame. Unlike the native stack, which is a
contiguous piece of memory, each frame generally has its own stack, allocated at the end of the
PyFrameObject itself. f_localsplus
points to this chunk of memory and f_valuestack
is offset to go
past local variables and arguments to the first usable stack space. f_stacktop
initially points to
valuestack, but usually is NULL by the time the frame instructions are being evaluated.
(In the talk and slides, I walk through an execution of a simple set of functions. Reproducing this in text is time-consuming, so I encourage you to look at the slides/video for this.)
Phew! That was a lot of information! That completes our journey into Python frames. We have seen how frames are stored, what information they contain and how basic frame execution works.
There are several standard library modules that let you dig into frames, including inspect and traceback.
With everything that we just covered, we can implement some of their functionality ourselves. This program accesses the top frame in each thread using the sys._current_frames() function, then walks the back pointers to print a stack, similar to what you may get on an exception.
def print_stack():
for tid, frame in sys._current_frames().items():
# skip this frame
frame = frame.f_back
while frame:
code = frame.f_code
print("File {}, line {}, in {}".format(
code.co_filename,
frame.f_lineno,
code.co_name,
))
frame = frame.f_back
def bar():
print_stack()
def foo():
bar()
def main():
foo()
if __name__ == '__main__':
main()
File dummy.py, line 17, in bar
File dummy.py, line 20, in foo
File dummy.py, line 23, in main
File dummy.py, line 26, in <module>
We access the various attributes of the code object to print the name and line number. f_lineno
is
a convenience attribute provided by Python. As we saw before, the frame itself only has f_lasti
.
Alright! That concludes our dive into the interpreter.
Crash reporting
At Dropbox, we used almost everything we learnt so far to successfully capture Python stacks from Crashpad. We were only interested in capturing the function name and line number for each thread, so we don’t need to concern ourselves with the byte code evaluation. We looked at how the data structures are set up in the interpreter. Since they are set up as simple lists, we can ignore all the native C functions the perform the evaluation. That is, we don’t care what the native stack looks like. We are just going to find the structs we care about in memory and work with them.
Remember that Crashpad is a powerful platform that allows us to read the process’s memory, so we can directly access the Python structures we looked at today.
The final problem to solve is, given our crashed program, obtaining a reliable starting point to acquire a reference to a thread state. Unfortunately the interpreter state, the thread states and the frames are all dynamically allocated at essentially random memory addresses. So where should we start?
It’s all just code! I cannot stress this enough. Several times in my career I’ve been blocked on a problem only to realize that it is OK to change code to do what you want it to do!
That said, remember that every time you use something in an undocumented way, you acquire a dependency that could potentially break.
Using thread local storage
In our case, Python interpreter stores the interpreter state in a global autoInterpreterState
and
it also stores each thread’s PyThreadState
in thread local storage, indexed by autoTLSKey
,
another global.
Thread-local storage is a simple dictionary like structure provided either by OS, that allows multiple threads to use one shared key to access different values. The key itself is the same across threads, but the values are unique. Crashpad captures the thread local storage during crash capture. In fact, we upstreamed a bug fix that made this work correctly on Windows.
In Python, the key to store the thread state is in the autoTLSKey
variable in pystate.c. The value
is a pointer to a thread state, and is different for thread 1 vs thread 2. If we can access
autoTLSkey in a deterministic way, we can access the thread state for each thread.
We leveraged the executable formats on modern OSes to add something called a section, which you can
think of as a named range of memory in the executable. This way we can read metadata in the
executable and look up the address where autoTLSkey is stored. We modified the Python interpreter to
add a named section to indicate where autoTLSKey
is stored, then looked up the section in
Crashpad.
The next step is to reverse-engineer how Mac and Windows implement TLS, which was a fun problem in itself. We can then find pointers to the PyThreadStates. From there we are in business!
The thread state gives us a reference to the top of the Python stack. From there we walk the
f_back
pointers to get the full stack.
Once we have a frame, the final step is to resolve it to a code location. Each PyFrame is associated with a Python function and information about the code for the function is in the code object.
typedef struct {
...
int co_firstlineno;
...
PyObject *co_filename;
PyObject *co_name;
PyObject *co_lnotab;
...
} PyCodeObject;
The filename and function name are strings. Decoding Python strings could be a talk in itself; here we just assume they are ASCII and read out bytes.
Getting the line number is slightly more complicated. f_lasti is the last byte code instruction executed. co_firstlineno in the code object stores the actual line number in the source file of the first line of this code. To save space, while being able to map every byte code instruction to Python source, co_lnotab is a table storing these as relative offsets from co_firstlineno. We have to use this to do the translation just like Python does.
Conclusion
That’s it! We can put all this together and watch crashes become useful again!
We went from the crash report we saw before to a real python traceback, exactly as if it had been generated by Python itself.
This kind of low level tinkering was not only really fun, but the presence of reliable and actionable crash reporting makes Dropbox engineers significantly more productive, and allows us to spot issues really quickly.
In the past year we have gone from having a low crash report rate, and often unactionable crashes, to generally very actionable and useful crash reporting. We have leveraged several other crashpad features to capture other kinds of errors like Python interpreter issues to problems loading native libraries. Dropbox finished one of the largest Python 2 to Python 3 migrations and our new crash reporting infrastructure was instrumental in letting us have the confidence that major bugs would not sneak out to our users.
Related work
This kind of reverse engineering of the interpreter is not just useful for crash reporting. While we had a very specific need here, the knowledge we’ve covered today is widely used across languages to implement sampling profilers. A sampling profiler is a performance investigation tool that inspects the program very frequently, does a similar stack unwind to determine which functions are currently executing, and uses that to generate a report of the most frequently used or longest running functions in your program.
Python provides in-built ways to profile programs using the sys.setprofile
hook, but sometimes you
can’t change the source code, or you can’t restart a program, or you aren’t willing to take the
performance hit.
Enter pyflame and py-spy. These are sampling profilers that use exactly the same techniques we saw today, and that Crashpad uses, of reading program memory from outside the program. Since the program is assumed to be in a good state when these profilers run, they actually execute code inside the program to figure out where the thread states are located, so they don’t need to change the interpreter. I encourage reading their sources.
Wrapping up
For more background on everything I discussed today, it is best to read the Python source code.
The crashpad design offers a good overview of its capabilities and how to use it in your application.
I’d like to thank my colleagues Max Belanger and Jonathan Chien who helped with the experiments and code reviews to get this shipping.
The Crashpad team has created a really good and easy to extend crash reporting system and are very responsive to questions and feedback, hats off to them!