Diving into the Python call stack (PyGotham 2018)

Posted on Nov 16, 2018

I gave a talk at PyGotham 2018 about how Python implements stack frames and how Dropbox leveraged that to improve crash reporting using Crashpad. I also contributed to the Dropbox Tech Blog post that goes into great detail on the crash reporting pipeline.

The talk was less about Crashpad and more about Python internals. There is a video and slides. As part of preparing for the talk, I wrote the following post. It started off to get my thoughts in order and not be distracted by making slides, so the tone is more suitable to speaking than writing. The finishing touches were made in speaker notes and migrated back to here.

Disclaimer: Words are my own and do not represent the views of my employer.

While I have used Python for small scripts for over a decade, it was only since starting at Dropbox in 2017 that it has been my primary language. At Dropbox I work on the desktop client, a majority of which is written in Python. Bugs are inevitable, even in a language as nice as Python. Since we interoperate with the operating system so much, a lot of our Python code calls into native code. This code can dereference null pointers, libraries can mis-use types or there may be weird thread interactions, all of which can lead to crashes.

Our crash reporting solution for several years was simply a signal handler for things like segmentation faults. This python signal handler would attempt to extract python stacks for all threads and upload them. This doesn’t always work:

If our application failed before the signal handler was installed, we didn’t get reports.
In-process handlers are restricted due to potentially corrupted state of the program and limited resources (signal safety). In addition it is really easy to end up in deadlock or violate other invariants if one isn’t careful since signals indiscriminately interrupt normal execution.

In late 2016 we started to migrate to Crashpad, an solution developed and used by the Google Chrome team. Crashpad is a crash reporting solution that runs as a separate process. We can start it really really early in program execution, before Python has even been initialized. Once started, it just sits in the background, waiting for Dropbox to crash. Since it runs out of process it can reliably observe all crashes. When it detects a crash, it captures a lot of useful information like machine details, various parts of the program state and some custom attributes we add indicating the Dropbox version and so on. This is uploaded to our servers.

The server side tooling for processing its crashes is also mature. This would significantly improved our crash reporting rate and quality.

Crashpad is designed for native applications, so it only captures the C stack. Suddenly, while engineers had more reliable crash reporting, they also had pretty useless reporting.

What I mean is, in a language like Python, when an exception is thrown, you get a nice traceback pointing exactly where things went wrong.

Traceback (most recent call last):
  File "run.py", line 16, in <module>
    main()
  File "run.py", line 14, in main
    foo()
  File "run.py", line 11, in foo
    bar()
  File "run.py", line 8, in bar
    baz()
  File "run.py", line 4, in baz
    raise Exception("Hi there")
Exception: Hi there

Instead, since Crashpad does not understand Python, we would get something like this. This is a native or C traceback.

#0  0x00007f9026e263a3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x000057da6483182e in pysleep (secs=<optimized out>) at ../Modules/timemodule.c:1408
#2  time_sleep () at ../Modules/timemodule.c:231
…
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffdb1512450) at ../Python/ceval.c:4745
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#14 0x000057da6478193f in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffdb1512580,
                func=<optimized out>) at ../Python/ceval.c:4818
#15 call_function (oparg=<optimized out>, pp_stack=0x7ffdb1512580) at ../Python/ceval.c:4745
#16 PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#17 0x000057da64786286 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4033
#18 0x000057da64786f9f in PyEval_EvalCodeEx () at ../Python/ceval.c:4054
#19 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:777
#20 0x000057da648548f2 in run_mod () at ../Python/pythonrun.c:976
#21 0x000057da64856e1d in PyRun_FileExFlags () at ../Python/pythonrun.c:929
#22 0x000057da648575be in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:396
#23 0x000057da648854d7 in run_file (p_cf=0x7ffdb15127f0, filename=0x57da65cca260 L"dummy.py", fp=0x57da65d30780)
                    at ../Modules/main.c:318
#24 Py_Main () at ../Modules/main.c:768
#25 0x000057da64715c01 in main () at ../Programs/python.c:65
#26 0x00007f9026d652e1 in __libc_start_main (main=0x57da64715b20 <main>, argc=2, argv=0x7ffdb1512a08, init=<optimized out>,
                        fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdb15129f8) at ../csu/libc-start.c:291
#27 0x000057da6481f1ba in _start ()

Some native code has crashed, but that native code was triggered by Python code, and it is impossible to see what that Python code was.

We would have to teach Crashpad about Python if we wanted to use this solution. Could we get to real python function names and lines of code from here? I’ll spare you the suspense; yes we can!

The way crashpad works with the OS, the crashing program is suspended, with all it’s memory still valid, and we can introspect it. The program is potentially in a bad state, so we can’t execute any code within it. Instead we need to:

Figure out how Python data structures are laid out in memory.
Walk the right data structures to figure out what code is running when the program crashed.
Store this useful information and upload it to our servers.

This is where my journey started. Fortunately, The Python interpreter is one of the easier interpreters out there as it doesn’t have a JIT (Just In Time interpreter) and the GIL (Global Interpreter Lock) takes care of a lot of locking, so most interpreter code runs in one thread and is easy to reason about.

Walking through execution

Here is the code that created the stack we saw before

def baz():
    time.sleep(100)

def bar():
    baz()

def foo():
    bar()

def main():
    foo()

main()

and here is the cleaned up stack.

#0  __select_nocancel ()
#1  pysleep
#2  time_sleep
#3  call_function
#4  PyEval_EvalFrameEx
#5  fast_function
#6  call_function
#7  PyEval_EvalFrameEx
#8  fast_function
#9  call_function
#10 PyEval_EvalFrameEx
#11 fast_function
#12 call_function
#13 PyEval_EvalFrameEx
#14 fast_function
#15 call_function
#16 PyEval_EvalFrameEx
#17 _PyEval_EvalCodeWithName
#18 PyEval_EvalCodeEx
#19 PyEval_EvalCode

Does any one notice any parallels?

There is a set of C functions per Python function call! We can infer that for every Python frame that needs to be executed, PyEval_EvalFrameEx is eventually called.

(To clarify, I am assuming Python 3.5 in this talk.)

I’m going to dig into some Python bytecode in the rest of this talk, which is the internal representation that is evaluated by the Python virtual machine. If you are not aware of this, I recommend listening to James Bennett’s talk - A Bit about Bytes, from PyCon.

PyEval_EvalFrameEx is the workhorse of the Python interpreter.

def PyEval_EvalFrameEx(frame):
  code = f.f_code
  while True:
    op = next_instruction()
    if op == LOAD_FAST:
        ...
    elif op == BINARY_ADD:
        ...
    elif op == CALL_FUNCTION:
        ...

To be very clear, this function is actually written in C, and is about 3000 lines long. In the next couple of slides, I’m going to use pretend Python versions to make it easier to understand.

Ok, so given some function code, this function is responsible for actually matching against every bytecode operation in it and executing it.

It actually operates on a frame, which is a specific structure we are going to look at in just a bit. There are various call sites that end up executing a frame, which are responsible for creating a frame and then calling PyEval_EvalFrameEx.

For our example, the top level evaluation encounters a CALL_FUNCTION opcode, which leads to the call_function and fast_function calls in the native stack. Here fast_function is responsible for creating a new frame and requesting evaluation.

def call_function(...):
  fast_function()

def fast_function():
  t = PyThreadState_GET()
  f = PyFrameObject()
  f.f_back = t.frame
  t.frame = f
  ...
  PyEval_EvalFrameEx(f)
  del f

That is how we end up with the stack trace we saw before.

Let’s now look at what these frames are and how things are set up in the interpreter.

Linked lists all the way down

Each Python interpreter has a top level, PyInterpreterState structure.

typedef struct _is {
    ...
    struct _ts *tstate_head;
} PyInterpreterState;

In typical applications there is only one of these. This interpreter stores things common to the whole program. Each Python thread then has a PyThreadState that contains thread specific information. Python threads map 1:1 to native threads, so each thread has its own stack, both in native code and in Python. A PyThreadState looks like:

typedef struct _ts {
    /* See Python/ceval.c for comments explaining most fields */

    struct _ts *prev;
    struct _ts *next;
    PyInterpreterState *interp;

    struct _frame *frame;
		...
    long thread_id; /* Thread id where this tstate was created */
		...
} PyThreadState;

Here we see that the thread state has a back reference to the interpreter, a reference to a frame, and the thread identifier. In addition it has these prev and next pointers to other thread states.

C programmers will immediately recognize this as a common pattern called intrusive linked lists. Python structures are linked lists all the way down. The collection of PyThreadStates is maintained as a doubly linked list with each thread being added to the beginning of the list.

When a new thread is created, it is set up with the right references, and inserted to the front of the interpreter’s thread list.

Each thread needs to maintain information about the code running in it. This is done at the level of function calls, and each function execution is represented by a frame. As you make nested function calls, the thread state maintains them as a stack of these PyFrameObjects. The top most frame at any given time is pointed to by the thread state using the frame pointer.

Each frame points to the frame that precedes it, using the f_back pointer, creating a stack with the head of the linked list being the top of the stack. Storing everything as linked lists allows for operational ease within the interpreter.

CPython also exposes frames to Python code. The PyObject_VAR_HEAD prefix indicates that PyFrameObjects are PyObjects like any others, allowing attribute access via Python.

typedef struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;
    PyCodeObject *f_code;
    PyObject *f_builtins;
    PyObject *f_globals;
    PyObject *f_locals;
    PyObject **f_valuestack;
    PyObject **f_stacktop;
    PyObject *f_trace;

    PyObject *f_exc_type, *f_exc_value, *f_exc_traceback;
    PyObject *f_gen;

    int f_lasti;
    int f_lineno;
    int f_iblock;
    char f_executing;
    PyTryBlock f_blockstack[CO_MAXBLOCKS];
    PyObject *f_localsplus[1];
} PyFrameObject;

Being the basic unit of execution, frames are pretty complex! This is a large structure, and allocating and initializing it can get expensive. The Python interpreter does several optimizations here, including a dramatically named zombie frame, that I encourage you to look into.

Fortunately some of it is bookkeeping information for loops, exceptions, generators and tracing so let’s remove all that and focus on the simplest case.

Frame internals

To execute the simplest function, the frame needs the following.

typedef struct _frame {
    PyObject_VAR_HEAD

    /* lots of properties elided */

    struct _frame *f_back;
    PyCodeObject *f_code;

    PyObject **f_valuestack;
    int f_lasti;
    PyObject *f_localsplus[1];
} PyFrameObject;

f_code is the actual code of the function that this frame is evaluating.

The localsplus array is the actual operating area of a frame. Its real size depends on the code being executed, which is why it has to be the last member in the struct. Space is allocated for any local variables and then stack space is assigned based on analysis during compilation. For this talk, we are going to assume the stack size used is correct. f_valuestack is offset into f_localsplus so stack operations can ignore the locals and have a well defined bottom of the stack.

If you are familiar with native stack frames, you can think of f_valuestack as the frame pointer, with arguments below the frame pointer and stack variables above it.

Finally, f_lasti refers to the last executed instruction. I will cover it later when we look at reversing stacks.

f_code maintains information about the actual code running in this frame.

The valuestack, stacktop and f_localsplus fields actually manage the frames “scratch space” - the set of locals, stack allocated variables and other values unique to the frame. Relating this to the native execution model, this is the real “stack” for the frame. Unlike the native stack, which is a contiguous piece of memory, each frame generally has its own stack, allocated at the end of the PyFrameObject itself. f_localsplus points to this chunk of memory and f_valuestack is offset to go past local variables and arguments to the first usable stack space. f_stacktop initially points to valuestack, but usually is NULL by the time the frame instructions are being evaluated.

(In the talk and slides, I walk through an execution of a simple set of functions. Reproducing this in text is time-consuming, so I encourage you to look at the slides/video for this.)

Phew! That was a lot of information! That completes our journey into Python frames. We have seen how frames are stored, what information they contain and how basic frame execution works.

There are several standard library modules that let you dig into frames, including inspect and traceback.

With everything that we just covered, we can implement some of their functionality ourselves. This program accesses the top frame in each thread using the sys._current_frames() function, then walks the back pointers to print a stack, similar to what you may get on an exception.

def print_stack():
  for tid, frame in sys._current_frames().items():
    # skip this frame
    frame = frame.f_back
    while frame:
      code = frame.f_code
      print("File {}, line {}, in {}".format(
        code.co_filename,
        frame.f_lineno,
        code.co_name,
      ))
      frame = frame.f_back

def bar():
  print_stack()

def foo():
  bar()

def main():
  foo()

if __name__ == '__main__':
  main()

File dummy.py, line 17, in bar
File dummy.py, line 20, in foo
File dummy.py, line 23, in main
File dummy.py, line 26, in <module>

We access the various attributes of the code object to print the name and line number. f_lineno is a convenience attribute provided by Python. As we saw before, the frame itself only has f_lasti. Alright! That concludes our dive into the interpreter.

Crash reporting

At Dropbox, we used almost everything we learnt so far to successfully capture Python stacks from Crashpad. We were only interested in capturing the function name and line number for each thread, so we don’t need to concern ourselves with the byte code evaluation. We looked at how the data structures are set up in the interpreter. Since they are set up as simple lists, we can ignore all the native C functions the perform the evaluation. That is, we don’t care what the native stack looks like. We are just going to find the structs we care about in memory and work with them.

Remember that Crashpad is a powerful platform that allows us to read the process’s memory, so we can directly access the Python structures we looked at today.

The final problem to solve is, given our crashed program, obtaining a reliable starting point to acquire a reference to a thread state. Unfortunately the interpreter state, the thread states and the frames are all dynamically allocated at essentially random memory addresses. So where should we start?

It’s all just code! I cannot stress this enough. Several times in my career I’ve been blocked on a problem only to realize that it is OK to change code to do what you want it to do!

That said, remember that every time you use something in an undocumented way, you acquire a dependency that could potentially break.

Using thread local storage

In our case, Python interpreter stores the interpreter state in a global autoInterpreterState and it also stores each thread’s PyThreadState in thread local storage, indexed by autoTLSKey, another global.

Thread-local storage is a simple dictionary like structure provided either by OS, that allows multiple threads to use one shared key to access different values. The key itself is the same across threads, but the values are unique. Crashpad captures the thread local storage during crash capture. In fact, we upstreamed a bug fix that made this work correctly on Windows.

In Python, the key to store the thread state is in the autoTLSKey variable in pystate.c. The value is a pointer to a thread state, and is different for thread 1 vs thread 2. If we can access autoTLSkey in a deterministic way, we can access the thread state for each thread.

We leveraged the executable formats on modern OSes to add something called a section, which you can think of as a named range of memory in the executable. This way we can read metadata in the executable and look up the address where autoTLSkey is stored. We modified the Python interpreter to add a named section to indicate where autoTLSKey is stored, then looked up the section in Crashpad.

The next step is to reverse-engineer how Mac and Windows implement TLS, which was a fun problem in itself. We can then find pointers to the PyThreadStates. From there we are in business!

The thread state gives us a reference to the top of the Python stack. From there we walk the f_back pointers to get the full stack.

Once we have a frame, the final step is to resolve it to a code location. Each PyFrame is associated with a Python function and information about the code for the function is in the code object.

typedef struct {
    ...
    int co_firstlineno;  
    ...
    PyObject *co_filename;
    PyObject *co_name;
    PyObject *co_lnotab;
    ...
} PyCodeObject;

The filename and function name are strings. Decoding Python strings could be a talk in itself; here we just assume they are ASCII and read out bytes.

Getting the line number is slightly more complicated. f_lasti is the last byte code instruction executed. co_firstlineno in the code object stores the actual line number in the source file of the first line of this code. To save space, while being able to map every byte code instruction to Python source, co_lnotab is a table storing these as relative offsets from co_firstlineno. We have to use this to do the translation just like Python does.

Conclusion

That’s it! We can put all this together and watch crashes become useful again!

We went from the crash report we saw before to a real python traceback, exactly as if it had been generated by Python itself.

This kind of low level tinkering was not only really fun, but the presence of reliable and actionable crash reporting makes Dropbox engineers significantly more productive, and allows us to spot issues really quickly.

In the past year we have gone from having a low crash report rate, and often unactionable crashes, to generally very actionable and useful crash reporting. We have leveraged several other crashpad features to capture other kinds of errors like Python interpreter issues to problems loading native libraries. Dropbox finished one of the largest Python 2 to Python 3 migrations and our new crash reporting infrastructure was instrumental in letting us have the confidence that major bugs would not sneak out to our users.

This kind of reverse engineering of the interpreter is not just useful for crash reporting. While we had a very specific need here, the knowledge we’ve covered today is widely used across languages to implement sampling profilers. A sampling profiler is a performance investigation tool that inspects the program very frequently, does a similar stack unwind to determine which functions are currently executing, and uses that to generate a report of the most frequently used or longest running functions in your program.

Python provides in-built ways to profile programs using the sys.setprofile hook, but sometimes you can’t change the source code, or you can’t restart a program, or you aren’t willing to take the performance hit.

Enter pyflame and py-spy. These are sampling profilers that use exactly the same techniques we saw today, and that Crashpad uses, of reading program memory from outside the program. Since the program is assumed to be in a good state when these profilers run, they actually execute code inside the program to figure out where the thread states are located, so they don’t need to change the interpreter. I encourage reading their sources.

Wrapping up

For more background on everything I discussed today, it is best to read the Python source code.

The crashpad design offers a good overview of its capabilities and how to use it in your application.

I’d like to thank my colleagues Max Belanger and Jonathan Chien who helped with the experiments and code reviews to get this shipping.

The Crashpad team has created a really good and easy to extend crash reporting system and are very responsive to questions and feedback, hats off to them!