Using Windows Job Objects for Process Tree Management

Using child processes to perform various tasks is a standard construct in larger programs. The simplest reason is that it gets you memory isolation and resource management for free, with the OS managing scheduling, file descriptors, and other resources. A common requirement when using multiple processes is the ability to wait on or kill one or more of these children.

It is not always possible to record process IDs at fork(), since the fork may happen in a library that does not give you such access. Fortunately, on UNIX, we have an easy way out thanks to the way the process model is standardized.

For the remainder of this post we are concerned with the following question, given the above caveat: How do we kill all our child processes, but leave others untouched?

Each process has a well-defined parent process, whose ID it can obtain in-process with getppid(). Obtaining it from outside the process is more platform dependent, and usually involves shelling out to ps (which on Linux reads /proc/<pid>/stat). Once you do that and parse the output (abstracted away by various libraries, such as psutil for Python), it is easy to filter for processes whose PPID is our own process ID (obtained by getpid()) and kill() them.
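As a rough illustration of the filtering step, here is a minimal, Linux-only sketch that reads /proc directly instead of shelling out to ps. The helper names parse_ppid and direct_children are made up for this example; in practice a library such as psutil does this for you.

```python
import os


def parse_ppid(stat_text):
    """Return the PPID field from the contents of /proc/<pid>/stat."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split only after the LAST ')'.
    after_comm = stat_text[stat_text.rindex(")") + 1:].split()
    # after_comm[0] is the process state, after_comm[1] is the PPID.
    return int(after_comm[1])


def direct_children(parent_pid=None):
    """Return PIDs whose PPID equals parent_pid (default: this process)."""
    if parent_pid is None:
        parent_pid = os.getpid()
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                if parse_ppid(f.read()) == parent_pid:
                    children.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return children
```

Each PID returned by direct_children() could then be passed to os.kill(). Note that this only finds direct children, not grandchildren.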

Windows also has a PID and PPID concept, although the functions to obtain these are different (hence psutil!). The devil is in the details! Process IDs are a finite resource and all operating systems will re-use them. Let’s say that some time ago, a process was launched with PID = 5. It launched several children, then terminated. The children keep running. Eventually the PID space is exhausted and the kernel wraps around. A new instance of the same application is now launched and ends up, entirely by coincidence, getting the PID 5. It could now run around killing these children, even though it is NOT their owner.

The finite process ID space makes PPID an unreliable way of identifying ownership. On UNIX we can use another aspect of processes to make this reliable, with zero effort.

On UNIX, if a parent spawns N children, then dies without wait()ing on them, the kernel will reparent them to PID 1 [1]. PID 1 is reserved for an init-like process, and since the kernel will not start without one and will halt when PID 1 dies, PID 1 is guaranteed to be present. After such an event, getppid() in the children, or ps, will show the PPID as 1. In our hypothetical example, as soon as PID 5 terminates, the PPID of all its children becomes 1. The new process with PID 5 can simply ignore them.
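To make the difference concrete, here is a toy model of the PID 5 scenario. It makes no real OS calls: the process table is just a dict mapping PID to PPID, and all function names are invented for illustration.

```python
def exit_process(table, pid, reparent_to_init):
    """Remove pid from the table; optionally hand its orphans to PID 1."""
    del table[pid]
    if reparent_to_init:
        for child in list(table):
            if table[child] == pid:
                table[child] = 1


def children_by_ppid(table, me):
    """The naive ownership test: every process whose PPID is `me`."""
    return sorted(pid for pid, ppid in table.items() if ppid == me)


def scenario(reparent_to_init):
    # PID 1 is init; PID 5 was started by init and spawned PIDs 6 and 7.
    table = {1: 0, 5: 1, 6: 5, 7: 5}
    exit_process(table, 5, reparent_to_init)
    table[5] = 1  # an unrelated new process receives the recycled PID 5
    return children_by_ppid(table, 5)


print(scenario(reparent_to_init=True))   # [] : orphans now belong to PID 1
print(scenario(reparent_to_init=False))  # [6, 7] : stale PPIDs look like ours
```

With reparenting, the new PID 5 sees no children. Without it, the stale PPIDs make processes 6 and 7 look like its own.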

On Windows, there is no init and no hierarchy. So how can we solve the problem on Windows? Enter Job Objects. The documentation is very comprehensive, but here is what we need:

A job object allows groups of processes to be managed as a unit.

Sounds like what we want! In addition:

After a process is associated with a job, by default any child processes it creates using CreateProcess are also associated with the job.

This lets us preserve “tree-ness”. Finally, QueryInformationJobObject allows us to retrieve a list of process IDs associated with a job.

We can put all of this together to achieve what we want:

  1. Create a job object in the parent and assign the parent to it before it spawns any children. All children will now inherit the same job object.
  2. When we need to kill children, use QueryInformationJobObject() with the JobObjectBasicProcessIdList information class to retrieve all process IDs associated with the job [2], then terminate each of them except ourselves.

Here is an example:

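# Assumes the pywin32 package (win32api, win32con, win32job); Windows only.
import os
import subprocess

from win32api import GetCurrentProcess, OpenProcess, TerminateProcess
from win32con import PROCESS_TERMINATE
from win32job import (AssignProcessToJobObject, CreateJobObject,
                      JobObjectBasicProcessIdList, QueryInformationJobObject)

print("parent started", os.getpid())
job = CreateJobObject(None, "my-first-job")
# Put ourselves in the job so that every child we spawn joins it too.
AssignProcessToJobObject(job, GetCurrentProcess())
for i in range(3):
    subprocess.Popen("python /child")

input("press Enter to kill all child processes:")

# Query the job we created for the PIDs of all its member processes.
job_processes = QueryInformationJobObject(job, JobObjectBasicProcessIdList)
for pid in job_processes:
    if pid == os.getpid():  # Don't kill ourselves
        continue
    child_handle = OpenProcess(PROCESS_TERMINATE, False, pid)
    # Here you could use IsProcessInJob(child_handle, job) to be
    # absolutely sure (the handle would then also need
    # PROCESS_QUERY_INFORMATION access).
    TerminateProcess(child_handle, 1)
    print("Killed", pid)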

The full code is on GitHub.

While writing this post, I also discovered UNIX process groups, which seem useful in certain cases.
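For the curious, here is a minimal POSIX-only sketch of the process group idea. The helper names and the child command are invented for this example. It relies on the start_new_session option of subprocess.Popen, which runs setsid() in the child so that the child leads a new process group whose ID equals its PID.

```python
import os
import signal
import subprocess
import sys

# Hypothetical child: spawns one grandchild, then both just sleep.
CHILD_CODE = (
    "import subprocess, sys, time\n"
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "time.sleep(60)\n"
)


def spawn_in_new_group():
    """Start the child as leader of a fresh session and process group."""
    return subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                            start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM):
    """Signal every process in the child's group, then reap the child."""
    # The group ID equals the leader's PID because the child called setsid().
    os.killpg(proc.pid, sig)
    return proc.wait()
```

Calling kill_group(spawn_in_new_group()) signals the child and any grandchildren that stayed in its group. A descendant that moves itself into another group or session escapes this, which is one reason it is only useful in certain cases.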

  1. Here be dragons! Linux allows processes to customize this using prctl() (PR_SET_CHILD_SUBREAPER), in which case orphans are reparented to the nearest subreaper ancestor instead of PID 1.
  2. There is still a small window between retrieving the list of PIDs and killing them, during which a child could terminate and a new process could be launched with the same PID, in which case our logic goes for a toss. As long as your code isn’t doing a long sleep, this seems pretty unlikely. A mitigation would be to check that the PID is indeed part of our job using IsProcessInJob().