Using Windows Job Objects for Process Tree Management
Using child processes to perform various tasks is a standard construct in larger programs. The simplest reason is that this gets you memory isolation and resource management for free, with the OS managing scheduling, file descriptors, and other resources. A common requirement when using multiple processes is the ability to wait on or kill one or more of these children.
It is not always possible to record process IDs at fork(), since the fork may happen in a library that does not give you such access. Fortunately, on UNIX, we have an easy way out due to the way the process model is standardized.
For the remainder of this post we are concerned with the following question, given the above caveat: How do we kill all our child processes, but leave others untouched?
Each process has a well-defined parent process, which can be obtained in-process by using getppid(). Accessing it out of process is more platform dependent, and usually involves shelling out to ps (on Linux this would read /proc/<pid>/stat). Once you do that and parse the output (abstracted away by various libraries, such as psutil for Python), it is easy to filter for processes whose PPID is our own process ID (obtained by getpid()).
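As a rough illustration of the out-of-process lookup on Linux, the sketch below reads the PPID straight from /proc/<pid>/stat instead of shelling out to ps. The helper names parse_ppid and children_of are made up for this example, and the proc_root parameter exists only so the scan can be pointed at a different directory. psutil does the same job more robustly and portably.

```python
import os


def parse_ppid(stat_line):
    """Extract the PPID from the contents of /proc/<pid>/stat."""
    # The second field (the command name) is wrapped in parentheses and may
    # itself contain spaces or ')', so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is the process state, after_comm[1] is the PPID.
    return int(after_comm[1])


def children_of(parent_pid, proc_root="/proc"):
    """Return the PIDs whose PPID equals parent_pid (Linux only)."""
    children = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                ppid = parse_ppid(f.read())
        except (OSError, IndexError, ValueError):
            continue  # the process exited while we were scanning
        if ppid == parent_pid:
            children.append(int(entry))
    return sorted(children)
```

Calling children_of(os.getpid()) then gives the PPID-based list of "our" children discussed here.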
Windows also has a PID and PPID concept, although the functions to obtain these are different (hence psutil!).
The devil is in the details! Process IDs are a finite resource and all operating systems will re-use them. Let’s say that some time ago, a process was launched with PID 5. It launched several children, then terminated. The children keep running. Eventually the PID space is exhausted and the kernel wraps around. A new process of the same application is now launched and ends up, entirely by coincidence, getting PID 5. It could now run around killing these children, even though it is NOT their owner.
The finite PID space makes the PPID an unreliable way of identifying ownership. On UNIX we can use another aspect of processes to make this reliable, with zero effort.
On UNIX, if a parent spawns N children, then dies without wait()ing on them, the kernel will reparent them to PID 1 [1]. PID 1 is reserved for an init-like process, and since the kernel will not start without one and will halt when PID 1 dies, PID 1 is guaranteed to be present. Both getppid() in the children and ps will show the PPID as 1 after such an event. In our hypothetical example, as soon as PID 5 terminates, all the children’s PPID is now 1. The new process with PID 5 can simply ignore them.
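To make the PID 5 scenario concrete, here is a toy model of a process table with a tiny, wrapping PID space. Everything in it (ToyProcessTable, stale_children_after_pid_reuse, the PID limit of 8) is invented for illustration and is not how any real kernel is implemented. It only shows how the PPID filter behaves with and without reparenting to init.

```python
INIT_PID = 1


class ToyProcessTable:
    """A tiny model of a process table with a small, wrapping PID space."""

    def __init__(self, max_pid, reparent_orphans):
        self.max_pid = max_pid
        self.reparent_orphans = reparent_orphans
        self.ppid_of = {INIT_PID: 0}  # pid -> ppid for live processes
        self.next_pid = INIT_PID + 1

    def spawn(self, ppid):
        # Hand out the next free PID, wrapping around at max_pid.
        for _ in range(self.max_pid):
            pid = self.next_pid
            self.next_pid = pid + 1 if pid < self.max_pid else INIT_PID + 1
            if pid not in self.ppid_of:
                self.ppid_of[pid] = ppid
                return pid
        raise RuntimeError("PID space exhausted")

    def exit(self, pid):
        del self.ppid_of[pid]
        if self.reparent_orphans:
            # UNIX-style behaviour: orphans are handed to init.
            for child, ppid in list(self.ppid_of.items()):
                if ppid == pid:
                    self.ppid_of[child] = INIT_PID

    def children_of(self, pid):
        return sorted(p for p, ppid in self.ppid_of.items() if ppid == pid)


def stale_children_after_pid_reuse(reparent_orphans):
    table = ToyProcessTable(max_pid=8, reparent_orphans=reparent_orphans)
    old_parent = table.spawn(INIT_PID)
    for _ in range(2):
        table.spawn(old_parent)  # children that outlive their parent
    table.exit(old_parent)
    # Churn through short-lived processes until the old PID is handed out again.
    while True:
        new_pid = table.spawn(INIT_PID)
        if new_pid == old_parent:
            break
        table.exit(new_pid)
    # Processes the PPID filter would wrongly attribute to the new process.
    return table.children_of(new_pid)
```

Without reparenting, stale_children_after_pid_reuse(False) returns [3, 4]: the new process would "own" two children it never started. With reparenting, stale_children_after_pid_reuse(True) returns [], because the orphans now report init as their parent.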
On Windows, there is no init and no hierarchy. So how can we solve the problem on Windows? Enter Job Objects. The documentation is very comprehensive, but here is what we need:
A job object allows groups of processes to be managed as a unit.
Sounds like what we want! In addition:
After a process is associated with a job, by default any child processes it creates using CreateProcess are also associated with the job.
This lets us preserve “tree-ness”. Finally, QueryInformationJobObject allows us to retrieve a list of process IDs associated with a job.
We can put all of this together to achieve what we want:
- Create a job object in the parent and assign the parent to it before it spawns any children. All children will now inherit the same job object.
- When we need to kill children, use QueryInformationJobObject with the JobObjectBasicProcessIdList parameter to retrieve all process IDs associated with the job [2].
Here is an example:
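    # Assumes pywin32 (win32api, win32con, win32job) and Python 3.
    import os
    import subprocess

    from win32api import GetCurrentProcess, OpenProcess, TerminateProcess
    from win32con import PROCESS_TERMINATE
    from win32job import (AssignProcessToJobObject, CreateJobObject,
                          JobObjectBasicProcessIdList, QueryInformationJobObject)

    print("parent started", os.getpid())
    job = CreateJobObject(None, "my-first-job")
    # Put ourselves in the job so that every child we spawn inherits it.
    AssignProcessToJobObject(job, GetCurrentProcess())
    for i in range(3):
        subprocess.Popen("python main.py /child")
    input("press Enter to kill all child processes:")
    job_processes = QueryInformationJobObject(job, JobObjectBasicProcessIdList)
    for pid in job_processes:
        if pid == os.getpid():  # Don't kill ourselves
            continue
        child_handle = OpenProcess(PROCESS_TERMINATE, True, pid)
        # Here you could use IsProcessInJob(child_handle, job) to be
        # absolutely sure (the handle then also needs PROCESS_QUERY_INFORMATION).
        TerminateProcess(child_handle, 1)
        print("Killed", pid)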
The full code is on GitHub.
[1] While writing this post, I also discovered UNIX process groups, which seem useful in certain cases.
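For the curious, here is a minimal sketch of what using a process group could look like from Python on UNIX. It is not the approach taken in this post, and the helper names spawn_in_new_group and kill_group are invented for the example. It relies on subprocess's start_new_session option, which makes the child call setsid() and so become the leader of a new process group.

```python
import os
import signal
import subprocess
import sys


def spawn_in_new_group(args):
    # start_new_session=True makes the child call setsid() before exec, so it
    # leads a fresh process group; processes it spawns join that group by default.
    return subprocess.Popen(args, start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM):
    # After setsid() the child's process group ID equals its own PID.
    os.killpg(proc.pid, sig)


if __name__ == "__main__":
    child = spawn_in_new_group(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    kill_group(child)
    child.wait()
    print("child exit status:", child.returncode)
```

One caveat is that group membership is voluntary: a descendant can move itself into another group or session, after which killpg() no longer reaches it.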
[2] There is still a small window between retrieving the list of PIDs and killing them, during which a child could terminate and a new process could be launched with the same PID, in which case our logic goes for a toss. As long as your code isn’t doing a long sleep, this seems pretty unlikely. A mitigation would be to check that the PID was indeed part of the job using IsProcessInJob.
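A sketch of that mitigation follows, with the Windows calls passed in as plain callables so the control flow can be read (and tried) on any platform. The function terminate_job_members and its parameters are hypothetical names for this example. The convention that open_process returns None on failure is also an assumption of the sketch, not something the Windows API does by itself.

```python
def terminate_job_members(pids, own_pid, open_process, is_in_job, terminate):
    """Terminate each PID only after confirming it still belongs to our job.

    open_process(pid) -> handle, or None if the process can no longer be opened
    is_in_job(handle) -> bool
    terminate(handle) -> None
    """
    killed = []
    for pid in pids:
        if pid == own_pid:
            continue  # never kill ourselves
        handle = open_process(pid)
        if handle is None:
            continue  # the process already exited
        # Check membership on the handle we are about to terminate, so a
        # recycled PID that does not belong to the job is skipped.
        if is_in_job(handle):
            terminate(handle)
            killed.append(pid)
    return killed
```

With pywin32, open_process could wrap win32api.OpenProcess (catching the error raised for a dead PID and returning None), is_in_job could wrap win32job.IsProcessInJob(handle, job), and terminate could wrap win32api.TerminateProcess(handle, 1). Note that IsProcessInJob needs a handle opened with PROCESS_QUERY_INFORMATION (or PROCESS_QUERY_LIMITED_INFORMATION) in addition to PROCESS_TERMINATE.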