Debugging MacOS file locking with DTrace

Posted on Aug 15, 2017

A few days ago, I was stymied at work by a set of tests that had intermittent failures on OSX but not Windows.

There was a process which would try to obtain an exclusive lock on a file, using the lock-on-open provided by the BSD/MacOS O_EXLOCK flag to open(2). It also used O_NONBLOCK; if the file was locked by another process, it could be skipped. The process would hold the lock and remove (unlink(2)) the file, before close(2)-ing the descriptor. My understanding was that only one process would be able to access the file, ever. The test harness did not agree. It was clear that if multiple processes managed to race on the same file, they could both end up with valid file descriptors. I should clarify that just realizing this took a lot of print statements over several hours over 2 days. It is unusual to run into unexpected behavior from core system calls, which made me doubt myself a lot.

It was easy to fix the test harness; we had unique IDs in the file and we could just de-duplicate across that. However, the problem kept haunting me (it was making me question my understanding of operating systems!), and so I kept digging around at home.

The first task was to write an extremely simple program that would reproduce the problem. Interacting with POSIX is really easy in Rust and absolves you of having to do C string manipulations, so that is what I used.

Reproducing the problem

(All code was run on OSX 10.12.5 with HFS+.)

The full code is online, including the sample output included here and the DTrace script. Here are the core file operations.

let mut file = std::fs::OpenOptions::new()
    .read(true)
    .custom_flags(0x20 /* EXLOCK */ | libc::O_NONBLOCK)
    .open(entry.path());
match file {
    Ok(_) => {
        println!("{} Opened {:?}", pid, entry.path());
    }
    Err(err) => {
        println!("{} {:?} {:?}", pid, entry.path(), err);
        continue;
    }
}
std::thread::sleep(std::time::Duration::from_millis(2));
match std::fs::remove_file(entry.path()) {
    Ok(_) => {}
    Err(err) => {
        println!("!!!!! {} Could not remove {:?} {:?}",
                 pid,
                 entry.path(),
                 err);
    }
}

We try to open the file with the requisite flags, and stop if it fails. We sleep for a bit to simulate the work the original program was doing, and also because it helps make the problem easier to reproduce. We try to remove the file. Finally, we rely on std::fs::File’s destructor close(2)ing the file descriptor when file goes out of scope.

Running two instances of this program on some sample files would sometimes lead to output like:

79011 Opened "./0.sample"
79012 Opened "./0.sample"
79011 Opened "./1.sample"
!!!!! 79012 Could not remove "./0.sample" Error { repr: Os { code: 2, message: "No such file or directory" } }

The first column on each line is the Process ID. This didn’t make any sense. 79011 was opening the file successfully, which meant it must be locked. Its unlink wasn’t failing, and yet, 79012 had managed to successfully open a file that had been deleted (as evidenced by the inability to remove it). If the open was succeeding, that meant 79011 had already released the lock, so it must have removed the file. If you are following along with the code, you may need to run the program as instructed in the README, several times, to see the issue happen.

Standard print statements and debuggers are useless to debug this. With two different processes, we can’t rely on the individual logs to be ordered, which means we can’t reason precisely. Since debuggers can’t cross system call boundaries, we can’t use them to peek under the hood. Plus, a debugger may throw off the timing and not cause the problem to reproduce. Enter DTrace.

DTrace is a tracing framework originally designed for Solaris, that allows inspecting system internals at a very low cost, and is zero cost when disabled. It was ported to MacOS a few years ago. I’ve been aware of DTrace for a few years, but never had the opportunity to use it till now.

I’m no DTrace expert, so I first turned to the scripts OSX ships with (which are all based on work by Brendan Gregg), to see if there was something that would do the job. Turns out there is a wrapper called opensnoop that tracks open(2) calls, including filtering by PID or process names. We want to track not just open(2), but also close(2) and unlink(2). The modified script is uncreatively called openclosesnoop ¹. Here is the sample output from the above run (You’ve to launch the snoop before you run the program).

$ sudo ./openclosesnoop -a -g -F -n osx-unlink-violation
# re-aligned for readability.
open   57958449740    2017 Aug  3 20:58:05 605572857  79011   3 0x01000024 0 ./0.sample osx-unlink-violation\0
unlink 57958452261    2017 Aug  3 20:58:05 605572857  79011   0 0          0 ./0.sample osx-unlink-violation\0
close  57958452307    2017 Aug  3 20:58:05 605572857  79011   0 0          0 3          osx-unlink-violation\0
open   57958452354    2017 Aug  3 20:58:05 605572857  79012   3 0x01000024 0 ./0.sample osx-unlink-violation\0
unlink 57958454769    2017 Aug  3 20:58:05 605572857  79012  -1 0          2 ./0.sample osx-unlink-violation\0
close  57958454884    2017 Aug  3 20:58:05 605572857  79012   0 0          0 3          osx-unlink-violation\0

We see that our intuitions are at least correct in terms of syscall ordering. We also have a reliable ordering of events, and DTrace shows us the arguments an results of the system calls. All these lines are printed when the system call returns. The output columns are:

system call name
a timestamp maintained by DTrace in microseconds since some arbitrary point in the past. It should only be used for relative ordering.
Wall time.
UID
PID
Return value of the system call.
Flags passed to system call. 0 for unlink and close.
errno. 0 indicates success.
Path for open and unlink, file descriptor for close.
Program name

We see that the open calls succeed and lead to file descriptor 3 for both programs. The 79011 unlink succeeds, the 79012 fails. Both are able to open and close the file. Clearly the locking isn’t being violated, nor is unlink; the second call fails. What else could be wrong? Let’s try tracking not just when system calls return, but when they are entered. Here is the output from another run, with only open being tracked for entry, since that is the one messing up our assumptions.

126740840573   75938 open entry ./7.sample
open   126740840651   2017 Aug  6 10:35:49 605572857  75938   3 0x01000024 0 ./7.sample osx-unlink-viola\0
126740843072   75939 open entry ./7.sample
unlink 126740843128   2017 Aug  6 10:35:49 605572857  75938   0 0          0 ./7.sample osx-unlink-viola\0
close  126740843155   2017 Aug  6 10:35:49 605572857  75938   0 0          0 3          osx-unlink-viola\0
open   126740843189   2017 Aug  6 10:35:49 605572857  75939   3 0x01000024 0 ./7.sample osx-unlink-viola\0
unlink 126740845607   2017 Aug  6 10:35:49 605572857  75939  -1 0          2 ./7.sample osx-unlink-viola\0
close  126740845698   2017 Aug  6 10:35:49 605572857  75939   0 0          0 3          osx-unlink-viola\0

That is very interesting! 75939 enters open pretty soon after 75938 has successfully opened 7.sample. Then, 75939 is scheduled off the CPU, 75938’s unlink and close calls are scheduled and run to completion. 75939’s open then completes successfully! Its attempt to unlink fails, since the file has been unlinked and can no longer be accessed by file path.

What’s going on?

Now that we know we enter open(2) then get booted off CPU, let’s go back and read the man page.

When opening a file, a lock with flock(2) semantics can be obtained by setting
O_SHLOCK for a shared lock, or O_EXLOCK for an exclusive lock.  If creating a
file with O_CREAT, the request for the lock will never fail (provided that the
underlying filesystem supports locking).

open(2) is acquiring a file descriptor with no locks. It manages to do this before 75938’s unlink has occurred. This ensures that 75939 now has the file open, so unlink will not reclaim resources, but the link count will get decremented to zero and the file no longer accessible by path. When control returns to this process (or rather to the system call running on its behalf), the file’s lock (remember the file is independent of link count, usually identified by an inode or similar) has been released and the flock(2) portion finishes successfully.

That is, open(O_EXLOCK | O_NONBLOCK | O_RDONLY) maps roughly to:

int fd = open(O_RDONLY | O_NONBLOCK);
if (fd == -1)
  return;
// stops here while unlink and close occur. the first flock is released and the
// one below succeeds.
int r = flock(fd, LOCK_EX | LOCK_NB);
return r;

and behaves just like the above would if it ran in userspace. The kernel provides no atomicity on the path name.

The root cause was a failure to read the fine print, and filling in assumptions where things were left unsaid. POSIX does not require open(2) to be atomic in the general case. Certain pathname operations are atomic, but only the open(path, O_CREAT | O_EXCL, ...) operation falls into that category.

This was a really fun puzzle to solve. It is remarkable how easy it is to debug issues when you can see exactly what the kernel is doing, with great precision and fidelity. I’ve mainly used DTrace for syscall tracing, but it is capable of a lot more, including inspecting I/O events, mutex/lock states and process scheduling. MacOS ships with several out-of-the-box scripts that can be listed with man -k dtrace. I highly encourage getting familiar with the tool. If you are on Linux, eBPF is a similar alternative.

It took me a while to realize close(2) maps to close_nocancel in DTrace. ↩︎