Debugging MacOS file locking with DTrace
A few days ago, I was stymied at work by a set of tests that had intermittent failures on OSX but not Windows. There was a process which would try to obtain an exclusive lock on a file, using the lock-on-open provided by the BSD/MacOS O_EXLOCK flag to open(2). It also used O_NONBLOCK; if the file was locked by another process, it could be skipped. The process would hold the lock and remove (unlink(2)) the file, before close(2)-ing the descriptor. My understanding was that only one process would be able to access the file, ever. The test harness did not agree. It was clear that if multiple processes managed to race on the same file, they could both end up with valid file descriptors. I should clarify that just realizing this took a lot of print statements over several hours across two days. It is unusual to run into unexpected behavior from core system calls, which made me doubt myself a lot.
It was easy to fix the test harness; we had unique IDs in the file and we could just de-duplicate across that. However, the problem kept haunting me (it was making me question my understanding of operating systems!), and so I kept digging around at home.
The first task was to write an extremely simple program that would reproduce the problem. Interacting with POSIX is really easy in Rust and absolves you of having to do C string manipulations, so that is what I used.
Reproducing the problem
(All code was run on OSX 10.12.5 with HFS+.)
The full code is online, along with the sample output shown here and the DTrace script. Here are the core file operations.
// `entry` is a directory entry and `pid` the current process ID, both set up
// earlier in the full program. custom_flags requires
// `use std::os::unix::fs::OpenOptionsExt;` and the `libc` crate.
let mut file = std::fs::OpenOptions::new()
    .read(true)
    // 0x20 is O_EXLOCK on MacOS: lock-on-open, non-blocking thanks to O_NONBLOCK.
    .custom_flags(0x20 /* EXLOCK */ | libc::O_NONBLOCK)
    .open(entry.path());
match file {
    Ok(_) => {
        println!("{} Opened {:?}", pid, entry.path());
    }
    Err(err) => {
        println!("{} {:?} {:?}", pid, entry.path(), err);
        continue;
    }
}

// Simulate some work while holding the lock.
std::thread::sleep(std::time::Duration::from_millis(2));

match std::fs::remove_file(entry.path()) {
    Ok(_) => {}
    Err(err) => {
        println!("!!!!! {} Could not remove {:?} {:?}",
                 pid,
                 entry.path(),
                 err);
    }
}
We try to open the file with the requisite flags, and stop if it fails. We sleep for a bit to simulate the work the original program was doing, and also because it helps make the problem easier to reproduce. We try to remove the file. Finally, we rely on std::fs::File’s destructor close(2)-ing the file descriptor when file goes out of scope.
Running two instances of this program on some sample files would sometimes lead to output like:
79011 Opened "./0.sample"
79012 Opened "./0.sample"
79011 Opened "./1.sample"
!!!!! 79012 Could not remove "./0.sample" Error { repr: Os { code: 2, message: "No such file or directory" } }
The first column on each line is the process ID. This didn’t make any sense. 79011 was opening the file successfully, which meant the file must be locked. Its unlink wasn’t failing, and yet 79012 had managed to successfully open a file that had been deleted (as evidenced by its inability to remove it). If 79012’s open was succeeding, that meant 79011 had already released the lock, so it must have removed the file. If you are following along with the code, you may need to run the program several times, as instructed in the README, to see the issue happen.
Standard print statements and debuggers are useless for debugging this. With two different processes, we can’t rely on the individual logs being ordered, which means we can’t reason precisely. Since debuggers can’t cross system call boundaries, we can’t use them to peek under the hood. Plus, a debugger may throw off the timing and keep the problem from reproducing. Enter DTrace.
DTrace is a tracing framework, originally designed for Solaris, that allows inspecting system internals at very low cost, and at zero cost when disabled. It was ported to MacOS a few years ago. I’ve been aware of DTrace for a few years, but never had the opportunity to use it until now.
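As a taste of what driving DTrace directly looks like, a one-liner such as the following aggregates the system calls made by any process with a given name. (The process name here is a placeholder and the one-liner is purely illustrative; it is not part of the investigation below.)
# Count system calls by name for processes called "myprog" (placeholder name);
# DTrace prints the aggregation when the trace is stopped with Ctrl-C.
sudo dtrace -n 'syscall:::entry /execname == "myprog"/ { @[probefunc] = count(); }'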
I’m no DTrace expert, so I first turned to the scripts OSX ships with (which are all based on work by Brendan Gregg) to see if there was something that would do the job. It turns out there is a wrapper called opensnoop that tracks open(2) calls, including filtering by PID or process name. We want to track not just open(2), but also close(2) and unlink(2). The modified script is uncreatively called openclosesnoop[1]. Here is the sample output from the above run (you have to launch the snoop before you run the program).
$ sudo ./openclosesnoop -a -g -F -n osx-unlink-violation
# re-aligned for readability.
open 57958449740 2017 Aug 3 20:58:05 605572857 79011 3 0x01000024 0 ./0.sample osx-unlink-violation\0
unlink 57958452261 2017 Aug 3 20:58:05 605572857 79011 0 0 0 ./0.sample osx-unlink-violation\0
close 57958452307 2017 Aug 3 20:58:05 605572857 79011 0 0 0 3 osx-unlink-violation\0
open 57958452354 2017 Aug 3 20:58:05 605572857 79012 3 0x01000024 0 ./0.sample osx-unlink-violation\0
unlink 57958454769 2017 Aug 3 20:58:05 605572857 79012 -1 0 2 ./0.sample osx-unlink-violation\0
close 57958454884 2017 Aug 3 20:58:05 605572857 79012 0 0 0 3 osx-unlink-violation\0
We see that our intuitions are at least correct in terms of syscall ordering. We also have a reliable ordering of events, and DTrace shows us the arguments and results of the system calls. All these lines are printed when the system call returns. The output columns are:
- System call name.
- A timestamp maintained by DTrace, in microseconds since some arbitrary point in the past; it should only be used for relative ordering.
- Wall-clock time.
- UID.
- PID.
- Return value of the system call.
- Flags passed to the system call (0 for unlink and close).
- errno (0 indicates success).
- Path for open and unlink; file descriptor for close.
- Program name.
We see that the open calls succeed and return file descriptor 3 in both processes. 79011’s unlink succeeds; 79012’s fails. Both are able to open and close the file. Clearly the locking isn’t being violated, and neither is unlink; the second call simply fails. What else could be wrong? Let’s try tracking not just when system calls return, but also when they are entered.
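In DTrace terms, that just means firing on a system call’s :entry probe in addition to its :return probe. The actual change lives in openclosesnoop; stripped down to the bare idea, it looks roughly like the sketch below (this is not the script’s real source — the open* pattern also matches variants such as open_nocancel, and the process name is the one from the runs above):
/* quiet suppresses the default CPU/ID/probe columns. */
#pragma D option quiet

/* When our test program enters open(2), print the PID and the requested path. */
syscall::open*:entry
/execname == "osx-unlink-violation"/
{
    printf("%d open entry %s\n", pid, copyinstr(arg0));
}

/* When the call returns, print the result (the descriptor, or -1) and errno. */
syscall::open*:return
/execname == "osx-unlink-violation"/
{
    printf("%d open return fd=%d errno=%d\n", pid, arg1, errno);
}
Here is the output from another run, with only open being tracked at entry, since that is the call messing up our assumptions.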
126740840573 75938 open entry ./7.sample
open 126740840651 2017 Aug 6 10:35:49 605572857 75938 3 0x01000024 0 ./7.sample osx-unlink-viola\0
126740843072 75939 open entry ./7.sample
unlink 126740843128 2017 Aug 6 10:35:49 605572857 75938 0 0 0 ./7.sample osx-unlink-viola\0
close 126740843155 2017 Aug 6 10:35:49 605572857 75938 0 0 0 3 osx-unlink-viola\0
open 126740843189 2017 Aug 6 10:35:49 605572857 75939 3 0x01000024 0 ./7.sample osx-unlink-viola\0
unlink 126740845607 2017 Aug 6 10:35:49 605572857 75939 -1 0 2 ./7.sample osx-unlink-viola\0
close 126740845698 2017 Aug 6 10:35:49 605572857 75939 0 0 0 3 osx-unlink-viola\0
That is very interesting! 75939 enters open pretty soon after 75938 has successfully opened 7.sample. Then 75939 is scheduled off the CPU, and 75938’s unlink and close calls are scheduled and run to completion. 75939’s open then completes successfully! Its attempt to unlink fails, since the file has already been unlinked and can no longer be accessed by path.
What’s going on?
Now that we know we enter open(2) and then get booted off the CPU, let’s go back and read the man page:
When opening a file, a lock with flock(2) semantics can be obtained by setting
O_SHLOCK for a shared lock, or O_EXLOCK for an exclusive lock. If creating a
file with O_CREAT, the request for the lock will never fail (provided that the
underlying filesystem supports locking).
open(2) is acquiring a file descriptor with no locks, and it manages to do this before 75938’s unlink has occurred. This ensures that 75939 now has the file open, so unlink will not reclaim the file’s resources, but the link count drops to zero and the file is no longer accessible by path. When control returns to this process (or rather, to the system call running on its behalf), the file’s lock (remember that the file itself is independent of the link count, and is usually identified by an inode or similar) has already been released, and the flock(2) portion finishes successfully.
That is, open(path, O_EXLOCK | O_NONBLOCK | O_RDONLY) maps roughly to:
int fd = open(path, O_RDONLY | O_NONBLOCK);
if (fd == -1)
    return -1;
// Stops here while the other process's unlink and close occur; its lock is
// released and the flock below succeeds.
int r = flock(fd, LOCK_EX | LOCK_NB);
return r;
and behaves just like the above would if it ran in userspace. The kernel provides no atomicity on the path name.
The root cause was a failure to read the fine print, and filling in assumptions where things were left unsaid. POSIX does not require open(2) to be atomic in the general case. Certain pathname operations are atomic, but only the open(path, O_CREAT | O_EXCL, ...) operation falls into that category.
This was a really fun puzzle to solve. It is remarkable how easy it is to debug issues when you can see exactly what the kernel is doing, with great precision and fidelity. I’ve mainly used DTrace for syscall tracing, but it is capable of a lot more, including inspecting I/O events, mutex/lock states and process scheduling. MacOS ships with several out-of-the-box scripts, which can be listed with man -k dtrace. I highly encourage getting familiar with the tool. If you are on Linux, eBPF is a similar alternative.
[1] It took me a while to realize close(2) maps to close_nocancel in DTrace.