Python Gotcha: Idiomatic file iteration has bad performance
Here is a performance footgun I encountered at work in a more complicated form.
Python allows iterating over a file object. However, this iteration is defined as yielding lines, regardless of whether the file is a text or binary file. In fact, the IOBase documentation specifically says:
IOBase (and its subclasses) supports the iterator protocol, meaning that an IOBase object can be iterated over yielding the lines in a stream. Lines are defined slightly differently depending on whether the stream is a binary stream (yielding bytes), or a text stream (yielding character strings). See readline() below.
and then goes on to say that in a binary file, b'\n' will be used as the separator, while in text files it can be controlled by the newline argument to open().
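To see this in action, here is a minimal sketch (the file name and contents are made up): iterating a binary file handle yields b'\n'-delimited chunks even when the bytes have nothing to do with lines.

# Hypothetical example: write a few bytes containing b'\n', then iterate.
with open("some_binary_file.bin", "wb") as f:
    f.write(b"abc\ndef\x00\x01\nxyz")

with open("some_binary_file.bin", "rb") as f:
    print(list(f))  # [b'abc\n', b'def\x00\x01\n', b'xyz']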
While convenient, this is far from optimal for I/O intensive programs. I would say this default behavior isn’t a good idea for anything larger than human-created text files, and its presence in the base class of all I/O objects is certainly some kind of path dependence. Throughout the ecosystem, there are lots of places that iterate over files this way, and there are other places that go out of their way to inefficiently implement this contract, leaving a lot of performance on the table.
Take this simple comparison:
# read_iter.py
with open("ubuntu.iso", "rb") as f:
    for chunk in f:
        pass

# read_bytes.py
with open("ubuntu.iso", "rb") as f:
    while True:
        data = f.read(1024 * 1024)
        if not data:
            break
The first reads using readline() and the second reads in 1MB chunks.
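If you prefer to keep a for loop, the two-argument form of iter() gives you the same fixed-size reads; this is just an equivalent sketch of read_bytes.py, not a third benchmarked variant.

from functools import partial

with open("ubuntu.iso", "rb") as f:
    # iter(callable, sentinel) keeps calling f.read(1 MB) until it
    # returns b"", yielding 1MB chunks instead of b'\n'-delimited lines.
    for chunk in iter(partial(f.read, 1024 * 1024), b""):
        pass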
I ran this on a 4.6GB Ubuntu 23.04 AMD64 ISO, as a random example of a large binary file.
> hyperfine -r 5 'python3 read_iter.py' 'python3 read_bytes.py'
Benchmark 1: python3 read_iter.py
  Time (mean ± σ):      2.419 s ±  0.094 s    [User: 1.717 s, System: 0.701 s]
  Range (min … max):    2.332 s …  2.558 s    5 runs

Benchmark 2: python3 read_bytes.py
  Time (mean ± σ):     379.8 ms ±  13.5 ms    [User: 9.5 ms, System: 370.0 ms]
  Range (min … max):   365.2 ms … 394.8 ms    5 runs

Summary
  'python3 read_bytes.py' ran
    6.37 ± 0.33 times faster than 'python3 read_iter.py'
6 times faster!
To pick on a widely used library (and the one that bit me): as recently as 3 weeks ago, requests was using this form to iterate over the request body when it was chunked.
for i in request.body:
    low_conn.send(hex(len(i))[2:].encode("utf-8"))
    low_conn.send(b"\r\n")
    low_conn.send(i)
    low_conn.send(b"\r\n")
low_conn.send(b"0\r\n\r\n")
This is particularly egregious because it ends up descending into the send() system call, which means it pays for both read and write costs, slowing things down on both ends.
Fortunately it delegates to urllib3 now, which seems to do it better.
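For illustration only, here is a rough sketch of the same chunked transfer encoding driven by explicit fixed-size reads. It assumes request.body is a readable binary file object and reuses the low_conn names from the snippet above; it is not the actual requests or urllib3 code.

BLOCK_SIZE = 1024 * 1024  # assumed block size, tune as needed

while True:
    block = request.body.read(BLOCK_SIZE)
    if not block:
        break
    # One reasonably sized chunk per send(), regardless of where b'\n'
    # happens to fall in the body.
    low_conn.send(hex(len(block))[2:].encode("utf-8"))
    low_conn.send(b"\r\n")
    low_conn.send(block)
    low_conn.send(b"\r\n")
low_conn.send(b"0\r\n\r\n")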
So be careful of the libraries you are using when operating on large files in Python, and prefer to explicitly use read() with a good block size when possible.
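If you find yourself writing that loop often, a small (hypothetical) helper makes the explicit-read style as convenient as the default iteration:

def iter_chunks(f, block_size=1024 * 1024):
    # Yield fixed-size chunks from an already-open binary file object
    # instead of b'\n'-delimited lines.
    while True:
        chunk = f.read(block_size)
        if not chunk:
            return
        yield chunk

with open("ubuntu.iso", "rb") as f:
    for chunk in iter_chunks(f):
        pass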
Surprisingly, reading in 4MB or 10MB chunks is actually slower than reading in 1MB chunks, but that is an investigation for another time.