00:00 In the previous lesson, I explained the vast difference in speed between the various parts of your computer, gave an overview of the different kinds of memory, and briefly touched on file I/O. In this lesson, I’ll show you how all those things come together when you use the mmap module to map file contents into memory blocks.
00:20 If you were to write a function that did edits to a file, that function would likely read the file into a Python object (a string, a byte array, or something similar), make changes to the object, then serialize that object back onto the disk.
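Here's a minimal sketch of that read, change, write-back pattern. The function name, the throwaway file, and its contents are made up for illustration and aren't from the lesson.

```python
import os
import tempfile


def edit_conventionally(path, old, new):
    """Read the whole file, change the in-memory copy, then write it back."""
    with open(path, "r", encoding="utf-8") as file_obj:
        text = file_obj.read()          # 1. load the file into a Python string
    text = text.replace(old, new)       # 2. edit the in-memory object
    with open(path, "w", encoding="utf-8") as file_obj:
        file_obj.write(text)            # 3. serialize the object back to disk


# Hypothetical demo on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as file_obj:
    file_obj.write("hello world\n")

edit_conventionally(demo_path, "world", "there")

with open(demo_path, encoding="utf-8") as file_obj:
    conventional_result = file_obj.read()
print(repr(conventional_result))
```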
00:46 Instead, the mmap module provides an alternative way. It reads the file into a block of memory, which is abstracted by an mmap object, then operates directly on that object, meaning both the memory representation and the disk representation change.
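As a rough sketch of what operating directly on the mapping can look like, the snippet below edits a throwaway file through an mmap object. The file and its contents are assumptions for illustration. Note that the slice being assigned has to be the same length as the slice it replaces, which is one of the restrictions you take on.

```python
import mmap
import os
import tempfile

# Hypothetical throwaway file for the demo.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as file_obj:
    file_obj.write(b"hello world\n")

# "r+b" opens the file for reading and writing without truncating it.
with open(demo_path, "r+b") as file_obj:
    with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_WRITE) as mm:
        mm[0:5] = b"HELLO"   # same-length slice assignment changes the mapping
        mm.flush()           # ask for the change to be written out to the file

# Read the file back the ordinary way to confirm the on-disk content changed.
with open(demo_path, "rb") as file_obj:
    mapped_result = file_obj.read()
print(mapped_result)
```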
01:05 This is both kind of simpler and kind of more complicated. You’ve got fewer steps happening, so you might get a performance gain, but you’re a little more restricted in what kinds of things you can do.
All of this is being done inside of a context manager (that’s the with statement) so that the file will automatically be closed upon exiting the context block. Now, into the REPL in the bottom window.
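The on-screen function isn't reproduced in this transcript, so here is a sketch of what a plain file read inside a context manager might look like. The name regular_io and the sample file are assumptions.

```python
import os
import tempfile


def regular_io(filename):
    """Read a whole text file and return its length in characters."""
    with open(filename, mode="r", encoding="utf-8") as file_obj:
        text = file_obj.read()
    # The with block has exited, so the file is already closed here.
    return len(text), file_obj.closed


# Hypothetical sample file standing in for the one used in the lesson.
sample_text = "In a village of La Mancha\n"
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as file_obj:
    file_obj.write(sample_text)

char_count, was_closed = regular_io(sample_path)
print(char_count, was_closed)
```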
02:10 and call it. So far, so good. Twenty-three million characters. Now let’s look at the mmap equivalent. New function up top. The first thing you’ll probably notice is there are two context managers here.
Like before, the file to be read is opened. That’s line 5. The new bit is where the file handle from the open file is used in a call to create an mmap object from the mmap module. Like files, this object has to be closed. So like files, it gets put in a context manager to make sure everything is cleaned up automatically. mmap doesn’t use a file handle.
02:52 It uses a file number, which you can get from the file handle itself. In addition to the file number, it also takes a size and an access flag. If you give a length of zero, like I did here, you get back a block of memory the same size as the file being mapped.
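A sketch of how those pieces might fit together, assuming a function named mmap_io and a small sample file (both hypothetical). The calls themselves, fileno(), length=0, and access=mmap.ACCESS_READ, are the standard library's.

```python
import mmap
import os
import tempfile


def mmap_io(filename):
    """Map a file into memory and return how many bytes the mapping holds."""
    with open(filename, mode="r", encoding="utf-8") as file_obj:
        # mmap wants the OS-level file number, not the Python file object.
        # length=0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            data = mmap_obj.read()   # copies the mapped bytes into a bytes object
    return len(data)


# Hypothetical sample file. Mapping an empty file raises an error, so it has content.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wb") as file_obj:
    file_obj.write(b"abc\n")

byte_count = mmap_io(sample_path)
print(byte_count)
```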
The access flag is similar to the mode indicator in opening a file. I’ll go into much more detail about this flag later. Inside of the mmap context block, I’m doing pretty much the same thing as I did in the
03:39 and called it. Pretty similar. You’ll notice the amount of data looks different. mmap objects represent bytes, not strings. Python strings are in Unicode, and that means they may take up more than a single byte for a character.
05:00 There are a bunch of variables impacting the outcome. First, you’ll get different performance based on file size. Second, you’ll get different performance on different hardware due to what kinds of caches you have.
05:14 Third is how your OS has implemented the mmap call. Depressingly for me, there is a known issue in the macOS mmap call that makes it significantly slower than the same call on Linux running on the same hardware.
05:27 A colleague of mine running the same code on Windows was consistently getting a tenfold improvement. Do note that what I’m doing here is just using mmap to read some data and stuff it into a Python object. Although this might get you a performance boost, it is still stuffing things into a Python object.
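If you want to see how your own machine behaves, a rough timing sketch like the one below can help. The function names, the one-megabyte test file, and the repeat counts are all assumptions, and the numbers you get depend on the file size, hardware, and OS factors just described. Both functions still copy the data into a Python bytes object.

```python
import mmap
import os
import tempfile
import timeit


def regular_io(filename):
    """Read the whole file with an ordinary read call."""
    with open(filename, mode="rb") as file_obj:
        return len(file_obj.read())


def mmap_io(filename):
    """Read the whole file through a read-only memory mapping."""
    with open(filename, mode="rb") as file_obj:
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            return len(mmap_obj.read())


# Hypothetical test file: one million bytes of filler.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(sample_path, "wb") as file_obj:
    file_obj.write(b"x" * 1_000_000)

# Take the best of three runs of 20 reads each to reduce noise.
regular_time = min(timeit.repeat(lambda: regular_io(sample_path), number=20, repeat=3))
mmap_time = min(timeit.repeat(lambda: mmap_io(sample_path), number=20, repeat=3))
print(f"regular: {regular_time:.4f}s  mmap: {mmap_time:.4f}s")
```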
05:56 In that little demo, I yada yadaed the whole characters and bytes thing. Let’s dig into it a bit more. The mmap call uses a byte array representation. That means it sees everything as the bytes that make up the block, regardless of what the data represents. In the case of a Unicode string, a single character may be more than a single byte.
06:18 That means you have to be careful how you read or write your data. The boundaries between characters might not be what you expect. If you’re dealing with text data that is pure ASCII, you can get away with a one character-one byte assumption, but otherwise you need to be careful. If you’d asked me before running the previous code, I would’ve sworn the Don Quixote file was pure ASCII.
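To make the boundary problem concrete, here is a small sketch using a made-up five-character word with one non-ASCII letter. Slicing the string always cuts between characters, but slicing the bytes can cut a character in half.

```python
text = "Se\u00f1or"               # "Señor": five characters
data = text.encode("utf-8")       # six bytes, because ñ needs two bytes in UTF-8

first_three_chars = text[:3]      # a string slice cuts between characters
first_three_bytes = data[:3]      # a bytes slice here ends in the middle of ñ

try:
    first_three_bytes.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    # The slice ended after the first of ñ's two bytes, so it can't be decoded.
    split_ok = False

print(len(text), len(data), first_three_chars, first_three_bytes, split_ok)
```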
06:42 But the character count didn’t match the byte count, so there’s something in there outside of the ASCII range—over seventy kilobytes of something, in this case. Let’s go back into the REPL and see how this can mess you up.
Okay. I ran it on monty.txt, which has 39 characters of content. The first character is N, the sixth character is y, and the whole string is Nobody expects the Spanish Inquisition. Watch out. I hear they tickle. Now for function number two.
Because this is a chunk of binary rather than a string, Python prints it using the byte notation, a quoted value with a b prefix. And you can see the newline at the end of it. That pesky newline?
08:32 Yeah, it was there before, but in the string version, it caused the gap between the output and the next REPL prompt. Subtle. You could easily miss that. Let’s try these two functions with some different data.
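The two functions from this demo aren't reproduced in the transcript. A sketch of what they might look like follows, with hypothetical names, and with monty.txt assumed to hold the quote followed by a single newline (which gives the 39 characters mentioned above).

```python
import mmap
import os
import tempfile


def peek_text(filename):
    """Return length, first character, sixth character, and full text."""
    with open(filename, mode="r", encoding="utf-8") as file_obj:
        text = file_obj.read()
    return len(text), text[0], text[5], text


def peek_mmap(filename):
    """Return length, first byte, sixth byte, and full content as bytes."""
    with open(filename, mode="rb") as file_obj:
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            # Indexing a mapping with an int returns an int, so one-byte
            # slices are used to get bytes objects back instead.
            return len(mmap_obj), mmap_obj[0:1], mmap_obj[5:6], mmap_obj[:]


# Assumed file content: the quote plus a trailing newline.
monty_path = os.path.join(tempfile.mkdtemp(), "monty.txt")
with open(monty_path, "wb") as file_obj:
    file_obj.write(b"Nobody expects the Spanish Inquisition\n")

text_result = peek_text(monty_path)
mmap_result = peek_mmap(monty_path)
print(text_result)
print(mmap_result)
```

Printing the tuples shows the trailing newline explicitly in both results, which is easier to spot than the blank line you get when printing the string on its own.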
snake.txt as a string, this time the length is 26. Remember that’s in characters. The first character is a cute little snake, the sixth is an e, and the whole thing is filled with emoji goodness.
f0 is just the first of those four. The sixth byte is an ASCII character though, so you can see the m. The string representation of this gets quite messy because the bytes are printed as, well, bytes instead of the ASCII equivalents.
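The contents of snake.txt aren't shown in the transcript, so the sketch below uses a short made-up string that starts with the snake emoji. It shows why the sixth character and the sixth byte differ: the emoji is one character but four bytes in UTF-8.

```python
# Hypothetical stand-in for the file contents: a snake emoji followed by ASCII text.
text = "\U0001F40D make way"
data = text.encode("utf-8")

print(len(text), len(data))   # character count versus byte count
print(text[0], text[5])       # first and sixth characters
print(data[0:1], data[5:6])   # first and sixth bytes
print(data[:4])               # the four bytes that make up the snake emoji
print(data)                   # non-ASCII bytes show up as \x escapes
```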