Doing mmap Operations
In the previous lesson, I gave you a rough introduction to using mmap and some of the consequences of interacting with files that way. In this lesson, I’ll dive deeper, covering most of mmap’s methods. Mentally, I think of the
mmap call as a function that returns a handle.
But I think that’s just my bias when it comes to things named with small letters, or it could be, I used to do this in C, which is the case there. It’s actually a class, and you’re constructing an
mmap object. There is something a little unusual about the
mmap object, though.
It has different
__init__() calls, depending on what operating system you’re using. There are some common values that work on all platforms, and if you’re worried about code compatibility, you should stick to these.
This is one of several kinds of flags that affect how the mapping is done. This is the only flag that is common to both Windows and Unix. The choices for this flag are:
ACCESS_READ, the block is to be read read-only.
ACCESS_WRITE, you can read and write from the block, and when you write to the block, the underlying file gets updated.
ACCESS_COPY, you can read and write from the block, but when you write to the block, you are only impacting the memory copy, not the corresponding file. And
On Windows, this is the same as
ACCESS_WRITE. On Unix, it is a reference to another flag, which I’ll touch on in a minute. The final argument is
offset. This defaults to
0, and if you set it higher, the memory block will start mapping that many bytes from the start of the file.
This allows you to construct objects with different parameters—say, different offsets—each with its own tag. You should avoid this if you want cross-OS compatible code. And as I mentioned before, the
ACCESS_DEFAULT flag means the same as
ACCESS_WRITE when you’re in Windows land.
There are other flags as well, but they aren’t common across all Unix variants. Details are available in the documentation. Like with tag names in Windows, you’re better off not using this if you can avoid it. The
prot flag indicates the protection mode.
If you’re using
ACCESS_DEFAULT on a Unix box, it behaves based on what flags you’ve given the
prot argument. As
prot defaults to read or write, this is similar to the Windows behavior unless you change the value of
You’ve seen the basics of mapping a file to a memory block and getting at its contents, but you can also search within the block. It even a search-from-the-right variant. You can use regular expressions. And you can do file-like operations: readings bytes or lines, writing bytes or lines, and managing the file pointer using
All right, the file is open and mapped. Let’s quickly review the arguments to the object being constructed. The first argument is the file number, which is achieved by calling
.fileno() on the handle returned by the open call. Setting
0 says to get a block the same size as the
.find() method searches the memory block for the bytes you give it. Remember everything is a byte array here, so you have to search for bytes, not a string. If I hadn’t used the
b prefix here, I would get an exception.
5 in response tells me that I found
"no" in the fifth position. That’s the piece that I have conveniently highlighted for you. Again, television magic, or whatever this is. As this is a byte array, it is zero-indexed. So
5 means the sixth position.
.tell() method indicates where the file pointer is currently positioned. The
.find() method doesn’t move the file pointer, so it is still at its default position at the front of the file. I can move the file pointer by using the
the finding starts at the file pointer position. The second instance of the
"no" are at position
48. Because the file pointer was to the right of the first instance, the second instance is what was found.
It indicates that
"parrot" is at position
38. Let’s take a look at that position. You can use the mmap object like the byte array that it represents, grabbing position
38 directly. A quick look at an ASCII table will tell you that
112 is the letter
07:44 You can also do slices. Notice that what is coming back is a binary value. It’s nice and readable because this binary contains ASCII codes, but it would be a bunch of hex values if you got out of the printable ASCII range.
08:09 By directly addressing this slice, I can overwrite the bytes. The parrot is now a dead magpie instead. Mm, pie. Anyhow, if you want more complex searching behavior, the mmap object can work with the
I find this a bit weird. I have to keep remembering I’m dealing with binary, but then here’s a handy thing that is usually for text situations.
.readline() also advances the file pointer, so you could loop on this and process your block this way if you wanted to.
As I’m not using a context manager, I now need to clean up after myself. mmap doesn’t guarantee to write immediately when you call its methods. It might get buffered. If you’re doing an active process where you want to make sure things get written, you call
.flush(), and when you’re done with the memory block, you close it and then the file as well. You don’t have to flush if you’re closing, like I did here. Closing will automatically flush any buffers.
If you want to go past the boundary, you would use the
.resize() method, or you would, if your OS supports it. The
.resize() call corresponds to a different underlying C method, which isn’t available in most operating systems. Linux implements it, though, in some cases.
11:28 You have to stay with the boundaries of the block that you loaded. You might have noticed that I didn’t do any mid-block cutting. Everything was an overwrite. You can’t just snip out part of the block.
12:02 If the kind of editing you’re doing sticks to straight overwrites, you’re going to get a performance boost through the use of mmap. The more you have to pull parts out and create copies, the more this advantage will diminish.
Become a Member to join the conversation.