The Anatomy of a File
00:21 But what is a file really? As a programmer, it might be interesting to think about what’s underneath all those different file formats. And the truth is that really all that a file is is a sequence of bits. Down here you see a sequence of bits.
00:47 You have 01001000, and that’s a binary number. With 8 bits, you can represent the numbers from 0 to 255 if you think of them in a decimal system. And somewhere in there in between, a couple of bits are switched on, two of them specifically here, and the rest are switched off.
01:07 Now I’ll give you a moment. You can pause this video and try to figure out what is this binary number if you wanted to represent it in a decimal system. And if you don’t care, you just want the solution, then just keep watching. Here it is.
01:20 This is the number 72 in the decimal system that we are much more familiar with dealing with. This number, 72, encodes for something in a certain encoding format, and text files often use an encoding format such as ASCII or UTF-8.
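As a rough sketch of that conversion, the snippet below adds up the place values of the bits that are switched on in 01001000 and then looks up the character for the result. The variable names are illustrative only and are not taken from the lesson.

```python
bits = "01001000"

# Each position is worth a power of two, counted from the right.
# Keep only the place values of the bits that are switched on.
place_values = [
    2 ** (len(bits) - 1 - index)
    for index, bit in enumerate(bits)
    if bit == "1"
]
number = sum(place_values)

print(place_values)  # [64, 8]
print(number)        # 72

# Python's built-in int() does the same conversion when given base 2.
print(int(bits, 2))  # 72

# In ASCII and UTF-8, the value 72 stands for an uppercase H.
print(chr(number))   # H
```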
01:49 And this is really what happens when you’re opening a file. So if you open a text file that includes this sequence of bits, then you will see a character. To illustrate this with an example, let’s head over to IDLE.
02:03 So I’ve written two small functions. We don’t have to care about what’s in there, but they’re just going to read a text file and then print out bits and bytes and the numbers, depending on which flags we’re passing.
02:35 And here’s your sequence of bits that makes up the content of this text file. Now it’s really hard to even know where a unit of meaning starts and ends here. And different systems can also treat this differently.
02:48 They don’t have to necessarily group it in eight bits to encode for some meaning, but this is a common way of doing it. So let’s print them out in the unit of bytes, and for that I have a different function.
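The two helper functions used in the video aren’t shown in this transcript, so the following is only a guess at what such helpers could look like. The names file_bits and file_byte_values are hypothetical, and the example writes its own small text file to a temporary directory so that it can run on its own.

```python
import tempfile
from pathlib import Path


def file_bits(path):
    """Return the file's content as one long string of 0s and 1s."""
    data = Path(path).read_bytes()
    # Format every byte as exactly eight binary digits.
    return "".join(f"{byte:08b}" for byte in data)


def file_byte_values(path):
    """Return the file's content as a list of decimal byte values."""
    # Iterating over a bytes object yields integers from 0 to 255.
    return list(Path(path).read_bytes())


# Create a small example file (an assumption, not the lesson's file).
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "hello.txt"
    example.write_text("Hello, World!", encoding="utf-8")
    bits = file_bits(example)
    byte_values = file_byte_values(example)

print(bits)         # one unbroken run of 104 bits
print(byte_values)  # the same data grouped into 13 bytes
```

Reading the file in binary mode with read_bytes() avoids any decoding step, so what you see is the raw content rather than Python’s interpretation of it.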
As you can see, the first one is the decimal number 72, and then there are a couple of other ones. In UTF-8 and ASCII, which are two different encodings, these numbers encode text characters. And I can show you now, finally, which characters are represented by this long sequence of bits: the text file encodes the string Hello, World!.
But if you think about this, every piece of data on a computer is stored as a sequence of bits, and how to interpret that sequence depends on the encoding and on what you do with the data that’s there. You don’t have to group the bits in units of eight, even though this is a common way of doing it. And this binary number does not have to encode the character H. In different file types, it encodes different things, but you as a higher-level programmer don’t really have to worry about these bits.
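To make the point about interpretation concrete, here is a minimal sketch that reads the same bytes in three different ways. The choice of a 4-byte big-endian integer and of UTF-16 as alternative interpretations is an assumption made for illustration and doesn’t come from the lesson.

```python
data = b"Hello, World!"

# Interpretation 1: UTF-8 text, one byte per character for this data.
text = data.decode("utf-8")
print(text)  # Hello, World!

# Interpretation 2: the first four bytes as a single big-endian integer.
as_int = int.from_bytes(data[:4], byteorder="big")
print(hex(as_int))  # 0x48656c6c

# Interpretation 3: the first two bytes grouped as one 16-bit unit.
# The byte 72 no longer means H here; it is half of another character.
as_utf16 = data[:2].decode("utf-16-be")
print(len(as_utf16))  # 1
```

The bits never change between the three cases. Only the rule for grouping and interpreting them does.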
04:46 Python has ways and libraries to handle movie data and image data and text data for you so that you generally don’t have to interact with it on this bits or bytes level. But it’s still helpful to know what’s going on underneath.
05:00 And the big takeaway here is really that there are many types of files, but under the hood, they all consist of a sequence of bits, and how they’re read depends on the program and the encoding that you’re working with. And that’s part one. In the next lesson, you’re going to look at the file system and figure out what a file system is and how it influences what a file is and how to read files.