The Anatomy of a File
00:00 Let’s talk a little bit about files. Since you’ve been using a computer, you’ve interacted with files. You probably know that there are lots of different types of files.
00:11 You’ve opened text files, watched movies from a movie file, and looked at pictures in a PNG or a JPEG format. And there are many other types of files as well.
00:21 But what is a file really? As a programmer, it might be interesting to think about what’s underneath all those different file formats. And the truth is that really all that a file is is a sequence of bits. Down here you see a sequence of bits.
00:36 Often, eight bits are grouped together into one unit of meaning, called a byte. And this is what you see down here. You see eight bits, which I’m reading from right to left.
00:47 Read that way, you have 00010010, so written in the usual order, with the most significant bit on the left, that’s the binary number 01001000. With eight bits, you have numbers from 0 to 255 if you think of them in a decimal system. And somewhere in between, a couple of bits are switched on (two of them, specifically) and the rest are switched off.
01:07 Now I’ll give you a moment. You can pause this video and try to figure out what is this binary number if you wanted to represent it in a decimal system. And if you don’t care, you just want the solution, then just keep watching. Here it is.
01:20 This is the number 72 in the decimal system, which we’re much more familiar with. This number, 72, encodes for something in a certain encoding format, and text files often use an encoding format such as ASCII or UTF-8.
01:37 And in these formats, 72 encodes for a character. So when your computer encounters this sequence in a file, the text program knows how to translate it to a character.
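The conversion described here can be sketched in a few lines of Python. This is only an illustration, not code from the lesson: the variable names are made up, and it assumes the byte is written with the most significant bit on the left, which is the order Python's int() expects.

```python
# The byte from the slide, written with the most significant bit first
# (assumption: this is the conventional order, the reverse of reading right to left).
bits = "01001000"

# Add up the place values of the bits that are switched on: 64 + 8.
by_hand = sum(2**i for i, bit in enumerate(reversed(bits)) if bit == "1")

# int() with base 2 performs the same conversion.
number = int(bits, 2)

# chr() looks up the character for that number.
# ASCII and UTF-8 agree on every value from 0 to 127.
character = chr(number)

print(by_hand, number, character)  # 72 72 H
```

Only two bits are switched on, so the sum has just two terms, 64 and 8.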
01:49 And this is really what happens when you’re opening a file. So if you open a text file that includes this sequence of bits, then you will see a character. Now, to illustrate this with an example, let’s head over to IDLE.
02:03 So I’ve written two small functions. We don’t have to care about what’s in there, but they’re just going to read a text file and then print out bits and bytes and the numbers, depending on which flags we’re passing.
02:15 So let me show you this …
And I’ll import it with an alias because it will be easier to type. So if I run print_bits(), then the function is going to read in a text file and show you the bits that it consists of.
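The functions used in the video aren't shown in this transcript, so the following is only a rough sketch of what reading a file as bits could look like. The helper name file_to_bits, the file name hello.txt, and the file's content are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def file_to_bits(path):
    """Return the file's content as one long string of 0s and 1s (hypothetical helper)."""
    data = Path(path).read_bytes()  # raw bytes, no text decoding
    # Format each byte as eight binary digits, keeping leading zeros.
    return "".join(f"{byte:08b}" for byte in data)


# Create a small text file to read back (assumed content, assumed file name).
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "hello.txt"
    path.write_text("Hello, World!", encoding="utf-8")
    bits = file_to_bits(path)

print(bits)
```

Reading with read_bytes() rather than read_text() matters here: it hands back the stored bytes before any encoding has been applied to them.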
02:35 And here’s your sequence of bits that makes up the content of this text file. Now it’s really hard to even know where does a unit of meaning start and end here. And different systems can also treat this differently.
02:48 They don’t have to necessarily group it in eight bits to encode for some meaning, but this is a common way of doing it. So let’s print them out in the unit of bytes, and for that I have a different function.
03:07 This function reads the same text file, but now it just groups the bits in there in units of eight, so now it’s a little easier to read, and you may remember this number from the slide before.
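Grouping a long run of bits into units of eight might look like the sketch below. It isn't the function from the video; the bit string is built from an assumed example text so that the block runs on its own.

```python
# Build one long bit string from some example bytes (assumed content).
bits = "".join(f"{byte:08b}" for byte in b"Hello, World!")

# Slice the string into consecutive groups of eight bits.
chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]

print(" ".join(chunks))
print(chunks[0])  # 01001000
```

The slicing step is the only place where the number eight appears, which makes it easy to see that the grouping is a convention layered on top of the bits.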
03:18 So this is the first character in this text file, and that’s the number 72. So let’s go ahead and print out the decimal representations of all of these bytes in there.
As you can see, the first one is the decimal number 72, and then there’s a couple of other ones. And these numbers in UTF-8 and ASCII (those are different encodings) encode for text character data.
03:54 And I can show you now, finally, what these characters are so that you know what the content is of this file that you’re reading here.
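Going from bytes to decimal numbers and then to characters can be sketched with Python's built-in bytes type. The bytes literal below stands in for the file's content and is an assumption, not the lesson's actual file.

```python
# Stand-in for the raw content of the text file (assumed).
data = b"Hello, World!"

# Iterating over a bytes object yields one integer from 0 to 255 per byte.
numbers = list(data)
print(numbers)  # [72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33]

# Decoding applies an encoding to turn those numbers into characters.
text = data.decode("utf-8")
print(text)  # Hello, World!

# For these particular bytes, ASCII gives the same result as UTF-8.
same_in_ascii = data.decode("ascii") == text
print(same_in_ascii)  # True
```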
04:02 And here it is! The content of this file, which is represented as this long sequence of bits, encodes as a text file for the string Hello, World!. But if you think about this, every piece of data on a computer is stored as a sequence of bits, and how to interpret that sequence of bits depends on an encoding and on what you do with the data that’s there. You don’t have to group them in units of eight, even though this is a common way of doing it. And this binary number does not have to encode for the character H. In different file types, it encodes for different things, but you, as a higher-level programmer, don’t really have to worry about these bits.
04:46 Python has ways and libraries to handle movie data and image data and text data for you so that you generally don’t have to interact with it on this bits or bytes level. But it’s still helpful to know what’s going on underneath.
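To make the point about interpretation concrete, here is a small sketch in which the same two bytes are read in several ways. The byte values are chosen purely for illustration.

```python
# Two bytes: 01001000 and 01101001.
raw = bytes([72, 105])

# Read as ASCII text, one character per byte.
as_text = raw.decode("ascii")

# Read as two separate 8-bit numbers.
as_two_numbers = list(raw)

# Read as a single 16-bit number, with either byte treated as the more significant one.
as_big = int.from_bytes(raw, byteorder="big")
as_little = int.from_bytes(raw, byteorder="little")

print(as_text)         # Hi
print(as_two_numbers)  # [72, 105]
print(as_big)          # 18537
print(as_little)       # 26952
```

Nothing about the stored bits changes between these lines. Only the rule used to read them does.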
05:00 And the big takeaway here is really that there are many types of files, but under the hood, they all consist of a sequence of bits, and how they’re read depends on the program and the encoding that you’re working with. And that’s part one. In the next lesson, you’re going to look at the file system: what it is, and how it influences what a file is and how you read files.