The Anatomy of a File

00:00 Let’s talk a little bit about files. Since you’ve been using a computer, you’ve interacted with files. You know, probably, that there’s lots of different types of files.

00:11 You’ve opened text files, watched movies from a movie file, and looked at pictures in a PNG or a JPEG format. And there’s many other types of files as well.

00:21 But what is a file really? As a programmer, it might be interesting to think about what’s underneath all those different file formats. And the truth is that really all that a file is is a sequence of bits. Down here you see a sequence of bits.

00:36 Often, eight bits are grouped together as one byte, and that byte is a unit of meaning. And this is what you see down here. You see eight bits, and their place values are read from right to left, starting with the lowest one.

00:47 Written out, you have 01001000, and that’s a binary number. With eight bits, you can represent the numbers from 0 to 255 if you think of them in a decimal system. And somewhere in between, a couple of bits are switched on (two of them, specifically, here) and the rest are switched off.

01:07 Now I’ll give you a moment. You can pause this video and try to figure out what this binary number is in the decimal system. And if you don’t care and you just want the solution, then just keep watching. Here it is.

01:20 This is the number 72 in the decimal system that we are much more familiar with dealing with. This number, 72, encodes for something in a certain encoding format, and text files often use an encoding format such as ASCII or UTF-8.
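If you want to check the conversion yourself, here is a minimal sketch of the arithmetic. It assumes the byte is written with the most significant bit on the left, and the helper name `bits_to_int` is made up for this illustration. Python's built-in `int()` and `format()` do the same job.

```python
def bits_to_int(bits):
    """Add up the place values (1, 2, 4, 8, ...) of every bit that is switched on."""
    total = 0
    # Walk the bits from right to left, so position 0 is the lowest place value.
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


byte = "01001000"
print(bits_to_int(byte))  # 64 + 8 = 72
print(int(byte, 2))       # the built-in conversion gives the same number
print(format(72, "08b"))  # and back again: 01001000
```

The two switched-on bits sit at positions 3 and 6, so the value is 8 + 64.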

01:37 And in these formats, 72 encodes for a character. So when your computer encounters this sequence in a file, the text program knows how to translate it to a character.

01:49 And this is really what happens when you’re opening a file. So if you open a text file that includes this sequence of bits, then you will see a character. Now, to illustrate this with an example, let’s head over to IDLE.

02:03 So I’ve written two small functions. We don’t have to care about what’s in there, but they’re just going to read a text file and then print out bits and bytes and the numbers, depending on which flags we’re passing.

02:15 So let me show you this …

02:22 And I’ll import it with an alias because it will be easier to type. So if I run pb() (print_bits), then the function is going to read in a text file and show you the bits that it consists of.

02:35 And here’s your sequence of bits that makes up the content of this text file. Now it’s really hard to even know where a unit of meaning starts and ends here. And different systems can also treat this differently.

02:48 They don’t have to necessarily group it in eight bits to encode for some meaning, but this is a common way of doing it. So let’s print them out in the unit of bytes, and for that I have a different function.

03:07 This function reads the same text file, but now it just groups the bits in there in units of eight, so now it’s a little easier to read, and you may remember this number from the slide before.
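Grouping a long run of bits into bytes is just slicing the string every eight characters. The sketch below is not the function from the video. The helper `group_bits` and the short sample bit string are assumptions made for illustration.

```python
def group_bits(bit_string, size=8):
    """Split a string of 0s and 1s into chunks of `size` bits."""
    return [bit_string[i:i + size] for i in range(0, len(bit_string), size)]


# A made-up sample of sixteen bits, which is two bytes.
bit_string = "0100100001101001"

groups = group_bits(bit_string)
print(groups)                       # ['01001000', '01101001']
print([int(g, 2) for g in groups])  # [72, 105]
```

Passing a different `size` shows how the same bits fall apart into different units when a system doesn't group them in eights.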

03:18 So this is the first character in this text file, and that’s the number 72. So let’s go ahead and print out the decimal representations of all of these bytes in there.

03:35 As you can see, the first one is the decimal number 72, and then there’s a couple of other ones. And in UTF-8 and ASCII, which are two different encodings, these numbers encode text character data. And I can show you now, finally, what these characters are

03:54 so that you know what’s the content of this file that you’re reading here.

04:02 And here it is! The content of this file

04:07 that is represented as this long sequence of bits encodes the string Hello, World! when you read it as a text file. But now think about this: every piece of data on a computer is stored as a sequence of bits, and how to interpret that sequence of bits depends on an encoding and on what you do with the data that’s there.
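Here is a small sketch of the three views side by side: the bits, the decimal number, and the character. It works on a bytes literal as a stand-in for the file contents, which is an assumption. The code in the video reads from a file instead.

```python
# Stand-in for the bytes that make up the text file.
data = b"Hello, World!"

# Iterating over a bytes object yields one integer (0 to 255) per byte.
rows = [(format(byte, "08b"), byte, chr(byte)) for byte in data]

for bits, number, character in rows:
    print(bits, f"{number:>3}", repr(character))

# Decoding all bytes at once turns them back into a string.
decoded = data.decode("utf-8")
print(decoded)
```

The first row shows 01001000, 72, and 'H', and the decoded result is the full string.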

04:29 You don’t have to group them in units of eight, even though this is a common way of doing it. And this number, this binary number, does not have to encode for the character H. In different file types, it encodes for different things, but you as a higher-level programmer, you don’t have to really worry about these bits.
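To make that point concrete, the sketch below reads the same two bytes in four different ways. The sample bytes are made up for this illustration. None of these readings is more correct than the others, and which one applies depends on the file format.

```python
# Two made-up bytes: 01001000 and 01101001.
data = bytes([72, 105])

as_text = data.decode("ascii")                               # two characters
as_big_endian = int.from_bytes(data, byteorder="big")        # one 16-bit number
as_little_endian = int.from_bytes(data, byteorder="little")  # same bytes, other order
as_numbers = list(data)                                      # two small numbers

print(as_text)           # Hi
print(as_big_endian)     # 72 * 256 + 105 = 18537
print(as_little_endian)  # 105 * 256 + 72 = 26952
print(as_numbers)        # [72, 105]
```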

04:46 Python has ways and libraries to handle movie data and image data and text data for you so that you generally don’t have to interact with it on this bits or bytes level. But it’s still helpful to know what’s going on underneath.
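For text data, that handling can be as short as the sketch below. It writes a throwaway file into a temporary folder (the file name is an assumption) and then reads it back twice, once as text and once as raw bytes.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "hello.txt"  # throwaway file for this example
    path.write_text("Hello, World!", encoding="utf-8")

    # Text mode: Python decodes the bytes into a string for you.
    as_text = path.read_text(encoding="utf-8")

    # Binary mode: you get the raw bytes without any decoding.
    as_bytes = path.read_bytes()

print(as_text)      # Hello, World!
print(as_bytes)     # b'Hello, World!'
print(as_bytes[0])  # 72
```

Most of the time the text-mode call is all you need, and the byte-level view stays hidden.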

05:00 And the big takeaway here is really that there are many types of files, but under the hood, they all consist of a sequence of bits, and how they’re read depends on the program and the encoding that you’re working with. And that’s part one. In the next lesson, you’re going to look at the file system and figure out what the file system is and how it influences what a file is and how to read files.

Dmitrii on June 14, 2023

Hi!

I was trying to import/pip install the bite module but could not find it. Could you please give a link to this module?

Thank you!

Dick de Goede on June 26, 2023

If you’d like to reproduce the bite exercise, you can create a helloworld.txt file with the string “Hello, World!” in it and, in the same directory/folder, a file bite.py with the code below:

def print_bits(sourcefile='helloworld.txt'):
    """Show the file contents as one continuous binary string"""
    # The with block closes the file even if reading fails
    with open(sourcefile) as file:
        content = file.read()
    # ord() gives the code point; for ASCII text that equals the byte value
    print(''.join(format(ord(i), '08b') for i in content))


def print_file_content(sourcefile='helloworld.txt', decimal=False, characters=False):
    """Show the contents one byte per row, optionally with decimal value and character"""
    with open(sourcefile) as file:
        content = file.read()

    for i in content:
        if not decimal:
            print(format(ord(i), '08b'))
        elif not characters:
            print(f'{ord(i):08b} {ord(i):>3d}')
        else:
            print(f'{ord(i):08b} {ord(i):>3d} {i:>5}')

Now you can use the examples to get the output via IDLE as in the video:

from bite import print_bits as pb
from bite import print_file_content as pfc
pb()
pfc()
pfc(decimal=True)
pfc(decimal=True, characters=True)

Note that I created this purely for this video, so it lacks checks and error handling. It is intended for you to play around with and extend to your liking. Hope this helps :)

Martin Breuss RP Team on June 26, 2023

@Dmitrii like @Dick de Goede mentioned, this is just some custom throwaway code that I wrote to demonstrate the point. I managed to unearth some version of that file, but it might not be quite the code I used in the recording 🤔 Anyway, here it is for some reference:

# bite.py

def print_file_content(ints=False, bytes=False):
    with open("hello.txt", "rb") as f:
        while (byte := f.read(1)):
            print('{0:08b}'.format(ord(byte)), end=" ")
            if ints and bytes:
                print(f"{ord(byte):3}", end="\t")
                print(byte)
            elif ints:
                print(f"{ord(byte):3}")
            elif bytes:
                print(byte)
            else:
                print()

def print_bytes():
    with open("hello.txt", "rb") as f:
        while (byte := f.read(1)):
            print('{0:08b}'.format(ord(byte)), end="")
        print()

@Dick de Goede nice job taking on the challenge and writing it yourself! 🙌

Dick de Goede on June 27, 2023

Thank you @Martin, I am not very experienced yet but it was nice to find out and now I can compare your code with mine and learn from that too :)
