Understanding Text and Binary Files
00:00 In this lesson, you’re going to understand what it means to open a file in text or binary mode in Python. Broadly speaking, files on our computer can contain either human-readable text or binary data designed for machines, even when they both represent the same piece of information.
00:19 Some examples of text files might include your Python source files, HTML files, or CSV data files exported from a spreadsheet program. To give you an idea of binary files, think of audio and video data, images, or executable machine code, none of which are text. These can be sound waves, pixels, or instructions for a computer processor. By the way, don’t confuse plain text files with rich-text documents such as those created in Microsoft Word, LibreOffice Writer, or Google Docs.
00:53 These can store additional text formatting data like font size, text alignment, bullet points, and sometimes visual elements like tables, charts, and so on.
01:03 Those elements don’t usually have a meaningful representation in text, as they take the form of numbers meant to be read by a computer program that knows how to display them. So even though what you’re looking at consists primarily of text, it is not considered a plain text file.
01:22 Here, over on the left, you have a sample text file stored in the comma-separated value format. It contains some personal expenses. When you import that file into the office software of your choice and save it as a spreadsheet, then you’ll end up with a binary file whose content under the surface might look similar to the one on the right.
01:42 These are numbers without any meaningful textual representation. When you try to open such a binary file in a text editor, then a few things can happen. First, your editor might recognize it’s dealing with a binary file.
01:56 It’ll just refuse to open it. Alternatively, it may try to map each number into a character, which will almost certainly result in a bunch of gibberish that doesn’t make any sense. Finally, your editor can display the values of the individual bytes, for example, using hexadecimal digits like here on the slide.
02:17 Note that from a technical point of view, there’s no real difference between text and binary files as they both consist of bytes representing some numbers.
02:26 It’s only a matter of how you and your software decide to interpret these numbers, which to some extent is arbitrary. However, this also means that you can get things wrong in binary mode unless you know the underlying file structure.
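To make the point about interpretation concrete, here is a minimal Python sketch. The four byte values are chosen purely for illustration (they happen to be the ASCII codes that spell the word cash), and the variable names are hypothetical.

```python
# The same four bytes, interpreted in three different ways.
raw = bytes([0x63, 0x61, 0x73, 0x68])

as_text = raw.decode("ascii")                          # as ASCII characters
as_numbers = list(raw)                                 # as four small integers
as_one_integer = int.from_bytes(raw, byteorder="big")  # as one 32-bit integer

print(as_text)         # cash
print(as_numbers)      # [99, 97, 115, 104]
print(as_one_integer)  # a single large number built from the same bytes
```

None of these readings is more correct than the others. Which one makes sense depends on what the file format says those bytes mean.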
02:40 Many commercial programs deliberately use proprietary binary file formats without disclosing their internal structure to lock you into a particular product. As a result, it becomes difficult, if not impossible, to open your files using unofficial software unless someone successfully reverse-engineers the file format at hand. If you zoom in on the word cash, for example, in the text file on the left, then you won’t see any numbers just yet.
03:08 It’s because your text editor conveniently replaces each number it finds in the file with a corresponding character before showing it to you.
03:18 However, you can reveal the file’s actual byte values using a command-line tool like hexdump. As the name implies, the tool dumps hexadecimal values of bytes in the given file.
03:31 So, for example, the first byte in the file has a hex value of 63, or 99 in decimal, which is the numeric code for the lowercase letter c in the ASCII encoding.
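If hexdump isn’t available on your system, a rough equivalent can be sketched in Python. This is only an illustration that uses the word cash rather than the full sample file, and the variable names are made up.

```python
# Encode the text to bytes, then show each byte as two hexadecimal digits.
data = "cash".encode("ascii")
hex_view = data.hex(" ")  # the separator argument requires Python 3.8+

print(hex_view)  # 63 61 73 68
print(data[0])   # 99, the decimal value of the first byte
```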
03:43 ASCII stands for American Standard Code for Information Interchange, and it’s by far the most common character encoding system used for English text documents. It’s also one of the oldest, but it’s not the only character encoding in use today.
03:56 You’ll learn more about character encodings in the next lesson.
04:01 Note that you can use Python’s ord() and chr() built-in functions to double-check if this number-character relationship holds. ord() returns the character’s ordinal value, while chr() returns the corresponding character.
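A quick check of that relationship might look like the following. The variable names are only for illustration.

```python
# ord() maps a character to its numeric code; chr() maps the code back.
code = ord("c")
print(code)       # 99
print(hex(code))  # 0x63
print(chr(99))    # c

# Converting every character to a number and back returns the original text.
round_trip = "".join(chr(ord(character)) for character in "cash")
print(round_trip)  # cash
```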
04:18 When you open a file in Python, either with the built-in open() function or the Path.open() method, you have the choice of specifying whether you want Python to treat the file as a stream of human-readable characters or generic bytes. In other words, you can read the same file using either text or binary mode in Python.
04:37 In the text mode, Python will automatically take care of translating the sequences of bytes into meaningful characters wrapped in Python string objects, and it will let you read the text line by line, which, although possible, doesn’t make much sense in binary mode.
04:53 On the other hand, binary mode lets you read the raw bytes as integers from the file without any translation. This can be convenient if you want to manipulate the bytes directly, for example, when processing an image. Now, how do you specify in Python which mode to open the file in?
05:12 By default, if you don’t pass any arguments to .open(), Python will open the file in text mode for reading. You can verify the file mode by inspecting the returned file object’s .mode attribute, and you can find out if it’s readable or writable by calling the corresponding methods. When you execute this code, you’ll see the letter r, which stands for reading, appear in the output.
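The on-screen code isn’t part of this transcript, but a minimal sketch of such a check could look like this. The file name expenses.csv and its contents are assumptions, and a temporary directory is used so the example runs anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "expenses.csv"  # hypothetical sample file
    path.write_text("cash,42\n", encoding="ascii")

    # No arguments: Python falls back to its default mode.
    with path.open() as file:
        mode = file.mode
        can_read = file.readable()
        can_write = file.writable()

print(mode)       # r
print(can_read)   # True
print(can_write)  # False
```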
05:36 It is the default value for the mode argument, which you can set explicitly when calling the .open() method or function. When the mode argument doesn’t say otherwise, the file will be opened in text mode. Although text mode is assumed implicitly, you can include the extra letter code "t" to indicate the text mode more explicitly if you really want to. However, because the letter "t" is implied, you can leave it out, and you’ll almost never use it in practice.
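As a small illustration that the letter "t" changes nothing, the sketch below reads the same file with "r" and with "rt". The file name and contents are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "expenses.csv"  # hypothetical sample file
    path.write_text("cash,42\n", encoding="ascii")

    # Text mode is implied by "r" alone...
    with path.open("r", encoding="ascii") as file:
        implicit = file.read()

    # ...and spelled out by "rt".
    with path.open("rt", encoding="ascii") as file:
        explicit = file.read()

print(implicit == explicit)  # True
print(type(explicit))        # <class 'str'>
```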
06:04 To open your file in binary mode, you must replace the letter "t" with a letter "b", as in the word Barbara.
06:12 Note that you can’t have both text and binary modes set at the same time because they’re mutually exclusive. You’ll learn about a few other letter codes for the available file modes in Python and when to use them in an upcoming lesson. Also, from now on, you’ll only be considering text files in this course, so you won’t have to worry about the binary mode anymore.
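To see the two modes side by side, here is a hedged sketch that reads one file both ways and then tries to combine the two letters. The file name and contents are assumptions made for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "expenses.csv"  # hypothetical sample file
    path.write_bytes(b"cash,42\ncard,17\n")

    # Text mode: bytes are decoded into strings, one per line.
    with path.open("r", encoding="ascii") as file:
        lines = file.readlines()

    # Binary mode: you get the raw bytes with no translation.
    with path.open("rb") as file:
        raw = file.read()

    # Asking for text and binary at once is rejected before any file is opened.
    message = None
    try:
        path.open("rtb")
    except ValueError as error:
        message = str(error)

print(lines)    # ['cash,42\n', 'card,17\n']
print(raw[0])   # 99, because indexing a bytes object yields an integer
print(message)  # the explanation carried by the ValueError
```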
06:36 Now, you might face a few problems that are only relevant to opening files in text mode. They’ll manifest themselves when you rely on the defaults provided by Python.
06:46 These default values can be different for different people depending on their operating system. Specifically, the two parameters that can cause problems are the file’s character encoding and line ending. Python will make a best guess when you don’t specify them, but it’s generally recommended to set them by hand.
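As a brief preview, setting both parameters by hand might look like the sketch below. The choice of UTF-8 and of "\n" as the line ending is an assumption made for this example, not a recommendation taken from the lesson, and the file name and contents are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "expenses.csv"  # hypothetical sample file

    # Spell out both settings instead of relying on platform defaults.
    with path.open("w", encoding="utf-8", newline="\n") as file:
        file.write("café,3.50\n")

    # The bytes on disk are now the same on every operating system.
    stored = path.read_bytes()

    # newline="" turns off line-ending translation when reading.
    with path.open("r", encoding="utf-8", newline="") as file:
        text = file.read()

print(stored)  # b'caf\xc3\xa9,3.50\n'
```

Because nothing is left to the defaults, the same code produces the same file regardless of who runs it.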
07:07 Now that you understand the concept of text files a little better, you’re ready to dive into character encoding and learn why and how to specify one in Python.