Examining File Contents
Let’s take a look at examining file contents. One common problem that you may face is the encoding of the byte data. An encoding is a translation from byte data to human readable characters.
This is typically done by assigning a numerical value to represent each character. The two most common character sets are ASCII and Unicode. ASCII can only store 128 characters, while Unicode has room for up to 1,114,112 code points.
ASCII is a subset of Unicode, and in the UTF-8 encoding the 128 ASCII characters are stored as the same byte values they have in ASCII. It’s important to note that parsing a file with the incorrect character encoding can lead to failures or misrepresentation of the characters.
For example, if a file was created using the UTF-8 encoding, and you try to parse it using the ASCII encoding, if there is a character that is outside of those 128 values, then an error will be thrown.
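The relationship between ASCII and UTF-8 described above can be checked in a few lines. This is a minimal sketch; the sample strings and variable names are chosen for illustration only.

```python
# ASCII characters share their numeric values with Unicode, and UTF-8
# stores them as the same single bytes that ASCII uses.
text = "Python"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# "\u00e9" is the letter e with an acute accent, which lies outside
# ASCII's 128 values, so UTF-8 needs two bytes to store it.
e_acute = "\u00e9"
code_point = ord(e_acute)
encoded = e_acute.encode("utf-8")

# Decoding those bytes as ASCII fails, because both are above 127.
try:
    encoded.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(same_bytes, code_point, encoded, decode_failed)
```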
00:01
Examining the contents of some file types, the first of which is the text file, represented by .txt. It contains a sequence of lines of electronic text. It’s widely supported and easy to use in Python.
00:16
As you’ve already seen earlier on, line endings can be different on different operating systems. Windows systems tend to use two characters, \r\n, whereas macOS and Unix tend to just use \n, the newline character. But let’s look at how that can cause some confusion.
00:36
So here we have a Windows system, and you can see that the program here is creating text which has \r\n, a carriage return and a newline, whereas the other program is creating lines which solely have newlines (\n).
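The two programs are not reproduced in the transcript, so the following is only a sketch of the idea, with hypothetical file names written to a temporary folder. It passes newline="" so that Python writes the characters exactly as given. Without it, text mode on Windows turns every \n into \r\n on write, so a string that already contains \r\n ends up on disk as \r\r\n, which is one way extra blank lines can appear in some editors.

```python
import os
import tempfile

lines = ["first line", "second line"]

with tempfile.TemporaryDirectory() as folder:
    windows_path = os.path.join(folder, "windows.txt")
    unix_path = os.path.join(folder, "unix.txt")

    # newline="" disables newline translation on write, so the bytes
    # on disk are exactly the characters passed to write().
    with open(windows_path, "w", newline="") as file:
        file.write("\r\n".join(lines) + "\r\n")
    with open(unix_path, "w", newline="") as file:
        file.write("\n".join(lines) + "\n")

    # Binary mode shows the real difference between the two files.
    with open(windows_path, "rb") as file:
        windows_bytes = file.read()
    with open(unix_path, "rb") as file:
        unix_bytes = file.read()

    # Default text mode translates \r\n to \n when reading, so both
    # files look the same to Python.
    with open(windows_path) as file:
        windows_text = file.read()
    with open(unix_path) as file:
        unix_text = file.read()

print(windows_bytes)
print(unix_bytes)
print(windows_text == unix_text)
```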
00:52 Now, running both of these will create two text files, and opening those up in a text editor, your mileage will vary. Here it’s open in Atom, and we can see Atom has dealt with this difference in lines and they read it identically. But if we look at them in Notepad, we can see we have extra blank lines on one of the files and not the other.
01:15 This is something you need to be aware of if you’re writing files which will be used on a different system. Character encodings. Characters can be encoded in files in multiple different ways.
01:26 Two common methods are ASCII and Unicode. ASCII only represents 128 characters, as seen onscreen. It’s common in simpler, older systems. Unicode is much more expansive, giving over a million possibilities for characters with many more characters, as seen onscreen.
01:47 It’s a modern, global character set. Python 3 defaults to using Unicode, so you shouldn’t have problems generating and reading any of these characters. However, if you’re importing files, you may find that the encoding does not match the default.
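To see which default a particular machine uses, the standard library can report it. This is a small sketch; the variable names are illustrative, and the value of the second one differs between platforms.

```python
import locale
import sys

# Python 3 strings are Unicode; this reports the codec used for
# implicit str/bytes conversions, which is "utf-8" on Python 3.
string_encoding = sys.getdefaultencoding()

# open() without an encoding argument falls back to a
# platform-dependent default, which this call reports
# (for example "UTF-8" or "cp1252").
file_encoding = locale.getpreferredencoding(False)

print("str encoding:", string_encoding)
print("default file encoding:", file_encoding)
```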
02:06
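Here you can see a text file called 'uni.txt' is being generated, and the encoding is being explicitly stated as 'utf-8'. Python 3 strings are Unicode, but the default encoding that open() uses for files depends on the platform, so stating it explicitly is the safer choice.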
02:19 The content has some characters which would not work in the ASCII encoding, from a variety of different alphabets around the world. However, this code works perfectly in a modern Python 3 system.
02:34
We can execute it, it generates that file, and if we open it up in a text editor, you can see all of those characters are represented. However, if we try and change the encoding to 'ascii', save the file, and then run it, we have a problem. ASCII can’t encode those characters, and we get a UnicodeEncodeError.
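The on-screen script isn’t included in the transcript, so here is a hedged reconstruction of the idea. The sample characters and the temporary folder are assumptions made for the sketch; only the file name 'uni.txt' and the two encodings come from the lesson.

```python
import os
import tempfile

# Greek capital omega, Cyrillic zhe, and two Chinese characters,
# written as escapes so the example does not depend on how this
# source file itself is saved.
content = "\u03a9 \u0436 \u4f60\u597d"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "uni.txt")

    # Writing with UTF-8 works for every Unicode character.
    with open(path, "w", encoding="utf-8") as file:
        file.write(content)
    with open(path, "rb") as file:
        raw = file.read()

    # Writing the same text with ASCII fails, because none of these
    # characters are among ASCII's 128 values.
    try:
        with open(path, "w", encoding="ascii") as file:
            file.write(content)
        ascii_error = None
    except UnicodeEncodeError as error:
        ascii_error = error

print(raw)
print(type(ascii_error).__name__)
```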
03:04
The flip side of that is decoding. Here we will try and open that 'uni.txt' file, and we’re explicitly stating the encoding as 'utf-8' to read those Unicode text characters.
03:27
We can see Python doesn’t have a problem in reading those and printing them out. However, again, if we change our encoding to 'ascii', ASCII isn’t capable of reading those characters,
03:43
and running it again generates a UnicodeDecodeError, as it can’t decode the bytes within the file. While this is clearly a staged example, it is not uncommon to download files from the internet which are not in Python’s default encoding, and some detective work may be needed to ensure that the file can be read.
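One possible form of that detective work is to try a short list of candidate encodings in turn. The helper below is hypothetical and not part of the lesson, and the candidate list and the sample file 'legacy.txt' are assumptions. A decode that succeeds does not prove the encoding is the right one, so the result still needs a visual check.

```python
import os
import tempfile


def read_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes the file."""
    with open(path, "rb") as file:
        raw = file.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "legacy.txt")

    # Simulate a downloaded file saved in a Windows code page.
    # "\u00e9" is the letter e with an acute accent.
    with open(path, "w", encoding="cp1252") as file:
        file.write("caf\u00e9")

    text, used = read_with_fallback(path)

print(text, used)
```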
04:06
Next, you’ll see CSV files. CSV, comma-separated values, is a text file format that uses a comma (,) to separate the values which are contained within it.
04:15
It’s often used to exchange data between applications, and a wide range of data sets are available in CSV on the internet. CSVs are easy to use in Python, whether directly as we’ll see here or using a module such as pandas.
04:30
Now you’ll see the creation of a Python script which will open up a CSV file and step through each of the lines in it. As you’ve seen many times already, with open('example.csv') as file: and then the content will be file.readlines(), so they will all be present in there.
04:48
We’re now going to iterate through that content using a for loop, for line in content: and we’re going to print each line. To stop it being printed all at once and whizzing past us with thousands of lines, we’re going to use a line where it says if input() == 'x':, so if you type an x on the keyboard and hit Enter then it will break, otherwise you need to hit Enter to see the next line.
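Put together, the script described above might look like the sketch below. The function name step_through and its ask parameter are not from the lesson. They are added here as assumptions so the loop can be reused for other files and driven without a keyboard.

```python
def step_through(path, mode="r", ask=input):
    """Print a file one line at a time and return how many lines were shown.

    After each line, ask() is called. Answering "x" stops the loop, and
    anything else (such as just pressing Enter) moves to the next line.
    """
    with open(path, mode) as file:
        content = file.readlines()

    shown = 0
    for line in content:
        print(line)
        shown += 1
        if ask() == "x":
            break
    return shown
```

Calling step_through('example.csv') behaves like the script in the lesson: press Enter for the next line, or type x and press Enter to quit.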
05:14 So now if we run this script…
05:23
we see the first line of our CSV file, where the headings are street, city, zip, state, beds, et cetera. So these would be the columns if we were to import this into a spreadsheet.
05:33
Here’s the next line, with an address, city, zip code, state, et cetera, all of that information. And each time we hit Enter, you’ll see a new line. Hitting x quits back to the prompt.
05:48
Next, Graphics Interchange Format, known as .gif files, is a bitmapped graphics format which has been around for a very long time. It’s compressed, and it’s best handled using a library such as Pillow.
06:03 However, you’re going to see what’s inside the file using the skills you’ve already gained on this course.
06:12
This is going to work in a very similar way to the CSV reader. We’re going to open up 'cat.gif', which was downloaded from Wikipedia,
06:23
and then read the content of that, all of those lines, into the content variable and then iterate through that content variable. for line in content: and print each line out, and then use that input() statement to allow us to continue or to exit and to read each line. So again, running the script, and what do we get?
06:50
Ah. We get a UnicodeDecodeError, and the reason is that this is a binary format, so we need to open it in a different mode. So rather than just 'r', as seen here, we’re going to open it up in read binary, 'rb', mode. That will allow us to access the contents of the file as binary.
07:11
Running the script again, there we see our first line. And you can see, we have that GIF87, which is the identifier, and also NETSCAPE. Anyone remember Netscape?
07:28 Looking through the rest of the lines, we can see we can’t make much sense of it, but this is the actual content of that file. We can see the line length varies depending on what’s happening, but we can quit out of that now.
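The line lengths vary because readlines() in binary mode simply splits wherever a newline byte happens to occur in the data. Reading a fixed number of bytes is often more useful for a binary file. The sketch below reads the six-byte signature that every GIF file starts with, either GIF87a or GIF89a. The helper name and the stand-in file are assumptions, as the lesson’s 'cat.gif' is not available here.

```python
import os
import tempfile


def gif_signature(path):
    """Return "GIF87a" or "GIF89a" if the file starts with one, else None."""
    with open(path, "rb") as file:
        header = file.read(6)  # exactly six bytes, no decoding involved
    if header in (b"GIF87a", b"GIF89a"):
        return header.decode("ascii")
    return None


with tempfile.TemporaryDirectory() as folder:
    # Stand-in data: a GIF signature followed by a few arbitrary bytes.
    # This is not a complete, viewable image.
    fake_gif = os.path.join(folder, "fake.gif")
    with open(fake_gif, "wb") as file:
        file.write(b"GIF89a" + bytes([1, 0, 1, 0, 0, 0, 0]))

    not_gif = os.path.join(folder, "notes.txt")
    with open(not_gif, "wb") as file:
        file.write(b"just some text")

    found = gif_signature(fake_gif)
    missing = gif_signature(not_gif)

print(found, missing)
```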
07:42
And the final format you’re going to take a look at is MP3. MPEG-1 Audio Layer 3 is a compressed audio format which is extremely widespread. It’s made up of MP3 frames and sometimes an ID3 header at the beginning with information about the file. Now to play it, it would be best handled with something such as pydub or pygame, but here we’re going to open it up in the same way and have a look at the actual data which is in the file. You’ll see it’s very similar to the graphics example, opening 'audio.mp3' in read binary mode ('rb'), putting all of the contents of the file into that contents variable, and then iterating through that. for line in contents: printing the line, and then using the input() to allow us to look at that. Hit Enter to carry on or x if we want to quit.
08:32
Running the script allows us to see the first line of the MP3 file. We can see we get ID3, which shows us that it may have some ID information about what’s in the file.
08:47
Let’s look at the next line. Here we can see some information. The file has come from the YouTube Audio Library. It’s got a name, Impact Moderato, and the writer. After that, we see this compressed audio data, which isn’t any use to us, maybe, at this point, but it is useful to know that you can access it at this low level with just a few simple lines of Python.
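As an illustration of that low-level access, the ID3 marker seen above can be read directly. An ID3v2 tag begins with a ten-byte header: the letters ID3, two version bytes, one flags byte, and four size bytes that each carry seven bits. The function below is a hypothetical sketch that parses only this header, and the sample bytes are made up rather than taken from the lesson’s 'audio.mp3'.

```python
def parse_id3_header(header):
    """Parse the 10-byte ID3v2 header, or return None if it is absent."""
    if len(header) < 10 or header[:3] != b"ID3":
        return None

    major, revision = header[3], header[4]

    # The tag size is stored in four bytes that use only their lower
    # seven bits, so the bytes are combined seven bits at a time.
    size = 0
    for byte in header[6:10]:
        size = (size << 7) | (byte & 0x7F)

    return {"version": f"2.{major}.{revision}", "tag_size": size}


# Made-up header: version 2.3.0, no flags, size bytes 0, 0, 2, 1.
sample = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
info = parse_id3_header(sample)
print(info)
```

With a real file, the header could be read with open('audio.mp3', 'rb') followed by file.read(10) and passed to the same function.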
Darren Jones RP Team on July 16, 2019
Hi Abby. Do you have more info to share on your I/O error - the code you’re using or the error? The more info you give, the more chance of getting to the bottom of the problem.
Deepak on July 21, 2019
Did the input() in the file reading change the for loop to expect you to hit Enter to display the next line?
Darren Jones RP Team on July 23, 2019
input() in the loop was present just to allow the user to press ENTER to continue or X + ENTER to exit the loop - just a simple way of making it interactive.
Anupam Anand on May 30, 2021
Hey, could you please explain why I have to use this because I am getting error:
import codecs
with codecs.open('sample.csv','r', encoding='utf-8',errors='ignore') as file:
Is this a correct way or I am doing something wrong?
Anupam Anand on May 30, 2021
Please ignore the previous comment.
Abby Jones on July 10, 2019
I am getting an I/O error on the closed audio.mp3 file. Any ideas?