Flat CSV Files
00:00 In the previous lesson, I gave an overview of the course. In this lesson, I’ll talk about CSV files as an example of flat file storage.
00:10 Your software stores data in memory through the use of variables, but like life, this is fleeting: exit the program or turn off your computer, and the memory is gone.
00:19 This can be addressed through saving things out to some sort of storage, usually your disk drive. A side benefit of this is if it’s done well, it can be used to interact with more data than can be held in memory at any one time.
Flat file storage is a generic term for text-based files that typically can be read by a person. There are many different formats. Some common ones are CSV, the one which I’ll be using shortly, JSON, and the granddaddy of angle brackets, XML. Let’s go look at a program that uses Python’s
csv module to see the kinds of things that can be done with a flat file format.
01:03 And here it is: fourteen lines of data. This is a listing of books. CSV stands for comma-separated values, and looking at this data, you can probably see where it gets its name.
01:14 It’s a list of values separated by commas. Clever, huh? A lot of CSV files have a header line by convention. The header line gives the name of each column of data.
01:26 I’ve highlighted it here. After the header row, each row here is a book with the author’s first and last names followed by the book’s title and then the name of the publisher.
01:37 In the case of special characters, for example, wanting to have a comma, you typically surround the value in quotes. CSV is a loosely defined thing. You will run into variations on how to use it, but the format here is what Excel uses and being one of the most popular programs on the planet weighs pretty heavily.
Here’s some code that reads in that same CSV file and prints out some summary information. It does this through the use of the
csv module, in particular the
DictReader takes a file, which it expects to have a header row, and then allows you to iterate over that using a reader. The reader will contain a series of dictionaries.
The keys in each dictionary are the column header name, while the value is the corresponding value from the row. I’m using a
defaultdict here to store some summary information.
defaultdict is created with an integer, which means you can assume that a key exists, and if it didn’t, the
defaultdict will create one automatically for you in the form of an
int. I loop through each row of data, adding the author’s name and publisher’s name into the respective dictionaries to form a count of occurrences. After that, I print out a summary of how many instances of each author name and of each publisher. Let’s go run this.
03:14 There you go: a count for each author and a count for each publisher.
03:22 CSV has some limitations. For example, data is repeated. There are four books by Stephen King, and each get a row. That means Stephen’s name shows up four separate times.
03:34 This is because everything is from the rows’ perspective, which in this case is the book. If I want to add the age of an author, technically I can do it, but I’d have to store it on each of the four instances of King’s books.
03:48 If I need to make a change, I’d have to do that in all four places. Essentially, any data associated with the book is just that—it’s an attribute of the book. I can’t really interrelate different items easily.
04:03 For example, Richard Bachman is an alias of Stephen King’s, or detecting It has one author but two publishers—it’s messy. To detect any of these things, you’d have to write some custom code.
04:18 You probably know where this is leading. Relational databases to the rescue! They solve all these problems. Next up, I’ll show you how.
Become a Member to join the conversation.