
Removing Duplicates

00:00 In the previous lesson, I showed you how to reconcile two datasets. In this lesson, I’ll show you how to deal with duplicates in your data. I brushed over an important fact in the previous lesson.

00:10 Both the issued and cashed check CSV files have a column named Amount. If you were paying close attention in the last lesson, you might have noticed that the resulting combined data had these columns renamed.

00:23 Off to the REPL to look at just what NumPy does. Onscreen is the combined dataset resulting from joining the issued and cashed check CSV files from the previous lesson.
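As a reminder, the combined array came from a call shaped something like the one below. The variable names issued and cashed and the key name "id" are assumptions; your column names may differ.

>>> import numpy as np
>>> from numpy.lib import recfunctions as rfn
>>> # Join the two structured arrays on their shared key column.
>>> combined = rfn.rec_join("id", issued, cashed, jointype="inner")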

00:35 At the end of that lesson, I sliced out a couple of columns. What happens if I do that with the amount?

00:50 Well, you get an error because there’s no amount field in the combined data. When I printed the combined array out in the REPL above, it included the data type information, but you can also ask for that directly.

01:03 This is the structural information. Note that amount from issued became amount1, and amount from cashed became amount2. For our data, these two columns were identical, so it doesn’t matter which you use, but NumPy doesn’t want to assume that. If I want to include the amount information in a slice of the combined data, I simply use amount1 instead.
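In the REPL, the inspection and the corrected slice look roughly like this, continuing with the assumed names from the sketch above:

>>> combined.dtype          # the renamed amount1 and amount2 fields show up here
>>> combined["amount1"]     # use the issued-side amounts in a slice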

01:36 Instead of slicing the result of rec_join() to get a subset of the data, you can also loop through its contents. Say, for example, you wanted to find all the uncashed checks.

01:58 This list comprehension looks for all the IDs in issued, but not in cashed.

02:05 The result is a list of the outstanding IDs. Note that the values are NumPy 64-bit integers. You can cast them if you want native Python data types instead.
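Here’s a sketch of that comprehension and the cast, again assuming both arrays key their checks on an "id" field:

>>> # IDs that were issued but never cashed
>>> outstanding = [
...     check_id
...     for check_id in issued["id"]
...     if check_id not in cashed["id"]
... ]
>>> # Cast to native Python ints if you prefer
>>> [int(check_id) for check_id in outstanding]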

02:22 NumPy’s speed comes from doing as much work as possible inside its own vectorized mechanisms. You can always loop through data like this, but it will typically be slower than using a NumPy-specific method. In addition to duplicate columns, you might have duplicate row data as well.

02:40 You can find this kind of data using the find_duplicates() function. One complication is that this function expects a masked array.

02:49 I briefly mentioned these when explaining filtering rows earlier. A masked array is a NumPy array with extra metadata that indicates whether a row is participating in a calculation or not.
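Here’s a tiny illustration of the idea, separate from the checks data. A masked entry simply drops out of any calculation:

>>> import numpy as np
>>> data = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
>>> data.mean()   # the masked middle value is ignored
2.0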

03:01 You can convert your regular array into a masked one by passing it to the asarray() function in NumPy’s ma (masked array) module. If you don’t want to find duplicates but just want to remove them, you can use unique() instead of find_duplicates(). I’ve got another CSV file to play with.

03:20 This one is called issued_dupe.csv. I created it by copying issued_checks and duplicating one of the rows. Let’s head into the REPL and clean out our dupe data.

03:32 Those are our usual imports and a list of data type tuples for our CSV file. Now, I’ll load it.
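The loading step looks something like this. The field names and types in dtypes are assumptions about the file’s layout:

>>> import numpy as np
>>> from numpy.lib import recfunctions as rfn
>>> dtypes = [
...     ("id", "i8"), ("payee", "U128"),
...     ("amount", "f8"), ("date", "U10"),
... ]
>>> issued_dupe = np.genfromtxt(
...     "issued_dupe.csv", delimiter=",",
...     dtype=dtypes, skip_header=1,
... )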

03:56 And there it is. Note that ID 1344 is in there twice. To find the dupes, I use the find_duplicates() call, passing in a masked version of our array.
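That call, sketched with the same assumed names:

>>> # find_duplicates() needs a masked array, so convert first
>>> rfn.find_duplicates(np.ma.asarray(issued_dupe))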

04:11 The result is a new masked array that shows the duplicate data. Instead, if I want to get rid of the duplicates, I can call unique(),

04:25 and this is the result with only a single row containing ID 1344.
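Assuming that’s NumPy’s top-level unique() function, the step looks like this; unique() works directly on structured arrays and returns the deduplicated rows in sorted order:

>>> np.unique(issued_dupe)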

04:33 That’s all for the second example. Next up, I’ll start the third example, charting hierarchical data.
