Removing Duplicates
00:00 In the previous lesson, I showed you how to reconcile two datasets. In this lesson, I’ll show you how to deal with duplicates in your data. I brushed over an important fact in the previous lesson.
00:10 Both the issued and cashed check CSV files have a column named Amount. If you were paying close attention in the last lesson, you might have noticed that the resulting combined data had these columns renamed.
00:23 Off to the REPL to look at just what NumPy does. Onscreen is the combined dataset resulting from joining the issued and cashed check CSV files from the previous lesson.
00:35 At the end of that lesson, I sliced out a couple of columns. What happens if I do that with the amount column?
00:50 Well, you get an error because there’s no amount field in the combined data. When I printed the combined array out in the REPL above, it included the data type information, but you can also ask for that directly.
01:03 This is the structural information. Note that amount from issued became amount1, and amount from cashed became amount2. For our data, these two columns were identical, so it doesn't matter which you use, but NumPy doesn't want to assume that. If I want to include the amount information in a slice of the combined data, I simply use amount1 instead.
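Here's a minimal sketch of that renaming behavior. The field names and sample values are my assumptions, not the lesson's actual data; the real dtypes come from the two CSV files:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-ins for the issued and cashed check data
issued = np.array(
    [(1341, 150.0), (1344, 200.0)],
    dtype=[("id", np.int64), ("amount", np.float64)],
)
cashed = np.array(
    [(1341, 150.0)],
    dtype=[("id", np.int64), ("amount", np.float64)],
)

# rec_join() renames the overlapping "amount" fields with 1/2 postfixes
combined = rfn.rec_join("id", issued, cashed)
print(combined.dtype.names)  # ('id', 'amount1', 'amount2')
print(combined["amount1"])   # works; combined["amount"] would be an error
```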
01:36 Instead of using rec_join() to get a subset of data, you can also loop through the contents. Say, for example, you wanted to find all the uncashed checks.
01:58 This list comprehension looks for all the IDs in issued, but not in cashed.
02:05 The result is a list of the outstanding IDs. Note that the data types are NumPy 64-bit ints. You can cast them if you want Python data types instead.
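A sketch of that comprehension, assuming the ID field is named id as in the hypothetical arrays above:

```python
# IDs that were issued but never cashed; "issued" and "cashed" are
# the structured arrays from the earlier sketch
outstanding = [check for check in issued["id"] if check not in cashed["id"]]
print(outstanding)                            # NumPy 64-bit ints
print([int(check) for check in outstanding])  # cast to plain Python ints
```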
02:22 NumPy's speed comes from doing things with its own mechanisms as much as possible. You can always loop through data like this, but it will typically be slower than using a NumPy-specific method instead.
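For example, the same outstanding-check query can stay inside NumPy. This uses setdiff1d(), which isn't part of the lesson, just one NumPy-native alternative to the loop:

```python
# setdiff1d() returns the sorted, unique values in issued["id"]
# that are absent from cashed["id"] -- no Python-level loop required
outstanding = np.setdiff1d(issued["id"], cashed["id"])
```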
In addition to having duplicate columns, you might have duplicate row data as well.
02:40 You can find this kind of data using the find_duplicates() function. One complication is that this function expects a masked array.
02:49 I briefly mentioned these when explaining filtering rows earlier. A masked array is a NumPy array with extra metadata that indicates whether a row is participating in a calculation or not.
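To make the mask idea concrete, here's a tiny sketch with made-up numbers:

```python
import numpy as np

# The mask marks entries to leave out of calculations
values = np.ma.masked_array([10, 20, 30], mask=[False, True, False])
print(values.sum())  # 40 -- the masked 20 doesn't participate
```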
03:01 You can convert your regular array into a masked one by passing it to the asarray() function in the masked module. If you don't want to find the duplicates but just want to remove them, you can use unique() instead of find_duplicates(). I've got another CSV file to play with.
03:20 This one is called issued_dupe.csv. I created it by copying issued_checks and duplicating one of the rows. Let's head into the REPL and clean out our dupe data.
03:32 That's our usual imports and a list of data type tuples from our CSV file. Now, I'll load it,
03:56 and there it is. Note that ID 1344 is in there twice. To find dupes, I use the find_duplicates() call, passing in a masked version of our array.
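Roughly what that session looks like; the dtype tuples here are my guess at the CSV's column layout, not the lesson's exact code:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical column layout for issued_dupe.csv
dtypes = [("id", np.int64), ("payee", "U30"), ("amount", np.float64)]
checks = np.genfromtxt(
    "issued_dupe.csv", delimiter=",", skip_header=1, dtype=dtypes
)

# find_duplicates() needs a masked array, so convert on the way in
dupes = rfn.find_duplicates(np.ma.asarray(checks))
print(dupes)  # both rows sharing ID 1344
```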
04:11 The result is a new masked array that shows the duplicate data. If instead I want to get rid of the duplicates, I can call unique(),
04:25 and this is the result, with only a single row containing ID 1344.
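And the deduplication step, continuing from the same checks array:

```python
# unique() treats each structured row as one element, so exact
# duplicate rows collapse to a single entry
deduped = np.unique(checks)
print(deduped)  # only one row with ID 1344 now
```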
04:33 That’s all for the second example. Next up, I’ll start the third example, charting hierarchical data.