Reconciling Structured Arrays
00:00 In the previous lesson, I introduced you to structured arrays in NumPy. In this lesson, I’ll show you how to cross-reference data across two structured arrays.
00:10 NumPy has a module that specializes in relating data between two structured arrays. This feature allows you to do something similar to an SQL join where you relate rows from two arrays that have something in common, like an ID number.
00:23 Consider a CSV file with all the checks that have been issued from a bank, and a corresponding CSV file with all those checks that have been cashed. The reconciliation process relates those rows that have an ID that match between the two sets of data.
00:38 In this case, that would allow you to look up the payee’s name for cashed checks, even though that information isn’t in the CSV file. On the right-hand side, if you’re coding along at home, you’ll need data similar to what’s on the screen. Like always, you can get that in the supporting materials dropdown.
00:55
Let’s go to the REPL and reconcile this data. import numpy as np
. The reconciliation functions are in their own module, so I’m going to import and alias that as well.
01:12 I’m still going to need a path for reading my CSVs like before, and I’ll get started by declaring the data types that will be used in our structured data.
01:38 These are the column types from the issued checks CSV file. Note that I’m using float for the money column. You should really never do that. Floats and money are bad, but it keeps the example simple, so I’m going to be bad.
01:51 I wouldn’t generally recommend doing money things in NumPy as it doesn’t deal well with decimal values.
02:09
These are the data types for the cashed checks CSV. And now I’ll read the issued_checks
CSV file into an array.
02:30 And similar for the cashed checks.
02:45 Can I tell you, as a Canadian, it hurts me to keep writing checks with that spelling? Here’s the issued check data, and here’s the cashed check data. Since both sets of data have an ID field that is common, that would be our check number, you can reconcile between the two.
03:04
The rec_join
function creates a new array by reconciling two existing ones.
03:20 The first argument here is the name of the field to base the reconciliation upon. That’s the check ID. Then the first and second arrays get passed to reconcile, and finally, how to do the join.
03:31
The join types are the same as in SQL. An inner join, like I’m using here, means to return elements that are common to both sets of data. Let’s look inside combined
, and you get the rows and columns that are in both datasets.
03:46 Since this is a NumPy array,
03:56 I can ask for a subset of the columns giving me the data I’m interested in, who the check was to, when it was issued, and when it got cashed. If you’re coding along with me, don’t close your REPL.
04:07 I’ll be using the same data in the next lesson. If you were paying close attention, you might be grumbling at the screen, but what about, yeah, I brushed over the fact that both datasets had columns with the same names.
04:21 In the next lesson, I’ll talk about dealing with duplicates.
Become a Member to join the conversation.