Adding Structure to Arrays

00:00 In the previous lesson, I finished off the first example set on multi-dimensional arrays. This lesson is the beginning of the second example, and is all about adding information to your NumPy arrays.

00:13 A common use case for NumPy arrays is to store data in rows and columns like a spreadsheet. In fact, this is such a common use case that there are other libraries, like pandas, which is built on top of NumPy, and Polars, which interoperates with it, that add more features of this type. Even without pandas or Polars though, you can add structural information to a NumPy array.

00:32 This allows you to name the columns, which makes it clearer just what you’re dealing with, especially since you can refer to them by name in your code.

00:43 Talking about the last name column is far clearer than index three. Let’s go to the REPL and look at some structured arrays. Adding structure to an array requires telling the array object what kind of data is in a column.

00:58 You do that by providing the dtype argument when you create the array. Importing numpy,

01:09 starting my array.

01:24 Like before, this is my array content. It consists of some racehorse names and their data. Just what is that data? Well, let’s add some structural information.

01:45 The dtype argument takes a list of tuples with one tuple for each column in your array. The first item in the tuple is the name of the column, while the second is an indicator of the type of the column.

01:57 This indicator can be a variety of things, but here I’m passing in a string that tells NumPy about the type. The U is for Unicode, and the 12 says that the column is a string of up to 12 characters. The f and i correspond to float and integer respectively,

02:13 while the 4 indicates that these are four bytes in length, which is 32 bits. When I look at the array, I now see both the contents and the associated type information.
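
The REPL input is not captured in this transcript, so here is a minimal sketch of what creating such a structured array might look like. The horse names, the numbers, and the race_time column name are assumptions made up for illustration; only horse_name and position are named in the lesson.

```python
import numpy as np

# Hypothetical race data: names and values are invented for illustration.
race_results = np.array(
    [
        ("At The Back", 1.2, 3),
        ("Fast Eddie", 1.3, 1),
        ("Almost There", 1.1, 2),
    ],
    dtype=[
        ("horse_name", "U12"),  # Unicode string of up to 12 characters
        ("race_time", "f4"),    # 4-byte (32-bit) float; column name is an assumption
        ("position", "i4"),     # 4-byte (32-bit) integer
    ],
)

# Displaying the array shows both the contents and the dtype.
print(race_results)
print(race_results.dtype.names)
```

Each tuple in the data becomes one row, and each tuple in dtype describes one named column.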

02:26 Since I have this information, I can now reference columns using the name provided in the data structure.

02:36 That’s far clearer than an index value. If you’re coding, you now know what information you’re referencing. This kind of indexing also allows you to get at a subset of the information.
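
As a rough illustration of name-based access, the sketch below indexes a structured array with a column name instead of a number. The data and the race_time column name are hypothetical stand-ins for whatever was typed in the REPL.

```python
import numpy as np

# Hypothetical data; only horse_name and position are named in the lesson.
race_results = np.array(
    [("At The Back", 1.2, 3), ("Fast Eddie", 1.3, 1), ("Almost There", 1.1, 2)],
    dtype=[("horse_name", "U12"), ("race_time", "f4"), ("position", "i4")],
)

# Index with the column name to get every value in that column.
names = race_results["horse_name"]
positions = race_results["position"]
print(names)
print(positions)
```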

02:46 Say you wanted to see which horse was first. You could sort.

02:55 But let’s also say you didn’t want the position field since the sort order is specifying that already. You can sort and then reference a set of columns.
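
One way to do the sort-then-select step is sketched below, assuming np.sort() with its order parameter was the approach used; the transcript does not show the exact call. The data and the race_time column name are hypothetical.

```python
import numpy as np

# Hypothetical data; only horse_name and position are named in the lesson.
race_results = np.array(
    [("At The Back", 1.2, 3), ("Fast Eddie", 1.3, 1), ("Almost There", 1.1, 2)],
    dtype=[("horse_name", "U12"), ("race_time", "f4"), ("position", "i4")],
)

# Sort the rows by the position column.
by_position = np.sort(race_results, order="position")

# Index with a list of column names to keep only those columns.
subset = by_position[["horse_name", "race_time"]]
print(subset)
```

Passing a list of names inside the square brackets returns a structured array that contains only the listed columns.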

03:13 NumPy also lets you filter content through the use of comparison operations inside of the square brackets.

03:25 This returned just the horse in first place. Note that the library is doing something a little tricky to get this to work. Normally, with a comparison operation, the comparison would happen before the square brackets were evaluated, turning the contents into true or false, and then you’d be indexing with either that true or that false. NumPy, however, overloads the comparison operation on its arrays.

03:48 So when you perform that kind of comparison, what is getting returned is known as a mask. A mask is a new array holding one true or false value per row, indicating whether that row meets the criteria in the comparison.

04:00 When you index with a mask, NumPy returns a new array with only those rows that are masked as true. A lot of data science libraries do this kind of thing.
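
To make the mask visible, the sketch below splits the filter into two steps: building the Boolean array, then indexing with it. The data and the race_time column name are hypothetical.

```python
import numpy as np

# Hypothetical data; only horse_name and position are named in the lesson.
race_results = np.array(
    [("At The Back", 1.2, 3), ("Fast Eddie", 1.3, 1), ("Almost There", 1.1, 2)],
    dtype=[("horse_name", "U12"), ("race_time", "f4"), ("position", "i4")],
)

# The comparison runs element by element and produces a Boolean array.
mask = race_results["position"] == 1
print(mask)

# Indexing with the mask keeps only the rows where the mask is True.
winner = race_results[mask]
print(winner)
```

Writing race_results[race_results["position"] == 1] on one line does the same thing without the intermediate variable.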

04:10 I have mixed feelings about it. For someone coming from regular Python, this magic can be a little weird. It does make for some brief code though, once you get used to it, and since the return result is still a structured array,

04:30 you can slice the horse_name column out of that similar to how I did when I sorted a few lines back. Note that this slice is also a structured array.

04:39 If you really want just the value, you have to dereference that with an index.

04:52 This result is the actual string contents, but because it’s NumPy, it’s a NumPy string, not a Python string. You can cast that to the Python data type if you need to.
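
A small sketch of getting from the filtered array down to a plain Python string follows. It assumes a single column name is used for the lookup, and the data and the race_time column name are hypothetical.

```python
import numpy as np

# Hypothetical data; only horse_name and position are named in the lesson.
race_results = np.array(
    [("At The Back", 1.2, 3), ("Fast Eddie", 1.3, 1), ("Almost There", 1.1, 2)],
    dtype=[("horse_name", "U12"), ("race_time", "f4"), ("position", "i4")],
)

# Filtering and then selecting a column still gives an array, here with one element.
winner_names = race_results[race_results["position"] == 1]["horse_name"]

# Index into that array to get the single value, which is a NumPy string.
first = winner_names[0]
print(type(first))

# Convert to a built-in Python string if other code needs one.
as_python = str(first)
print(as_python)
```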

05:05 Once you’ve got structured arrays, you can use the named columns to cross-reference between two or more datasets. This is called reconciling, and it’s what I’ll cover next.
