Populating Multidimensional Arrays

Christopher Trudeau

NumPy Techniques and Practical Examples (04:54)

Transcript

00:00 In the previous lesson, I showed you how to create multi-dimensional arrays in NumPy. In this lesson, I’ll show you how to populate those from multiple CSV files.

00:09 As I showed you in the last lesson, the NumPy array object is multi-dimensional, and you can see how many dimensions it has and how large each is by accessing its .shape attribute.

00:20 This is a tuple with each value in the tuple being the size of the corresponding dimension. Now, let’s say you actually want to create a two-dimensional array from some existing data, like two one-dimensional arrays or two lists.

00:33 One way to approach this problem is to use the concatenate() function, but the downside of this approach is that it requires three arrays in memory: the two one-dimensional arrays and the resulting combined one. For large datasets, this can mean a lot of memory and a bunch of compute time to do the copying.
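The lesson does not show code for this approach, so the following is only a minimal sketch of it. The array names and values are made up for illustration. Because concatenate() joins along an existing axis, each one-dimensional array is first given a leading axis of length one.

```python
import numpy as np

# Hypothetical one-dimensional inputs.
first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

# Give each input a leading axis of length 1, then join along that axis.
combined = np.concatenate((first[np.newaxis, :], second[np.newaxis, :]))

print(combined.shape)                      # (2, 3)
print(np.shares_memory(combined, first))   # False: the data was copied
```

At this point first, second, and combined all hold their own data, which is the three-arrays-in-memory cost described above.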

00:51 And of course, this just gets worse if you want to go into three dimensions, which is what I’m going to show you how to do. Say you have three CSV files each with row and column data, which when combined produce your desired three-dimensional array.

01:05 Instead of creating three two-dimensional arrays and combining them, you create a zeroed array of the correct size and then read each of the CSV files, overwriting the appropriate part of the 3D array with your data.

01:18 This approach takes a bit of computing power to replace the parts of the final array, but memory-wise, you have less space being taken up at any given time.
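As a rough back-of-the-envelope comparison, the sketch below counts bytes for both strategies. The sizes are assumptions (three files, each two rows by three columns of float64), and it ignores any temporary storage the file parser itself uses.

```python
import numpy as np

# Assumed sizes: 3 files, each 2 rows by 3 columns of float64 values.
slice_bytes = np.zeros((2, 3)).nbytes
result_bytes = np.zeros((3, 2, 3)).nbytes

# Load everything, then combine: all three slices coexist with the result.
combine_peak = 3 * slice_bytes + result_bytes

# Fill a preallocated array: the result plus one loaded slice at a time.
fill_peak = result_bytes + slice_bytes

print(slice_bytes, result_bytes)   # 48 144
print(combine_peak, fill_peak)     # 288 192
```

The gap widens as the number of files grows, since the fill strategy only ever holds one loaded slice besides the result.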

01:29 This slide visualizes our approach. Three files named file1.csv, file2.csv, and file3.csv contain slices of our final array.

01:37 These get combined into a three-dimensional chunk. To get the desired result, I’ll create an array filled with zeros, loop over each of our data files, and replace the data from there.

01:48 Let’s go to the REPL and try this out. If you’re following along, you’ll either need to create the three CSV files shown here on the screen, or you can download all the files, including the sample code, from the supporting material dropdown just below the video player on your screen.

02:05 The pathlib library has useful tools for dealing with file names, so I’ll start by importing the Path class from it.

02:13 Then, of course, I’ll need NumPy, once again aliasing it. Now, to get started, I need a three-dimensional array of the correct size. I’ll use the zeros() factory call, which I covered in the previous lesson, to create such a thing,

02:30 and there’s my array filled with zeros.
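The on-screen code is not part of this transcript. A sketch of the setup described so far might look like the following, assuming three files that each hold two rows and three columns; the shape (3, 2, 3) is that assumption, not something the transcript states.

```python
from pathlib import Path  # used later to locate the CSV files

import numpy as np

# Assumed shape: 3 files, each with 2 rows and 3 columns.
array = np.zeros((3, 2, 3))

print(array.shape)   # (3, 2, 3)
print(array.dtype)   # float64
```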

02:35 Python’s id() function shows a unique identifier for an object instance. I’m doing that now just to show you that no array copies will get created as I go along.

02:44 After I’m done messing about, our array will be the exact same one.
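To illustrate the point about id(), here is a small sketch (shape and values are assumptions). Assigning into a slice of an array writes into the existing object, so its id() does not change, whereas an arithmetic expression builds a new array object.

```python
import numpy as np

array = np.zeros((3, 2, 3))
before = id(array)

# Slice assignment writes into the existing array in place.
array[0] = np.ones((2, 3))
print(id(array) == before)   # True

# By contrast, arithmetic produces a separate array object.
other = array + 0
print(other is array)        # False
```

Note that id() identifies the Python object rather than its memory buffer, so it is a convenient sanity check here, not a full proof that no copying happened internally.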

02:50 Okay, now it’s time to loop over each of our three files and stick them in the array.

03:02 That’s a lot and a bit messy. Let’s go through it a bit at a time. About two-thirds of the way through our loop declaration is the Path class calling .cwd(). That’s short for current working directory, and it returns a path object for the directory that this code is running in.

03:18 That object then has the .glob() method called on it, which returns an iterator of path objects whose file names match a pattern. The pattern here is file?.csv, so our three files, file1.csv, file2.csv, and file3.csv, get matched and returned.

03:36 All of this is wrapped in enumerate(), meaning you’ll get a path object and a count for each of the matching files.
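The pattern matching and counting can be tried on their own. This sketch departs from the lesson in two labeled ways: it uses a temporary directory in place of Path.cwd() so it can run anywhere, and it wraps the results in sorted(), since .glob() does not promise any particular order. The extra file names are made up to show what the pattern skips.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)  # stands in for Path.cwd()
    names = ("file1.csv", "file2.csv", "file3.csv", "file10.csv", "notes.txt")
    for name in names:
        (folder / name).touch()

    # "?" matches exactly one character, so file10.csv is not matched.
    matches = sorted(folder.glob("file?.csv"))
    found = [(count, path.name) for count, path in enumerate(matches)]

print(found)   # [(0, 'file1.csv'), (1, 'file2.csv'), (2, 'file3.csv')]
```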

03:44 Inside the for loop,

03:51 I use the counter to replace one slice along the first dimension of the existing zeroed array with the results from NumPy’s loadtxt() call. As you might guess, this call loads data from a file, and with the delimiter parameter set to a comma, it’s essentially reading a CSV and returning a corresponding array. And that’s pretty much it.
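To see what loadtxt() returns by itself, the sketch below feeds it an in-memory text buffer instead of a file on disk. The CSV content is invented for illustration.

```python
import io

import numpy as np

# Hypothetical CSV content: 2 rows, 3 columns.
csv_text = io.StringIO("1,2,3\n4,5,6\n")

loaded = np.loadtxt(csv_text, delimiter=",")

print(loaded.shape)   # (2, 3)
print(loaded.dtype)   # float64
```

The values come back as floats by default, which matches the float64 array that zeros() creates.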

04:11 Our array now contains the data from the files. Let’s take a look. And there you go. Three beautiful dimensions populated from three different CSV files, and as you can see, the ID of the array object hasn’t changed, so no array copying has happened.
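Putting the pieces together, the whole procedure might look like the sketch below. It is a reconstruction from the description, not the code shown in the video. The file contents and the (3, 2, 3) shape are assumptions, a temporary directory stands in for Path.cwd(), and sorted() is added because .glob() does not guarantee order.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical file contents: each file holds 2 rows by 3 columns.
contents = {
    "file1.csv": "1,2,3\n4,5,6\n",
    "file2.csv": "7,8,9\n10,11,12\n",
    "file3.csv": "13,14,15\n16,17,18\n",
}

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)  # stands in for Path.cwd()
    for name, text in contents.items():
        (folder / name).write_text(text)

    array = np.zeros((3, 2, 3))
    before = id(array)

    # Each file overwrites one slice along the first axis, in place.
    for count, csv_file in enumerate(sorted(folder.glob("file?.csv"))):
        array[count] = np.loadtxt(csv_file, delimiter=",")

print(array[2])
print(id(array) == before)   # True
```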

04:30 The total memory of this approach is the final result array plus whatever temporary storage the loadtxt() call needs. That can still be a lot, but it’s far less than having every file’s array resident in memory at the same time as the combined result. Sometimes, you want to muck with the structure of your array.

04:51 In the next lesson, I’ll show you how that’s done.
