Avoiding Glob Pitfalls

00:00 In the previous lesson, I showed you how to use insert to create copies of arrays while changing their shape. In this lesson, I’ll be talking about why the assumptions I made about the glob call aren’t really all that great.

00:12 In the two previous lessons, I’ve used this call to get a listing of path objects for the CSV files in the current directory. The question mark in a glob is a wildcard, but for only a single character. This approach limits you to 10 files named zero through nine.
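The glob call itself isn't shown in this transcript, so here is a minimal sketch of the single-character limit. The `file?.csv` pattern, the temporary directory, and the file names are all assumptions made for illustration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical files: two single-digit names and one two-digit name
    for name in ["file1.csv", "file2.csv", "file10.csv"]:
        (folder / name).touch()

    # "?" matches exactly one character, so file10.csv is left out
    matched = sorted(p.name for p in folder.glob("file?.csv"))

print(matched)
```

With these three files, only `file1.csv` and `file2.csv` match, because `10` is two characters.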

00:29 You can change out the question mark for an asterisk, meaning any number of characters, but the order you get might surprise you. File names are text, and even numbers in them are text.

00:40 They get ordered by their Unicode code point numbers, not by their numeric equivalent. If you’re not careful, 10 can come before 2. To make all this worse, the order in which the results come back is up to your operating system, so you might not get the same order on two different computers.

00:58 In a second, I’ll head into the REPL to show you this problem and what you can do about it. One of the possible solutions is to use the natsort third-party library.

01:07 If you didn’t install it with the other dependencies I mentioned earlier, you’ll need to now. To demonstrate our challenge, I’ve created two more CSV files, one named file10.csv, and the other named file11.csv.

01:22 Technically speaking, they’re named file-one-zero and file-one-one, or at least that’s how their file names are going to get sorted.

01:35 The asterisk in the glob here will include the two new files as well as the original three. Inside the loop, I’ll just print out their names, and there you go: 11, 10, 1, 2, 3.

01:50 That’s not like any counting I’ve ever seen. I’m running this on a Mac. If you’re not, you might even get a different order.

01:57 And in case you think to yourself, oh, no problem, I’ll just sort them first.

02:12 Yeah, still not the desired result. Remember, file names are strings, so that isn’t 10, but the character one and then the character zero. This is an age-old problem in computing, and so there are libraries out there to help you with it.
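A small sketch of why plain `sorted` doesn't help, assuming file names that follow the `file<N>.csv` pattern described above. The paths are built in memory here rather than read from disk:

```python
from pathlib import Path

# Hypothetical names matching the five files in the lesson
paths = [Path(f"file{n}.csv") for n in (11, 10, 1, 2, 3)]

# sorted() compares the names character by character
ordered = [p.name for p in sorted(paths)]
print(ordered)

# The same effect on bare strings: "1" sorts before "2", so "10" < "2"
print("10" < "2")
```

The comparison stops at the first differing character, so `file10.csv` and `file11.csv` land ahead of `file2.csv`.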

02:27 The natsort library I mentioned earlier is a drop-in replacement for sorted that sorts in a more natural way.

02:45 All I’ve done here is replace the call to sorted with natsorted instead, and that’s the result you probably want.
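The lesson uses `natsorted` from natsort. To show the idea without the third-party dependency, here is a hand-rolled stand-in. `natural_key` is a hypothetical helper, not part of natsort, and it covers far fewer cases than the library does:

```python
import re
from pathlib import Path


def natural_key(path):
    """Split a file name into text and number chunks for sorting."""
    # re.split with a capture group alternates text, digits, text, ...
    chunks = re.split(r"(\d+)", path.name)
    # Odd positions hold digit runs: compare those as integers
    return [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]


# Hypothetical names matching the five files in the lesson
paths = [Path(f"file{n}.csv") for n in (11, 10, 1, 2, 3)]
natural = [p.name for p in sorted(paths, key=natural_key)]
print(natural)
```

Because the digit runs are compared as integers, 10 and 11 sort after 3. For real projects, natsort is probably the safer choice since it deals with signs, floats, and locale details that this sketch ignores.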

02:55 Another way around this is to zero-pad your file names. If you know how many files you have and can provide leading zeros, your ASCII/Unicode code point sorting will correspond to the numbers.
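As a rough sketch of zero-padding, assuming a width of two digits (enough for up to 99 files) and the same hypothetical `file<N>.csv` naming:

```python
# Format each number with a leading zero so every name has the same width
names = [f"file{n:02d}.csv" for n in (11, 10, 1, 2, 3)]

padded = sorted(names)
print(padded)
```

With every number the same width, character-by-character comparison and numeric order agree. The width has to be chosen up front, which is why you need to know roughly how many files you'll have.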

03:08 Even with this though, you want to be a bit careful. The glob method’s default behavior is to use the operating system’s preference for case sensitivity, which by the way is different on different operating systems.

03:21 You can adjust this behavior with the case_sensitive argument to the call if you want. All in all, you want to be very careful with any assumptions you make about the order glob returns, and sorting the results before you use them is probably best if the order you load the files in affects your results.
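A hedged sketch that combines both pieces of advice. The `case_sensitive` argument to `Path.glob()` only exists in Python 3.12 and later, so the hypothetical helper `sorted_csv_paths` falls back to the platform default on older versions. The file names are made up for illustration:

```python
import sys
import tempfile
from pathlib import Path


def sorted_csv_paths(folder, case_sensitive=None):
    """Return matching CSV paths in a predictable, sorted order."""
    if sys.version_info >= (3, 12):
        # None keeps the platform default; True or False overrides it
        matches = folder.glob("*.csv", case_sensitive=case_sensitive)
    else:
        # Older Pythons lack the argument, so the default applies
        matches = folder.glob("*.csv")
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["a.csv", "B.CSV"]:
        (folder / name).touch()

    strict = [p.name for p in sorted_csv_paths(folder, case_sensitive=True)]
    relaxed = [p.name for p in sorted_csv_paths(folder, case_sensitive=False)]

print(strict)
print(relaxed)
```

On Python 3.12 or later, the strict call should match only `a.csv`, while the relaxed call should also pick up `B.CSV`. Sorting inside the helper means callers never depend on the order the operating system happens to return.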

03:39 That’s it for the array from file section. In the next section, I’ll show you how to add some structural information to your NumPy arrays.
