Traversing Directory Trees
00:00
In this lesson, I’ll show you how to traverse entire directory trees and process the files that you find. That’s distinct from getting a directory listing, in that when you get a directory listing with something like os.listdir(), you need to do some extra work to process all the subdirectories as well.
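To make that "extra work" concrete, here is a minimal sketch of a hand-written recursive listing built only on os.listdir(). The helper name list_all_files and the tiny sample tree are made up for illustration; they are not part of the lesson's files.

```python
import os
import tempfile


def list_all_files(root):
    """Recursively collect file paths under root using only os.listdir()."""
    found = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # The extra work: descend into each subdirectory by hand.
            found.extend(list_all_files(path))
        else:
            found.append(path)
    return found


# Hypothetical sample tree, built in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "folder_1"))
    for rel in ("a.txt", os.path.join("folder_1", "one.py")):
        with open(os.path.join(root, rel), "w"):
            pass
    relative_files = [os.path.relpath(p, root) for p in list_all_files(root)]

print(relative_files)
```

The recursion, the path joining, and the file-or-directory check are the parts you would otherwise have to write yourself.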
00:16
But I’ll be showing you the os.walk() function, which lets you walk an entire directory tree with very little work. As I mentioned, I’ll mostly be using os.walk(), which takes a directory path, which is the root of the traversal, and then a parameter called topdown, which says whether each directory is processed before its subdirectories or after them.
00:41
I’ll show you what that means in the terminal. When you pass in that directory name and the topdown parameter, you get an iterator of tuples.
00:50 Each tuple contains the current directory’s path (a string), a list of the names of its subdirectories (also strings), and a list of the names of the files in that directory. So on each iteration, the iterator splits things up into nice, easy pieces: the current directory path, the directories, and the files.
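As a small sketch of that tuple shape, the snippet below pulls just the first item out of os.walk(). The temporary directory and the names folder_1 and notes.txt are assumptions made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one file and one subdirectory at the top.
    os.mkdir(os.path.join(root, "folder_1"))
    with open(os.path.join(root, "notes.txt"), "w"):
        pass

    # Each item is a 3-tuple: (current path, subdirectory names, file names).
    cur_dir, sub_dirs, files = next(os.walk(root, topdown=True))
    first_is_root = cur_dir == root

print(type(cur_dir).__name__, sub_dirs, files)
```

Note that sub_dirs and files hold bare names, not full paths, so you would join them with cur_dir before opening anything.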
01:13 Let’s take a look at how it works in the Python REPL, after looking at the sample directory. The sample directory is pretty simple: a couple of text files, and then the two folders, both of which have Python files with different names in them.
01:29
I am in the directory that I said I would be in: a couple of text files, a couple of folders. And if I call os.walk()...
01:38
well, first let’s talk a little about the parameters. As I said, there’s the top and the topdown parameter. There are also a couple of others that I would encourage you to look up in the documentation because I’m not going to talk much about them.
01:49
There’s onerror, which is a function that says what to do on error and defaults to None. Then there’s a followlinks parameter, which just says whether to follow symbolic links or not; symbolic links are kind of like links to other directories.
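If you do want a quick feel for those two parameters, here is a hedged sketch. It assumes a deliberately missing path (does_not_exist is made up) and a hypothetical callback named remember_error; os.walk() passes the callback the OSError it hit while listing a directory.

```python
import os
import tempfile

errors = []


def remember_error(err):
    # os.walk() calls this with the OSError raised while listing a directory.
    errors.append(err)


with tempfile.TemporaryDirectory() as root:
    missing = os.path.join(root, "does_not_exist")  # hypothetical path
    # followlinks=False is the default; it's spelled out here for visibility.
    visited = list(os.walk(missing, onerror=remember_error, followlinks=False))

print(visited, [type(e).__name__ for e in errors])
```

With onerror left as None, the same call would simply yield nothing and the error would be ignored silently.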
02:01
I won’t talk much about them. Definitely take a look on your own if you’re interested. So, I’m just going to call it on the current directory and then I’m going to leave topdown as True. And, as you can see, it’s a generator object, which means that you have to iterate through it if you really want to get much out of it.
02:16
So, for cur_dir, sub_dirs, files (and this is just the order that the iterator returns these things in) in os.walk("./"), with the current directory as the top.
02:33
I’m going to do a little printing logic here, and I’ll say f"Processing {cur_dir}", and then I’m actually just going to print out the sub_dirs.
02:46
I’ll print a little string, just so that you can see what it is. So, print("Sub-dirs", sub_dirs), and then I’ll print("Files", files).
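Put together, the loop typed in the REPL looks roughly like the sketch below. It walks a temporary stand-in for the sample directory rather than "./", and the file names in SAMPLE are assumptions, since the lesson doesn't spell them out.

```python
import os
import tempfile

# Hypothetical stand-in for the sample directory: two text files and
# two folders that each hold a Python file.
SAMPLE = [
    "file_1.txt",
    "file_2.txt",
    os.path.join("folder_1", "a.py"),
    os.path.join("folder_2", "b.py"),
]

with tempfile.TemporaryDirectory() as root:
    for rel in SAMPLE:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w"):
            pass

    seen = []
    for cur_dir, sub_dirs, files in os.walk(root):
        print(f"Processing {cur_dir}")
        print("Sub-dirs", sub_dirs)
        print("Files", files)
        # Keep a sorted record, since listing order can vary by file system.
        seen.append((os.path.relpath(cur_dir, root), sorted(sub_dirs), sorted(files)))
```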
02:59
As you can see, first it processes the current directory with two subdirectories and two files. Then it processes folder_2/, which is just the first folder in the sub_dirs list, and has no subdirectories, but a few files.
03:12
And then the same thing for folder_1/. Now, if I do the same thing and I pass in topdown=False, then you’ll see the behavior change just a little bit.
03:24 Now it goes bottom-up, so it processes the children of the current directory first and the current directory itself last.
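Here is a minimal sketch comparing the two settings on a made-up tree of two empty folders. It only records the order in which directories are visited.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical tree: two empty folders under the root.
    os.makedirs(os.path.join(root, "folder_1"))
    os.makedirs(os.path.join(root, "folder_2"))

    top_down = [os.path.relpath(d, root) for d, _, _ in os.walk(root, topdown=True)]
    bottom_up = [os.path.relpath(d, root) for d, _, _ in os.walk(root, topdown=False)]

print("topdown=True: ", top_down)   # the root "." comes first
print("topdown=False:", bottom_up)  # the root "." comes last
```

Both runs visit the same directories; only the position of each parent relative to its children changes.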
03:31 One other thing I want to do about the ordering is a quick little exploration to show you whether this traversal behaves like a depth-first or a breadth-first search. Simply put, does this walking procedure go all the way down through the descendants of a given child before it moves on to the next child?
03:51
Or, does it process in order of the children, then their children, and so on? And I think this will become clear to you if I make one more subdirectory, which is, let’s say, "folder_2/sub",
04:04
so I’ll give folder_2/ a subfolder. Then I’m going to run the same thing here: os.walk() with topdown=True. You’ll see that first, it processes the current folder, then it processes folder_2/, then folder_2/’s child.
04:21
It goes all the way down this path before coming back to folder_1/, the next child of the original folder. So, if you’re familiar with DFS and BFS, that means this is a DFS order.
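That experiment can be sketched as follows, again on a temporary tree that mirrors the folder names from the lesson. Which of folder_1 and folder_2 comes first depends on the file system, so the sketch only checks what happens right after folder_2.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "folder_1"))
    os.makedirs(os.path.join(root, "folder_2", "sub"))

    order = [os.path.relpath(d, root) for d, _, _ in os.walk(root, topdown=True)]

print(order)

# In a depth-first walk, folder_2's child is visited right after folder_2,
# before the walk moves on to any sibling of folder_2.
position = order.index("folder_2")
child_follows_parent = order[position + 1] == os.path.join("folder_2", "sub")
print(child_follows_parent)
```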
04:33 If you’re not familiar with that, be on the lookout for tutorials on the subject from Real Python, or from any other source that you use, and take a look at those because it’s a really foundational computer science concept and the behavior will be very predictable once you understand how the DFS works.
04:50
So, that’s os.walk(). I find it really convenient, not only because of its awesome recursive behavior, but also just because it splits up the traversal so nicely into your current directory and then splits the directories and files into two separate lists really quickly and easily.
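One way to get just that split for a single directory, without descending any further, is to take only the first tuple from the generator. This is a sketch; split_listing is a hypothetical helper name and the sample tree is made up.

```python
import os
import tempfile


def split_listing(path):
    """Return (sub_dirs, files) for path only, without descending.

    Raises StopIteration if path cannot be listed, because os.walk()
    then yields nothing.
    """
    _, sub_dirs, files = next(os.walk(path))
    return sorted(sub_dirs), sorted(files)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical tree: one top-level file and one nested file.
    os.mkdir(os.path.join(root, "folder_1"))
    for rel in ("readme.txt", os.path.join("folder_1", "nested.py")):
        with open(os.path.join(root, rel), "w"):
            pass
    top_dirs, top_files = split_listing(root)

print(top_dirs, top_files)
```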
05:06 So I like to use this even when I don’t need recursion. In the next lesson, I’m going to cover temporary files and temporary directories, which can also be really useful, especially in testing constructs.