Matching Filename Patterns
00:00 You now have the ability to create and manipulate files and directories in a number of different ways, but a convenient thing to have access to would be some way to filter for just the subset of files that you want to interact with, based on characteristics of those files.
00:16
Let’s say you want all files with a .txt extension, or something like that. Well, the way to do that is using something called filename pattern matching.
00:25 In this lesson, I’ll go over the main ways to do that in Python.
00:30
The different functions that I’ll be using for pattern matching are kind of listed in an ascending order of complexity, but also of convenience. So first off, there’s .startswith()
and .endswith()
, which just operate on strings.
00:44
They’re built-in methods of Python’s str type. Those can be useful when you’re dealing strictly with filenames. So, you could say filename.endswith('.txt') or something like that; they take in substring parameters.
00:56
Then you have fnmatch.fnmatch()
, which takes in a filename and then a pattern. That pattern is of the general form of the Unix or Bash shell, the way that you match filename patterns with that shell.
01:11
Then it simply returns whether the filename matches that pattern. So, it’s a little bit more complex, but a little bit more useful than .startswith()
and .endswith()
because it can deal with patterns in the middle of strings and more sophisticated patterns than just plain text. Then you have glob.glob()
.
01:28
glob
stands for global, and it just takes in a search pattern and it returns a list of all the files in the current directory that match that pattern.
01:36
So it’s even a little bit more convenient than fnmatch
because you don’t have to loop through the files. You just put in the pattern. But it’s also not quite as simple to use because you have to make sure your pattern takes into account all the different possible files that you could be dealing with. And then pathlib.Path.glob() works similarly to glob.glob(), and it just operates on a Path object, as usual.
01:58 So, let’s take a look at how these work in the REPL. The sample directory that I’ll use for this has a bunch of similarly named files that’ll be useful for taking a look at with pattern matching.
02:11 I have all of my imports up here and of course, I’ve imported all of these just because I want to demonstrate them all. In reality, you probably only want one, whichever one you like best.
02:20 I’ll just list the things in the directory. As you can see, you’ve got a lot of data files, text files, a couple of Python files, and then a subdirectory that also has some Python files in it.
02:31
The first thing that you might want to do is maybe you want to just isolate all the files that end in .txt
, and that’s quite simple to do with just basic string functions.
02:41
You can say for fname in os.listdir():, and then relying on the fact that listdir() just returns a list of strings, you can say if fname.endswith(".txt"): print(fname).
02:58 As you see, that gives you all of the text files that you could want. And it’s really simple to use and pretty easy, especially if you are familiar with string methods.
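As a rough sketch of the loop just described, the snippet below builds a throwaway directory so it can run anywhere. The filenames in sample_names are placeholders chosen for illustration, not the actual files from the lesson's sample directory.

```python
import os
import tempfile

# Hypothetical sample files; these names are placeholders for illustration.
sample_names = ["data_01.txt", "data_01_backup.txt", "admin.py", "tests.py"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty files so there is something to list.
    for name in sample_names:
        open(os.path.join(tmp_dir, name), "w").close()

    # os.listdir() returns plain strings, so str methods apply directly.
    txt_files = sorted(
        fname for fname in os.listdir(tmp_dir) if fname.endswith(".txt")
    )

print(txt_files)
```

Swapping .endswith(".txt") for .startswith("data") filters on the beginning of the name instead.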
03:07
And then, of course, you can do the same thing with .startswith()
. You could say .startswith("data")
, .startswith("t")
, whatever you want to do.
03:13 I’ll leave you to try that out a little bit on your own. But, as you can see, this might fail. It starts to become a little bit more difficult to work with when you want to say something like, “Well, what if I want to match for something that’s in the middle of the string,” right?
03:25
“What if I want to find all files that contain the word 'backup'
,” or something like that, right? It’s not clear how you would get that with just .endswith()
or .startswith()
.
03:34
You could use something like the in operator, as in "backup" in filename, to check whether this is a substring, but that starts to become kind of complex, and it starts to become a bit of a maintenance hassle, where you start to say, “Well, oh, what am I looking for?
03:48
What is a substring of what? How can I look for this? How do I easily check if it’s a match?” A better way to do that is with fnmatch
. So you can say fnmatch.fnmatch()
, and you just pass in some name, so let’s say something like "data"
—let’s use one of these actual files here—so, "data_01_backup.txt"
.
04:08 Then you just pass in a pattern, here "*backup*". And this gives you access to all kinds of awesome things like wildcard characters. The wildcard character * stands for any sequence of characters, including no characters at all.
04:22
It matches anything before the "backup"
and anything after "backup"
. So in this case, this returns True
because "data_01_backup.txt"
does contain "backup"
, right?
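A minimal sketch of that single check, next to a plain substring test for comparison. The second filename is a made-up counterexample.

```python
import fnmatch

name = "data_01_backup.txt"

# Plain substring test with the `in` operator.
contains_backup = "backup" in name

# Shell-style pattern: * matches any run of characters, including none.
matches_pattern = fnmatch.fnmatch(name, "*backup*")

# A hypothetical name without "backup" in it does not match.
other_matches = fnmatch.fnmatch("data_01.txt", "*backup*")

print(contains_backup, matches_pattern, other_matches)
```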
04:32
Then, if you want to filter it for everything in there, you could just say for fname in os.listdir():
if fnmatch.fnmatch(fname, "*backup*")
,
04:47
and then, of course, you have to actually have a :
, because that’s how print statements work—or that’s how, if statements work, I should say.
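Written out in full, the loop might look like the sketch below. It runs against a temporary directory with placeholder filenames (an assumption for illustration) rather than the lesson's sample directory.

```python
import fnmatch
import os
import tempfile

# Hypothetical sample files; these names are placeholders for illustration.
sample_names = ["data_01.txt", "data_01_backup.txt", "data_02_backup.txt", "tests.py"]

with tempfile.TemporaryDirectory() as tmp_dir:
    for name in sample_names:
        open(os.path.join(tmp_dir, name), "w").close()

    backups = []
    for fname in os.listdir(tmp_dir):
        # Note the colon that ends the if statement.
        if fnmatch.fnmatch(fname, "*backup*"):
            backups.append(fname)

# os.listdir() makes no ordering promise, so sort before printing.
backups.sort()
print(backups)
```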
04:54
As you can see, this isolates all of the things with backup
in it. So, that’s pretty darn useful, and you can also use fnmatch
for something, like, you can start to use something called a character class.
05:05
You can check, for example, for anything that contains a two-digit number in it. And so again, this result is True
because there is a two-digit number.
05:15
But if I delete one of these digits, then I get False
, because there’s no longer a two-digit number in there and you need to match both of these character classes.
05:23 But then if I just delete one of the character classes, the pattern will match anything with a single-digit number in it. So this is convenient and useful, but what it doesn’t let you do is easily search through your whole directory.
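The exact patterns typed in the video are not captured in this transcript, so the ones below are an assumption: [0-9] is the usual shell-style character class for a single digit. The filenames are also placeholders. The sketch uses fnmatch.filter(), which applies one pattern to a whole list of names.

```python
import fnmatch

# Hypothetical filenames for illustration.
names = ["data_01_backup.txt", "data_1.txt", "notes.txt"]

# Two character classes in a row require two consecutive digits.
two_digits = fnmatch.filter(names, "*[0-9][0-9]*")

# A single character class only requires one digit somewhere in the name.
one_digit = fnmatch.filter(names, "*[0-9]*")

print(two_digits)
print(one_digit)
```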
05:35
You have to do this looping logic yourself. The way to avoid that is with glob, which is short for global.
05:42
And with glob.glob()
, you just pass in a pathname
and you tell it whether you want it to be recursive or not. So, this pathname
is really a pattern, like one of the ones that you did with fnmatch.fnmatch()
.
05:54
So with that, I could say something like, “Oh, well, let’s take a look at the backups.” I can find everything with "backup"
in the filename in the current directory, right?
06:04
Then I can do all of the same things that I could do with .startswith()
or .endswith()
, as well, by just using a wildcard character.
06:11
I can find all .py
(Python) files.
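Here is one way those two searches could look. To stay self-contained, the sketch globs inside a temporary directory instead of the current directory, so the directory is joined onto the pattern and the results are reduced to base names. The filenames are placeholders.

```python
import glob
import os
import tempfile

# Hypothetical sample files; these names are placeholders for illustration.
sample_names = [
    "data_01.txt", "data_01_backup.txt", "data_02_backup.txt",
    "admin.py", "tests.py",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    for name in sample_names:
        open(os.path.join(tmp_dir, name), "w").close()

    # glob.escape() guards against special characters in the temp path itself.
    base = glob.escape(tmp_dir)

    backups = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(base, "*backup*"))
    )
    py_files = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(base, "*.py"))
    )

print(backups)
print(py_files)
```

When the files are in the current working directory, glob.glob("*backup*") on its own is enough.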
06:14
And now, what if you want to find these recursively? Because if I call os.listdir()
on the "sub_dir"
I’ll see that it has some Python files in it, as well.
06:23
I want to find all the Python files, you know, in all of the subdirectories. Well, that’s also relatively easy to do with the recursive=True
option.
06:33 There’s one other thing you have to do, as well, which is you have to specify that it can be anything in any directory. The way to do that is with a double wildcard and then a slash.
06:42
So this will match anything in any directory. And then, of course, you’ll get both of the Python files in the subdirectory as well. As you might imagine, the pathlib.Path()
option works really similarly.
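The recursive glob.glob() call described above can be sketched as follows. The directory layout is a stand-in built in a temporary directory, and the results are converted to relative paths with forward slashes so that the output looks the same on every platform.

```python
import glob
import os
import tempfile

# Hypothetical layout: two top-level scripts plus two in a subdirectory.
relative_names = [
    "admin.py", "tests.py", "data_01.txt",
    "sub_dir/file1.py", "sub_dir/file2.py",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    os.mkdir(os.path.join(tmp_dir, "sub_dir"))
    for rel in relative_names:
        open(os.path.join(tmp_dir, *rel.split("/")), "w").close()

    # "**" followed by a separator means "this directory and any below it",
    # but only when recursive=True is passed.
    pattern = os.path.join(glob.escape(tmp_dir), "**", "*.py")
    found = glob.glob(pattern, recursive=True)

    py_files = sorted(
        os.path.relpath(p, tmp_dir).replace(os.sep, "/") for p in found
    )

print(py_files)
```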
06:54
You just have to create the Path
object from the current directory, and then you can say path.glob(). With this one, there’s no recursive parameter: a ** in the pattern is always treated recursively.
07:06 That’s convenient because you don’t even need to pass in another parameter. But you do need to close all your strings, of course. As you can see, it returns a generator.
07:14 You can either convert that to a list and deal with a little memory overhead, or you can loop through it. I’m going to convert to a list, just because that’s a little more convenient to do in a tutorial.
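A small sketch of the pathlib version, again using a temporary directory with placeholder filenames. It also shows that a pattern without ** only looks at the top level, and it uses .relative_to() and .as_posix() to print paths relative to the base directory with forward slashes.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: two top-level scripts plus two in a subdirectory.
relative_names = [
    "admin.py", "tests.py", "data_01.txt",
    "sub_dir/file1.py", "sub_dir/file2.py",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    base_dir = Path(tmp_dir)
    (base_dir / "sub_dir").mkdir()
    for rel in relative_names:
        (base_dir / rel).touch()

    # .glob() returns a generator; "**/" makes the search recursive.
    matches = base_dir.glob("**/*.py")
    recursive = sorted(p.relative_to(base_dir).as_posix() for p in matches)

    # Without "**", only the top-level directory is searched.
    top_level = sorted(p.name for p in base_dir.glob("*.py"))

print(recursive)
print(top_level)
```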
07:24
As you can see, it does the same thing as glob.glob()
, it just returns a generator object and always treats ** in the pattern recursively. So, those are several different ways to pattern match in Python, and I would encourage you to also look at the Bash shell ways of file pattern matching as well, because that will show you some even more useful little tricks that you can use with wildcard characters, with optional characters, with character classes.
07:48
You can do a whole bunch of amazing stuff with it. Just check that out on your own. In the next lesson, I’m going to cover the os.walk()
function which lets you recursively walk over file trees and process the files as you like.
tonypy on March 12, 2023
One other observation. Using
glob.glob("**/*.py", root_dir=base_dir, recursive=True)
in the example given produces [‘admin.py’, ‘tests.py’, ‘sub_dir\file1.py’, ‘sub_dir\file2.py’]. Is there an easy way to tidy this list up so that the directory separators are either ‘\’ or ‘/’?
tonypy on March 13, 2023
Regarding the comment above. This should have read that the example given produces
[‘admin.py’, ‘tests.py’, ‘sub_dir\\file1.py’, ‘sub_dir\\file2.py’]
Is there an easy way to tidy this list up so that the directory separators are either ‘\’ or ‘/’?
tonypy on March 13, 2023
One final question regarding pathlib. Using the example I can get file names using
[file.name for file in base_dir.glob("**/*.py")]
The result is
['admin.py', 'tests.py', 'file1.py', 'file2.py']
What I can’t see is a structure to get the equivalent of glob.glob() which gives the result relative to the defined reference path which in this case is the directory ‘Lesson 6’. That would give the result
[‘admin.py’, ‘tests.py’, ‘sub_dir\file1.py’, ‘sub_dir\file2.py’]
Any suggestions?
tonypy on March 13, 2023
Following on from above, I did determine that the following works
for pyfile in base_dir.glob("**/*.py"):
    pyfile_rel = os.path.relpath(pyfile, base_dir)
    print(pyfile_rel)
Where base_dir = Path(r"D:\Python\Real Python…\Lesson 6"). The output is then as expected, although not in a list:
admin.py
tests.py
sub_dir\file1.py
sub_dir\file2.py
However, not very elegant. Any ideas on improving?
Martin Breuss RP Team on March 14, 2023
@tonypy hi, nice research! :D
I’m not on a Windows machine to check for path representation of glob.glob()
, but you see the double-backslash characters because Python needs to escape backslash characters. So, in a normal string that’s the way they’ll show up.
pathlib
solves this issue by a layer of abstraction around paths. When you work with pathlib
, then a path isn’t a Python string, but a Path
object instead. That gives you a lot of additional possibilities.
Two things that I wanted to pick up from your previous comments:
Recursive Search with .rglob()
You can make recursive search even more clear when you work with Path
objects by using .rglob("*")
:
>>> [file.name for file in base_dir.rglob("*.py")]
['admin.py', 'tests.py', 'file1.py', 'file2.py']
If you use .rglob()
instead of .glob()
, then you can omit the **/
part of the pattern. The method specifically does a recursive search.
Relative Paths with pathlib
You can achieve the same behavior that you’re looking for from glob.glob()
also with pathlib
, using .relative_to():
>>> [pyfile.relative_to(base_dir) for pyfile in base_dir.rglob("*.py")]
[PosixPath('admin.py'),
PosixPath('tests.py'),
PosixPath('sub_dir/file2.py'),
PosixPath('sub_dir/file1.py')]
And if you wanted to show only the string representation of these Path
objects, then you could wrap them into str()
:
>>> [str(pyfile.relative_to(base_dir)) for pyfile in base_dir.rglob("*.py")]
['admin.py', 'tests.py', 'sub_dir/file2.py', 'sub_dir/file1.py']
Hope that helps! If you enjoy pathlib
(I do!), then you can check out the following resources we have on the site:
tonypy on March 14, 2023
Martin,
Many thanks for your feedback and suggestions, they were very useful. I did have realpython.com/courses/pathlib-python/ bookmarked so will be taking that soon.
anaghost on May 9, 2023
it would be nice to add how to deal with shutil being unable to delete dirs when there are permissions issues.
tonypy on March 12, 2023
When using Windows, and where the path to the target directory has been defined, e.g. base_dir = Path(r"D:\Python\Real Python...\Lesson 6"), then as this is a WindowsPath I found two options.
Using glob.glob(os.path.join(base_dir, "*backup*")) gives the full path for each file, e.g. "D:\Python\Real Python...\Lesson 6\data_01_backup.txt".
To avoid that and just generate the file names matching the pattern, consider the use of glob.glob("*backup*", root_dir=base_dir), which then produces the desired list of just the file names, e.g. [‘data_01_backup.txt’, ‘data_02_backup.txt’, ‘data_03_backup.txt’].
Is that an acceptable approach within the context of the specification for glob.glob()? Or is there a better way to get just the file names matching the pattern in the target directory?