Exploring Real World Use Cases for .strip()
00:00 We’re back in the REPL for this example, which is all about cleaning messy data. Suppose you work for an organization where field agents investigate paranormal events and submit case reports.
00:10 It’s a form submission and they’re responsible for filling out the information correctly. Unfortunately, humans don’t always do a great job of this. So the raw data for one such submission might look something like this.
00:22 It’s a dictionary called data. The first key, the string "file_name", has a string value that starts with a few spaces, followed by "OREGON Report 1993.pdf" and a couple more spaces at the end.
00:35 The string "source" is the next key, and its value is a string containing only a space, no other content. The last key is the string "tags", and it contains a list of strings: "confidential" with extra whitespace around it, "ufo" with a space and a tab character around it, and "abduction" with a leading newline and some trailing spaces. Looks just like data provided by an unreliable human.
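As a rough sketch, the submission described here might look like the following in the REPL. The narration doesn't spell out the exact number of spaces or which side the tab sits on, so the whitespace below is an assumption made for illustration.

```python
# Raw form submission as described in the narration.
# The exact amount and placement of whitespace is assumed.
data = {
    "file_name": "   OREGON Report 1993.pdf  ",
    "source": " ",
    "tags": ["  confidential  ", " ufo\t", "\nabduction   "],
}

# The !r conversion prints each value's repr, which makes the
# otherwise invisible whitespace characters visible
for key, value in data.items():
    print(f"{key}: {value!r}")
```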
01:00 So first, for the filename to be valid, you want to get rid of those leading and trailing spaces, replace the interior spaces with underscores, and also make everything lowercase.
01:10 Any ideas on how you could achieve that? You can apply multiple string methods in a single line of code by chaining them together. Start by calling .strip(), then replace any remaining spaces with an underscore using the .replace() method, and then call the .lower() method to make sure everything is lowercase. data at the key "file_name" equals data at the key "file_name".strip()
01:33 .replace(), passing in a space for the first argument and an underscore for the second, and then calling .lower().
01:42 Now inspect the result: data at the key "file_name" is oregon_report_1993.pdf. Perfect. And note that it’s important that you called .replace() after calling .strip() because you only wanted the interior spaces to be replaced by underscores.
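Written out, the chained call might look like the sketch below. The amount of whitespace in the raw value is assumed, and the wrong_order variable is a hypothetical name added only to show what happens when .replace() runs before .strip().

```python
# Whitespace in the raw value is assumed for illustration
data = {"file_name": "   OREGON Report 1993.pdf  "}

# Strip the outer whitespace first, then swap interior spaces, then lowercase
data["file_name"] = data["file_name"].strip().replace(" ", "_").lower()
print(data["file_name"])  # oregon_report_1993.pdf

# Calling .replace() first turns the outer spaces into underscores,
# which .strip() then leaves alone because they are no longer whitespace
wrong_order = "   OREGON Report 1993.pdf  ".replace(" ", "_").strip().lower()
print(wrong_order)  # ___oregon_report_1993.pdf__
```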
01:59 Next, the "source" field shouldn’t be left blank, but they submitted data with a space instead of filling out a name. What you can do is use .strip() with a conditional to see if the field is actually blank, and if so, replace it with a placeholder value.
02:13 if not data at the key "source".strip()
02:17 then data at the key "source" equals the string "UNKNOWN", and see the result.
02:27 data at the key "source" has been replaced with "UNKNOWN". Because .strip() removes whitespace, the result is an empty string, which in Python is falsy.
02:35 So when combined with not, the condition evaluates to True and data at the key "source" is reassigned to the string "UNKNOWN".
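A minimal sketch of that check follows, with the two intermediate print() calls added only to make the falsy empty string visible.

```python
data = {"source": " "}

# Stripping a whitespace-only string leaves "", and an empty string is falsy
print(repr(data["source"].strip()))  # ''
print(bool(data["source"].strip()))  # False

# not "" is True, so the placeholder is assigned
if not data["source"].strip():
    data["source"] = "UNKNOWN"

print(data["source"])  # UNKNOWN
```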
02:43 And finally, some tags were added, but of course there’s also a bunch of whitespace that needs to be removed. To clean up everything in one go, you can use a list comprehension to call .strip() on each element of "tags" and return a new list of clean strings.
02:57 data at the key "tags" equals the list comprehension [tag.strip() for tag in data at the key "tags"].
03:07 And check it out: data at the key "tags" is now "confidential", "ufo", and "abduction". Nice. And with that, the data is clean and nicely formatted.
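The list comprehension might be typed as in the sketch below. The exact whitespace in each tag is assumed, but since .strip() with no argument removes spaces, tabs, and newlines alike, the cleaned result is the same either way.

```python
# Whitespace around each tag is assumed for illustration
data = {"tags": ["  confidential  ", " ufo\t", "\nabduction   "]}

# Build a new list in which every tag has its outer whitespace removed
data["tags"] = [tag.strip() for tag in data["tags"]]
print(data["tags"])  # ['confidential', 'ufo', 'abduction']
```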
03:18 Of course, if you really want to go above and beyond at your job, you take all of this logic we’ve written and refactor it into a reusable data cleaning function.
03:26 But I’ll leave that as an exercise for you.
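If you want something to compare your own attempt against, here is one possible shape for that exercise, not the course's official solution. The function name clean_report and the placeholder parameter are hypothetical, and the whitespace in the sample input is assumed.

```python
def clean_report(raw, placeholder="UNKNOWN"):
    """Return a cleaned copy of a report dict without modifying the original."""
    source = raw["source"].strip()
    return {
        # Strip first so that only interior spaces become underscores
        "file_name": raw["file_name"].strip().replace(" ", "_").lower(),
        # Fall back to the placeholder when the stripped source is empty
        "source": source or placeholder,
        "tags": [tag.strip() for tag in raw["tags"]],
    }


raw = {
    "file_name": "   OREGON Report 1993.pdf  ",
    "source": " ",
    "tags": ["  confidential  ", " ufo\t", "\nabduction   "],
}

cleaned = clean_report(raw)
print(cleaned)
```

Unlike the step-by-step version in the REPL, this sketch returns a new dictionary rather than reassigning keys in place, and it also strips a source value that isn't blank. Both are design choices you may or may not want in your own version.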
03:30 And all that’s left now is the review and wrap up. See you there.