Resource mentioned in this lesson: A Practical Introduction to Web Scraping in Python
Avoiding Common Character Removal Mistakes
00:00 Let’s look at some of the most common character removal mistakes people make. The first is just confusing these different string methods and their behavior.
00:08 The next is a little subtle: misunderstanding how .strip() handles whitespace. And then this is so common we’re bringing it up one more time.
00:16 Assuming the .strip() method removes specific character sequences. First, let’s quickly review the differences between the string methods.
00:26 In this table you have an example string: --evidence.doc--. The result column illustrates what is returned when you call each of the five character removal methods and pass in a single dash.
00:40 Calling .strip() results in evidence.doc, removing all leading and trailing dashes. .lstrip() leaves the dashes at the end: evidence.doc--. .rstrip() does the opposite, resulting in --evidence.doc.
00:56 .removeprefix() removes exactly one dash from the start of the string, leaving you with -evidence.doc--, and .removesuffix() only strips the final dash: --evidence.doc-. To investigate the other common mistakes, pop open the REPL.
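As a rough sketch of the table described above, the five calls can be tried side by side. The variable name filename and the results dictionary are only illustrative and aren't part of the lesson, and .removeprefix() and .removesuffix() assume Python 3.9 or later.

```python
# Illustrative name; the lesson only shows the string itself.
filename = "--evidence.doc--"

# Each method is called with a single dash, as in the table.
results = {
    ".strip('-')": filename.strip("-"),                # all dashes on both ends
    ".lstrip('-')": filename.lstrip("-"),              # all leading dashes
    ".rstrip('-')": filename.rstrip("-"),              # all trailing dashes
    ".removeprefix('-')": filename.removeprefix("-"),  # exactly one leading dash
    ".removesuffix('-')": filename.removesuffix("-"),  # exactly one trailing dash
}

for call, result in results.items():
    print(f"{call:<20} {result!r}")
```

The strip family treats its argument as a set of characters to keep removing, while the remove family takes away one exact substring at most once.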
01:15 For this example, you’ll need a string with lots of space to work with. spacy_text = a string with the contents Little , green men, with varying amounts of spaces surrounding each word and punctuation mark.
01:30 Try calling spacy_text.strip() and see what happens.
01:35 The leading and trailing whitespace is removed, but not the extra spaces in between. So if you have text with this kind of extra whitespace, .strip() probably isn’t the right tool.
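Here is a minimal sketch of that behavior. The exact spacing in spacy_text, and the trailing period, are assumptions, since the transcript doesn't spell out the string character by character.

```python
# Assumed spacing and trailing period; the lesson's exact string may differ.
spacy_text = "   Little  ,   green   men  .  "

stripped = spacy_text.strip()

# repr() makes the remaining interior spaces visible.
print(repr(stripped))
```

Only the two ends of the string change. The runs of spaces between the words are left exactly as they were.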
01:45 What you could do instead is use the .split() method of strings. By default, it will split the string on whitespace and return a list of the resulting parts.
01:53 Create a new variable words and have it store the results of calling spacy_text.split(): words = spacy_text.split(). Look at words.
02:06 A list of the five strings: Little, green men. From here, you can use the .join() method of strings to combine the list of strings into a single string.
02:16 Starting with a string containing a single space, call .join(), passing in words. This is better. It’s not quite natural language because there are still a couple of stray spaces left, but you’ve managed to reduce each run of interior whitespace to a single space.
02:31 Alternatively, you could use the .replace() method, which takes two arguments: the substring to be replaced and the replacement value. So if you try this: spacy_text.replace(), passing in a space and an empty string, you get, well, Little,greenmen. Depending on your specific use case, you can decide which of these two approaches would be more effective.
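For comparison, a short sketch of the .replace() approach, using the same assumed stand-in string. The name squashed is illustrative.

```python
# Assumed stand-in for the lesson's string.
spacy_text = "   Little  ,   green   men  .  "

# Every space character is replaced with nothing,
# including the spaces that separated the words.
squashed = spacy_text.replace(" ", "")
print(squashed)  # Little,greenmen.
```

This removes every occurrence of the space character, so it fits cases where no spaces should survive at all, such as cleaning up an identifier or a phone number.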
02:54 And there’s another edge case you might encounter with .strip()’s default removal of whitespace. While it covers the most common whitespace characters, there are some invisible Unicode characters that you might find while scraping a web page or reading a PDF file that won’t be removed.
03:11 Define text as a string with the word Spooky and a couple copies of \u200b on either side. That’s the escape code for Unicode zero-width space, by the way. If you print the text in the REPL, you’ll see these characters represented as spaces.
03:29 And if you try calling text.strip(), those spaces haven’t gone anywhere. Spooky indeed. Instead, you’ll have to call .strip() passing in that particular escape code: text.strip("\u200b").
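A small sketch of this edge case follows. Using two copies of \u200b on each side is an assumption based on "a couple copies", and cleaned is an illustrative name.

```python
# Two zero-width spaces on each side (assumed count).
text = "\u200b\u200bSpooky\u200b\u200b"

# Python does not classify U+200B as whitespace,
# so the default .strip() leaves it alone.
print("\u200b".isspace())  # False
print(len(text.strip()))   # 10

# Passing the character explicitly removes it from both ends.
cleaned = text.strip("\u200b")
print(len(cleaned))        # 6
```

Checking len() or repr() is a handy way to confirm the result, since the characters themselves may not be visible when printed.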
03:47 And one more example. You’re scraping a web page for some information about paranormal phenomena, and you want to extract the text from an HTML tag.
03:57 html_tag = the string "<title>tunguska event</title>".
04:04 If you try to use the .strip() method here: html_tag.strip("<title>").
04:12 Whoops. You get unguska event</, with the leading t gone and part of the closing tag left behind. As you’ve seen before, this approach isn’t going to work. But a nice one-line alternative is to chain together calls of .removeprefix() and .removesuffix(): html_tag.removeprefix("<title>").removesuffix("</title>") and perfect.
04:39 Just the text tunguska event. The methods removed exactly the strings passed to them, nothing more. And because they each return a new string, they can be chained together.
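Both attempts can be compared in a short sketch. The names stripped and title are illustrative, and the chained call assumes Python 3.9 or later.

```python
html_tag = "<title>tunguska event</title>"

# .strip() treats "<title>" as a set of characters (<, t, i, l, e, >)
# and keeps removing any of them from both ends.
stripped = html_tag.strip("<title>")
print(stripped)  # unguska event</

# .removeprefix() and .removesuffix() each remove one exact substring
# and return a new string, so the calls can be chained.
title = html_tag.removeprefix("<title>").removesuffix("</title>")
print(title)     # tunguska event
```

The first result loses the t of tunguska because t is in the character set, and it stops at the slash because / is not.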
04:47 Another common and versatile pattern for Python string manipulation. And if this got you curious about web scraping, why not check out A Practical Introduction to Web Scraping in Python? Or maybe just save a bookmark because in the next lesson, you’ll apply what you’ve learned by working with a realistic example of using string methods to clean messy data.