Use re.sub()
00:00
Create a new file named transcript_re_sub.py. First, you need to import the re module. Then you can paste the transcript from before, and again you’re using transcript as the variable name, with a triple-quoted string that contains the chat transcript. Next, create a list of regex replacements.
00:21
The regex_replacements list should contain tuples. Each tuple will contain the regex pattern and the replacement.
00:28
The first regex pattern is r"blast\w*". This pattern makes use of the "\w" token, which will match alphanumeric characters and underscores.
00:41
That means all letters, digits, and underscores will be matched. Adding the asterisk quantifier (*) directly after it will match zero or more characters of the "\w" pattern. Also, note two things.
00:55
The first is the r in front of the string. An r before a string indicates that this string is a raw string. This means that backslashes within the string will be treated as literal backslashes rather than as escape characters.
01:09
That’s important because regular expressions often use backslashes. The other thing to note is that you write "blast" in lowercase. With the way you’ll use re.sub() in a moment, you don’t have to worry about whether a string is lowercase, uppercase, or mixed case. As a replacement, add the huffing emoji again.
01:30
The second regex pattern uses a character set and a quantifier to replace the timestamp. You use an extended character set of " [-T:+\d]" to match all the possible characters that you might find in the timestamp. Namely, these are a dash, the uppercase T, the colon, the plus sign, and any digit character. That’s what the "\d" stands for.
01:57
You put all of them in square brackets to make them a character set. You pair this character set with the quantifier "{25}", which matches exactly twenty-five characters from the set. This will match any possible timestamp, at least until the year ten thousand, which is good enough for us right now.
02:13 This timestamp regex pattern allows you to select any possible date in the timestamp format and replace it with an empty string. Don’t forget to add a space right before the square brackets of your timestamp character set.
02:25 That way, you make sure to also replace the space that’s in between the usernames and the timestamp.
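Here is a small sketch of the timestamp pattern at work. The sample line is an assumption about what a transcript line looks like, with an ISO-style timestamp that is exactly 25 characters long.

```python
import re

# Hypothetical transcript line, only for illustration.
line = "[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?"

# A leading space, then exactly 25 characters from the set: -, T, :, +, or a digit.
timestamp_pattern = r" [-T:+\d]{25}"

timestamp_length = len("2022-08-24T10:02:23+00:00")
cleaned = re.sub(timestamp_pattern, "", line)

print(timestamp_length)  # 25
print(cleaned)           # [support_tom] : What can I help you with?
```

Because the pattern starts with a space, the space between the username and the timestamp disappears together with the timestamp.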
02:32
The third regex pattern is used to select any user string that starts with the keyword "support".
02:38
This pattern catches "support_tom" or "support_frieda". Note that you escape the square brackets because otherwise the keyword would be interpreted as a character set. And in this case, you want the square brackets to literally match.
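The difference that escaping makes can be sketched like this. The pattern r"\[support\w*\]" is an assumption based on the description above, and the sample line is made up.

```python
import re

# Hypothetical sample line, only for illustration.
line = "[support_tom] : What can I help you with?"

# Escaped brackets match a literal "[" and "]" around the username.
escaped = re.findall(r"\[support\w*\]", line)

# Unescaped, the brackets form a character set that matches one character at a time.
unescaped = re.findall(r"[support\w*]", line)

print(escaped)        # ['[support_tom]']
print(unescaped[:3])  # ['s', 'u', 'p']
```

Without the backslashes, the pattern no longer finds the username as a whole. It finds single characters instead.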
02:54
You replace the found pattern with the string "Agent ", including the trailing space. That way, the agent string will have the same number of characters as the client string. I know that’s a bit hacky, but it will help you to align the columns after the strings "Agent " and "Client" in your sanitized output. And finally, the last regex pattern.
03:13
This one selects the client’s username string and replaces it with "Client". Again, you’re using backslashes to escape the square brackets. And that’s your regex replacement list. Compared to the string replace script from an earlier lesson, you can now replace all variations of the swear word by using just one replacement tuple. Similarly, you are only using one regex for the full timestamp. That’s a big improvement.
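Put together, the replacement list might look roughly like the sketch below. The client username johndoe, the exact support pattern, and the choice of emoji are assumptions for illustration, because the transcript text doesn’t spell them out.

```python
# A sketch of the replacement list: each tuple is (regex pattern, replacement).
regex_replacements = [
    (r"blast\w*", "😤"),             # any word starting with "blast"; emoji is assumed
    (r" [-T:+\d]{25}", ""),          # a space plus the 25-character timestamp
    (r"\[support\w*\]", "Agent "),   # assumed pattern for usernames like [support_tom]
    (r"\[johndoe\]", "Client"),      # hypothetical client username
]

# "Agent " with its trailing space is as wide as "Client", which keeps columns aligned.
same_width = len("Agent ") == len("Client")
print(same_width)  # True
```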
03:39 Next, you need to loop through the replacement tuples.
03:43
for old, new in regex_replacements:
03:48
Inside the loop, you write transcript = re.sub(). The first argument is old, that’s the regex pattern. The second is new, that’s the replacement string. And the third argument is the string that you want to check, so in this case, you pass in transcript. And then you also add the optional argument flags and set it to re.IGNORECASE.
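To see what the flags argument changes, here is a minimal sketch that runs the same substitution with and without re.IGNORECASE. The sample text and the "#" replacement are made up for illustration.

```python
import re

# Hypothetical sample text, only for illustration.
text = "Blast! BLASTED thing, blast it."

# With flags=re.IGNORECASE, every capitalization of the word is replaced.
by_keyword = re.sub(r"blast\w*", "#", text, flags=re.IGNORECASE)

# Without the flag, only the lowercase occurrence matches.
case_sensitive = re.sub(r"blast\w*", "#", text)

print(by_keyword)      # #! # thing, # it.
print(case_sensitive)  # Blast! BLASTED thing, # it.
```

Note that flags is passed as a keyword argument. The fourth positional parameter of re.sub() is count, so passing the flag positionally would limit the number of replacements instead of changing how the pattern matches.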
04:08
IGNORECASE is one word, and you write it in uppercase. The re.IGNORECASE flag makes the pattern case-insensitive. So now any substring that starts with "blast", regardless of capitalization, will be matched and replaced. And finally, you print the transcript again. Okay, let’s see if the script works.
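The whole script could look something like the following sketch. The transcript text, the client username johndoe, and the emoji are stand-ins chosen for illustration, since the actual data from the earlier lesson isn’t reproduced here.

```python
import re

# Assumed sample transcript, not the course's actual data.
transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

# Each tuple is (regex pattern, replacement); the specific patterns are assumptions.
regex_replacements = [
    (r"blast\w*", "😤"),
    (r" [-T:+\d]{25}", ""),
    (r"\[support\w*\]", "Agent "),
    (r"\[johndoe\]", "Client"),
]

# Apply every replacement in turn, ignoring case for all patterns.
for old, new in regex_replacements:
    transcript = re.sub(old, new, transcript, flags=re.IGNORECASE)

print(transcript)
```

With this sample data, the printed lines start with "Agent " or "Client", the timestamps are gone, and both "BLASTED" and "Blast" are replaced by the emoji.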
04:30
To run the script in a terminal, you type python, a space, and then the name of the script, which is transcript_re_sub.py. And then you press Enter. Okay, that looks good.
04:43 Your transcript has been completely sanitized, with all noise removed. The script you just created is a great starting point for adding more regex rules and patterns if you receive new transcripts over time. But for now, I would say you’re good. You did a really good job cleaning the transcript.
05:00 So let’s wrap things up in the next lesson, where you’ll recap what you learned in this course, and I’ll share some additional resources so you can deepen your knowledge even more.