Diving Into Advanced Splitting Techniques
00:00 Now, sometimes you need to split a string not by one character, or by several characters that always appear together, but by something more flexible. Maybe the delimiter is any one of a few characters you’ve defined, or a pattern that isn’t fixed at all: sometimes a comma, sometimes a semicolon or a question mark.
00:18 That’s where the re.split() function comes in. It works like the regular .split() method, but it lets you use a regular expression as the separator.
00:27 The syntax is straightforward. You import the re module, then call re.split() and pass it the pattern and the string. If you’re confused about how it works, don’t worry.
00:39 We’ll see an example on the next slide. So when do you actually use re.split() instead of the regular string .split() method?
00:48 You use it when the delimiter is not fixed, like when the delimiter can be any digit, punctuation mark, or whitespace character. Some common scenarios are complex log parsing, sentence segmentation, and text processing in NLP.
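As a minimal sketch of that syntax, here’s re.split() with a delimiter that isn’t fixed. The text string and the pattern are made up for illustration and don’t come from the slides:

```python
import re

# Hypothetical input: items separated by a comma or a semicolon,
# each optionally followed by spaces.
text = "red, green;blue; yellow"

# re.split(pattern, string): the pattern comes first, then the string.
colors = re.split(r"[,;]\s*", text)
print(colors)  # ['red', 'green', 'blue', 'yellow']
```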
01:02 Now let’s look at an example.
01:05 Let’s say you have a string that you need to split on any digit. Sometimes the digit is one, sometimes it’s two, and sometimes it’s three. You can’t use the regular string .split() method here because it accepts only one delimiter: either one, or two, or three.
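To see the limitation, here’s a small sketch with a made-up string (the exact string from the slide isn’t reproduced here). With .split(), you can pick only one fixed delimiter, so the other digits stay in the result:

```python
# Hypothetical string with digits between the words.
sample = "1apple2banana3cherry"

# str.split() takes a single fixed separator, here "1".
only_one = sample.split("1")
print(only_one)  # ['', 'apple2banana3cherry']
```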
01:23 So we need to use re.split() to split the string. First, we need to import the re module,
01:31 and then we just call re.split().
01:36 First, it’s good practice to type r before the opening quote, which tells Python to treat this as a raw string, so backslashes aren’t interpreted as escape characters. Then we add our regular expression.
01:46 The \d means any digit, so it matches any single digit from zero to nine. But our delimiter can also be eleven, twelve, thirteen, or fourteen, so we add a plus.
01:58 This means one or more of the preceding element. In our case, that’s one or more digits. Then we just pass our string, items.
02:10 Now let’s call print() to see what item_list contains.
02:16 There’s an empty string at the start because the original string began with a digit. If you want to remove the empty string, you can just slice item_list, starting from index one through the end, and now you have your list separated by digits.
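Put together, the steps above might look like the following sketch. The value of items is an assumption, since the transcript doesn’t show the exact string:

```python
import re

# Hypothetical string that starts with a digit and uses
# one- and two-digit numbers as delimiters.
items = "1apple12banana13cherry14date"

# \d alone matches a single digit, so "12" yields an empty string between matches.
single = re.split(r"\d", items)

# \d+ matches one or more digits, so "12" counts as one delimiter.
item_list = re.split(r"\d+", items)
print(item_list)  # ['', 'apple', 'banana', 'cherry', 'date']

# Slice from index 1 to drop the leading empty string.
item_list = item_list[1:]
print(item_list)  # ['apple', 'banana', 'cherry', 'date']
```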
02:34 Let’s take a look at another example.
02:37 In this example, you have a sentence that contains an exclamation mark, a question mark, a comma, and then a period. This kind of split comes up mostly in NLP tasks.
02:48 So we want to split it into “Wait!” and then “Are you serious?” then “Yes,” and then “absolutely.” So any part that comes before a question mark is a question.
02:57 So how can you do this? Again, like earlier, let’s import the re module. We’ll store the result in the parts variable.
03:06 Let’s call re.split(). This time, we want any of the following characters to act as the delimiter: a comma, a question mark, a period, or an exclamation mark.
03:21 And then we pass our sentence.
03:25 Now let’s print parts to see how it’s split.
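Here’s a sketch of this example. The sentence is reconstructed from the parts named earlier, and the character class is one way to write “any of these characters”, not necessarily the exact pattern typed in the video:

```python
import re

# Sentence reconstructed from the transcript; treat it as an assumption.
sentence = "Wait! Are you serious? Yes, absolutely."

# Inside square brackets, each character is a literal alternative.
parts = re.split(r"[,?.!]", sentence)
print(parts)  # ['Wait', ' Are you serious', ' Yes', ' absolutely', '']
```

Note that the pieces keep their leading spaces, and there’s an empty string at the end because the sentence ends with a delimiter.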
03:29 And as you can see, we have the parts of the sentence split by any punctuation mark. Now if you’re new to regular expressions or want a deeper understanding of the patterns we are using here, you can check out the Regex course I’ve linked in the additional resources.
03:46 It’ll walk you through the core concepts step by step.