Use Sets to Remove Duplicates
00:00
Yes, exactly. So, it’s like the length of the markers seems to be something that we should parametrize in our code. So I guess let’s do that. I’ll introduce a marker_length
variable here that we can then switch between part one and part two by changing this one between 4
and 14
.
00:17 Okay. And then I’m going to think a little bit about what we’re looking for here, are sort of like duplicate characters or technically where there are no duplicate characters.
00:29 And Python has a built-in data structure that’s really nice to work with when we’re dealing with duplicates, namely sets.
00:36 So before moving on, let’s just show how sets work here. So if I’m going to just print this part, and then I’ll just comment out the rest of this for now.
00:52 So now my code here, I’m using the string that we have on top here, and I’m just converting this one into a set and then printing it out. And what we may be able to see here—it’s not completely obvious, I guess, since there’s so many characters—is that there are no duplicates any longer in this.
01:12
So you can see the, the order is kind of changed around a little bit, but for instance, there is a j
, which we have j
in the second and the fourth character, but in our set, there’s only one of the j
s.
01:25 What we can start doing now is that we can convert our small substrings of the string into sets and then see if the set means that some characters disappear essentially.
01:36 So since it removes duplicates, we can use that as a test for whether there are duplicates, okay, in our character sets. So as an example of this, let’s say that we’re looking for a, let’s just call it a candidate.
01:52
And I’ll say that this will start here by just picking out the, okay, let’s use marker_length
here since we have it. So here I’m just pulling out the first four characters for the marker_length
for first four characters from the candidate.
02:08 And then I’m going to print out
02:13
both candidate
and the set of candidate
so that we can see what happens. Okay. And what you were doing there with stream
and the square brackets and the colon is you’re slicing the string from the stream
variable, which is a string from index zero to marker_length
, which is four.
02:35
So if we look at candidate
that’s down here, that’s just now mjqj
, and then I look at the set of candidate
, and that’s then the letters m
, q
, and j
.
02:46
So you can see that the duplicated j
has disappeared.
02:51
The important observation here is now that the length of my set of candidate
is less than the marker_length
. It’s less than four, so that means that there has to be a duplicate somewhere here.
03:02 Okay. Yeah, because if it would be four unique characters, then the length of the string would be the same length as the set. Yes, exactly.
Become a Member to join the conversation.