Dear Christopher, thank you so much for creating yet another extremely useful, high-quality tutorial. Learning regex with a course like this one is so much easier than with the official Python documentation! I am playing with regex now and I see that the character class syntax [0-9] works per digit and within the 0-9 range only. For example, I am trying to write a piece of code that parses the text in a file, and before tokenizing, tagging, and POS-ing it I want to make sure that I am actually working with the relevant corpus only, i.e. the text files whose content meets certain requirements.

The following regex:

IPC\sclass:\t[A-H][0-99][A-Z][0-99]/

Does not want to match the following two strings:

IPC class:    B65D85/804

However, when I modify the character class ranges above it works, but looks bulky:

IPC\sclass:\t[A-H][0-9][0-9][A-Z][0-9][0-9]/

Is there a more concise way of constructing character ranges that match two-digit numbers, in order to avoid typing [0-9] multiple times? If the answer is yes, is it possible to use subranges with such two-digit character ranges, e.g. for numbers from 10 to 70?

Thank you for your comment.

Christopher Trudeau RP Team on Feb. 7, 2021

Hi @DoubleA,

Glad you’re enjoying the course. Regexes are entirely string-based, so the concept of a number that you’re after doesn’t work: the regex just sees the characters 0, 1, 2, etc. It doesn’t understand number ranges like “10-70”. A character class such as [0-99] still matches exactly one character, so it is the same as [0-9].
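
As a rough illustration of the point above, here is a minimal sketch using Python's re module. The pattern name TEN_TO_SEVENTY and the helper in_range are hypothetical, made up for this example. It spells out "10 to 70" as text patterns, since that is the only way a regex can express it.

```python
import re

# A character class always matches exactly one character, so the
# repeated 9 in [0-99] adds nothing: it is the same set as [0-9].
single_digit = re.compile(r"[0-99]")
print(single_digit.fullmatch("7") is not None)   # True
print(single_digit.fullmatch("65") is not None)  # False

# Hypothetical pattern for the integers 10 through 70, written as text:
# a tens digit 1-6 followed by any digit, or the literal string "70".
TEN_TO_SEVENTY = re.compile(r"(?:[1-6][0-9]|70)")


def in_range(text):
    """Return True if text is a two-digit string from 10 to 70."""
    return TEN_TO_SEVENTY.fullmatch(text) is not None
```

For anything more involved than this, it is probably simpler to capture the digits with the regex and compare them as integers in ordinary Python code.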

That being said, there are some shortcuts that could make things a bit less cumbersome. The \d character is a short form for “digits” and there are ways of saying “one or more”.

This would match your example string:

/IPC class:\s[A-H]\d+\w\d+\/\d+/

The \s is for whitespace and the \w is for word characters (letters, digits, and the underscore). The “+” says “one or more”.

The problem with this particular match is it would also apply to “B678D678/804548393” because the “+” is greedy and has no upper limit, so it will match more than just two digits.
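
To see this in Python, here is a small sketch. It assumes the label and the code are separated by a single tab, as the \t in the original attempt suggests, and it drops the surrounding slashes because Python's re module does not use them as delimiters. The names LOOSE, good, and too_long are made up for the example.

```python
import re

# The "one or more" pattern from above, without the / delimiters.
LOOSE = re.compile(r"IPC class:\s[A-H]\d+\w\d+/\d+")

# Assumption: a single tab sits between the label and the code.
good = "IPC class:\tB65D85/804"
too_long = "IPC class:\tB678D678/804548393"

print(LOOSE.fullmatch(good) is not None)      # True
print(LOOSE.fullmatch(too_long) is not None)  # True, which is not wanted
```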

You can get more granular and be specific with the number of digits matched:

/IPC class:\s[A-H]\d{1,2}\w\d{1,2}\/\d{3}/

In this case the {m,n} notation means from m to n repetitions, with the n being optional: {m} on its own means exactly m. The first two uses in the example above match 1 or 2 digits only, the third matches exactly 3 digits. The {} notation is covered in the lesson on Quantifiers.
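
Here is a minimal sketch of the same pattern in Python, again assuming a single tab after the label and leaving out the slash delimiters. STRICT and is_ipc_line are hypothetical names for this example.

```python
import re

# Bounded quantifiers: 1-2 digits, one word character, 1-2 digits,
# a literal slash, then exactly 3 digits.
STRICT = re.compile(r"IPC class:\s[A-H]\d{1,2}\w\d{1,2}/\d{3}")


def is_ipc_line(line):
    """Return True if the whole line matches the stricter pattern."""
    return STRICT.fullmatch(line) is not None


print(is_ipc_line("IPC class:\tB65D85/804"))          # True
print(is_ipc_line("IPC class:\tB678D678/804548393"))  # False
```

Using fullmatch() rather than search() means that trailing characters after the three digits also cause a rejection.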

Of course, that isn’t any easier to read than the original, but it is very precise.

If you’re writing code with complicated regexes like this, I would definitely encourage you to look into the VERBOSE flag, which allows you to embed comments in the regex. That makes it far easier to remember what the regex does when you come back to it later. VERBOSE is covered in the lesson on Flags.
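
As a rough sketch of what that could look like for this pattern (IPC_LINE is a hypothetical name, and the meaning given to each part in the comments is only my reading of the sample string):

```python
import re

# With re.VERBOSE, literal whitespace in the pattern is ignored and
# everything after a # is a comment, so the space in the label has
# to be written as an escape.
IPC_LINE = re.compile(
    r"""
    IPC\sclass:   # literal label
    \s            # one whitespace character (a tab in the sample)
    [A-H]         # one letter from A to H
    \d{1,2}       # one or two digits
    \w            # one word character
    \d{1,2}       # one or two digits
    /             # literal slash
    \d{3}         # exactly three digits
    """,
    re.VERBOSE,
)

print(IPC_LINE.fullmatch("IPC class:\tB65D85/804") is not None)  # True
```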

Happy pythoning!

DoubleA on Feb. 7, 2021

Hi Christopher,

Thank you so much for the detailed explanation. Indeed, I’ve completely forgotten about the {m,n} notation, which in the present case, appears to be the best candidate to use. And yes, the + operator can sometimes bring in very unexpected hits, so I try to avoid using it wherever possible.

The present course is by far the best course on Python regex I’ve come across. Thank you for the great work!
