Regular expressions are very useful tools to verify and parse textual data into structured information. They allow you to quickly identify specific pieces of data. In this blog post, we’ll be focusing on the use of regex groups in Python.
Consider, for instance, a program that takes a phone number as input and extracts specific parts of it.
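A minimal sketch of such a program is shown here. The sample phone number and the variable names "content" and "pattern" are assumptions made for illustration; only "parts" is named in the text.

```python
import re

# Assumed sample input: a ten-digit phone number with no separators.
content = "1234567890"

# Three consecutive digits.
pattern = r"\d{3}"

# re.search() returns a match object for the first match (or None if there is none).
parts = re.search(pattern, content)

# group(0) is the full text that matched the pattern.
print(parts.group(0))
```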
The pattern says “3 occurrences of \d”, where “\d” refers to a digit. The re.search() function from the regular expression (re) module searches the content string for this pattern and returns a match object, which we store in the variable called “parts”. We print parts.group(0), which is the full part of the content string that matched the pattern. When you run the program, it prints the first three consecutive digits found in the content string, because that is the first part of the string that matches the pattern.
We can generalize this so that the pattern covers the whole phone number.
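A sketch of the generalized program, again assuming a ten-digit sample number with no separators:

```python
import re

# Assumed sample input, as before.
content = "1234567890"

# 3 digits, then 3 digits, then 4 digits.
pattern = r"\d{3}\d{3}\d{4}"

parts = re.search(pattern, content)

# The full match now spans the whole phone number.
print(parts.group(0))
```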
Here the pattern says 3 digits followed by 3 digits followed by 4 digits. When we run this, parts.group(0) contains the whole phone number rather than just its first three digits.
Sometimes we wish to extract a specific part of the string that is matched, not the whole string. This is where Python’s regex groups come in! They require only a very small update to the previous program.
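One way the updated program might look, keeping the same assumed sample number:

```python
import re

# Assumed sample input, as before.
content = "1234567890"

# Same pattern, but each part is wrapped in parentheses to form a group.
pattern = r"(\d{3})(\d{3})(\d{4})"

parts = re.search(pattern, content)

print(parts.group(0))  # full match
print(parts.group(1))  # first group: the first 3 digits
print(parts.group(2))  # second group: the next 3 digits
print(parts.group(3))  # third group: the last 4 digits
```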
Note that we have updated the pattern so that each part of the pattern is enclosed in parentheses. Each such part is called a “group”, and when re.search() is run it returns not just the full match (in group(0)) but also each separate group, in subsequent indices. Thus the first \d{3} is in parts.group(1), the second \d{3} is in parts.group(2), and the \d{4} is in parts.group(3). If we run this program, it prints the full match followed by each of the three parts.
Note that the full match is still in group(0), but now in addition we get each part. In this way we can extract the area code of a phone number and its other parts systematically.
We can improve this program even further, because some people put brackets around the area code and some people put a hyphen (“-”) between 456 and 7890. This is not universally true, but we wish to accommodate such cases. Thus, we can update our pattern to make these extra characters optional.
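A sketch of one pattern that accepts these variations follows. The helper name split_phone, the sample inputs, and the “[\s-]*” separator (which allows spaces or a hyphen before the last four digits) are assumptions made for this example, not necessarily the exact pattern described next.

```python
import re

# Optional bracket, 3 digits (group 1), optional bracket, optional spaces,
# 3 digits (group 2), optional spaces or hyphens, 4 digits (group 3).
PHONE_PATTERN = r"[()]*(\d{3})[()]*\s*(\d{3})[\s-]*(\d{4})"


def split_phone(content):
    """Return (area code, next 3 digits, last 4 digits), or None if no match."""
    parts = re.search(PHONE_PATTERN, content)
    if parts is None:
        return None
    return parts.group(1), parts.group(2), parts.group(3)


# Hypothetical inputs written in three different styles.
for content in ["(123) 456-7890", "123 456 7890", "1234567890"]:
    print(content, "->", split_phone(content))
```

Because only the digit parts are inside parentheses, the brackets, spaces, and hyphen are matched but not captured, so all three inputs yield the same three groups.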
Note the rather complicated regular expression pattern. It begins with an optional bracket (the “[()]*” part), then the 3 digits of the area code, then another optional bracket, then zero or more spaces, then sequences of 3 digits and 4 digits separated by zero or more spaces. Note also that the grouping parentheses are only around the digit parts, because those are all we wish to extract; the optional brackets and spaces are matched but not captured.
Thus, we are able to extract the area code in parts.group(1) and the remaining two numbers in parts.group(2) and parts.group(3). Once again, the full match is in parts.group(0).
In summary, Python regex groups are another powerful tool for writing complex regex patterns. They allow you to isolate specific parts of a pattern so that you can refer to them later in your code. To create a group, you simply wrap the desired part of the pattern in parentheses ( ). Each group will have a numerical index starting with 1 and increasing by one for each additional group in your pattern.
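As a small illustration of this numbering, the match object's groups() method returns all the numbered groups (1, 2, 3, ...) at once as a tuple. The sketch below also checks for None, which is what re.search() returns when nothing matches; the sample strings and variable names are assumptions.

```python
import re

pattern = r"(\d{3})(\d{3})(\d{4})"

parts = re.search(pattern, "1234567890")
if parts is not None:
    # groups() collects group(1), group(2), group(3) into a tuple.
    print(parts.groups())
    area_code, prefix, line_number = parts.groups()
    print(area_code)

# With no match, re.search() returns None, so calling group() on it would fail.
no_match = re.search(pattern, "no digits here")
print(no_match)
```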
A great use case for regex groups is in parsing data and manipulating it in your program (just like we have done here, with parsing telephone numbers). By using groups in your regex patterns, you can quickly extract key pieces of information from large datasets with only minimal coding effort.
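To give a flavor of this, the sketch below pulls the area code out of every phone number in a block of text using re.finditer(), which yields one match object per match. The names, the numbers, and the pattern are all made up for illustration.

```python
import re

# Hypothetical data: one contact per line, with numbers written in different styles.
text = """Alice: (123) 456-7890
Bob: 987 654 3210
Carol: 5551234567"""

# Assumed pattern: optional brackets, spaces, or a hyphen around three digit groups.
pattern = r"[()]*(\d{3})[()]*\s*(\d{3})[\s-]*(\d{4})"

# Collect group(1), the area code, from every match in the text.
area_codes = [match.group(1) for match in re.finditer(pattern, text)]
print(area_codes)
```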
If you liked this blogpost, see our blogpost on the Python regex split() function which shows another way to extract specific fields by specifying delimiters.
Kodeclik is an online coding academy for kids and teens to learn real-world programming. Kids are introduced to coding in a fun and exciting way and are challenged to higher levels with engaging, high-quality content.