Assume you have the string “The quick brown,fox.jumps over the lazy dog at 11:59PM.” and wish to split it into individual words. Note that the words are sometimes separated by spaces and sometimes by commas, periods, or colons. We wish to extract individual words like “The”, “quick”, “fox”, and so on.
Python’s re.split() function from the re (regular expression) module is just what you need! It takes a regular expression pattern and a string, treats every match of the pattern as a delimiter, and returns a list of the substrings found between those delimiters.
So here is our first attempt:
Here we first import the “re” module and then define our sentence mentioned above. The pattern is defined to be a single space character before calling “re.split”. The output is:
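A minimal sketch of this first attempt follows. The variable names `sentence`, `pattern`, and `words` are assumptions chosen for illustration; the list that gets printed is shown in the trailing comment.

```python
import re

# The example sentence from the start of the article.
sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = " "  # a single space is the only delimiter

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', 'brown,fox.jumps', 'over', 'the', 'lazy', 'dog', 'at', '11:59PM.']
```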
Note that because the pattern is a single space character, “brown,fox.jumps” is not split into individual words. If we wish to separate by comma instead, for instance, we can update the code to:
Now the output is:
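A sketch of the comma-only version, using the same illustrative variable names (assumptions, not necessarily the names in the original listing):

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = ","  # the comma is now the only delimiter

words = re.split(pattern, sentence)
print(words)
# ['The quick brown', 'fox.jumps over the lazy dog at 11:59PM.']
```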
This time, the comma is indeed used as a delimiter but we have lost the delimiting by space characters because we updated the pattern.
Can we use both space and comma to separate the words? You can! Let us try this:
The output is:
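One way this attempt might look, assuming the pattern was written as a space immediately followed by a comma (the exact pattern in the original listing is an assumption based on the explanation that follows):

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = " ,"  # literally "a space followed by a comma"

words = re.split(pattern, sentence)
print(words)
# ['The quick brown,fox.jumps over the lazy dog at 11:59PM.']
```

Because the pattern never matches, re.split() returns a list containing the whole sentence as its only element.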
Hmm.. This didn’t quite work. That is because re.split() is looking for the exact pattern, namely a space followed by a comma and there is no such occurrence. What you really want to do is:
Here the pipe character (“|”) separates the space and comma and the pattern is essentially saying that the delimiter character can be either of the two. The output is now what we were looking for:
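A sketch of the alternation pattern described here, with the same illustrative variable names as before:

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = " |,"  # a space OR a comma

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', 'brown', 'fox.jumps', 'over', 'the', 'lazy', 'dog', 'at', '11:59PM.']
```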
We can now update the code even more to include the colon as a delimiter, like so:
The output is:
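Adding the colon as a third alternative could look like this (a sketch; variable names are illustrative):

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = " |,|:"  # a space, a comma, OR a colon

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', 'brown', 'fox.jumps', 'over', 'the', 'lazy', 'dog', 'at', '11', '59PM.']
```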
Let us also add the period (‘.’) as a delimiter:
The output is:
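A sketch of the unescaped-period version. Rather than printing the whole list, this illustrative snippet summarizes it, since every element turns out to be an empty string:

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = " |,|:|."  # the bare "." is NOT a literal period here

words = re.split(pattern, sentence)

# Every character matched the pattern, so only empty strings remain.
print(set(words))
# {''}
print(len(words) == len(sentence) + 1)
# True
```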
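Wow - what happened? It turns out that an unescaped period (“.”) in a pattern matches any character (other than a newline, by default), so every character in the sentence was treated as a delimiter and only empty strings were left. You should instead “escape” the period with a backslash when specifying it inside a pattern, like so: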
This produces what we were looking for:
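A sketch of the escaped version. Writing the pattern as a raw string (the `r` prefix) is a choice made here so that the backslash reaches the regex engine unchanged; the original listing may have written it differently.

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = r" |,|:|\."  # "\." matches a literal period only

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'at', '11', '59PM', '']
```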
Note that there is an empty string at the end because the period separates the “59PM” from the end of the string. The above code can be made more succinct by doing:
Here all possible delimiting characters are specified inside square brackets. This is regular expression shorthand (a character class) for saying that any one of them can be used to split the sentence. Note also that we did not have to escape the period character here, because a period inside square brackets stands for itself. The output is still:
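A sketch of the character-class version, again with illustrative variable names:

```python
import re

sentence = "The quick brown,fox.jumps over the lazy dog at 11:59PM."
pattern = "[ ,:.]"  # any ONE of: space, comma, colon, period

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'at', '11', '59PM', '']
```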
Let us continue along this thread by introducing multiple spaces between some of the words and seeing what happens, for example:
Here we have deliberately added some spaces between “quick” and “brown”, and between “jumps” and “over”. The output is:
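A sketch of this experiment. The exact number of extra spaces in the original example is not known; three spaces in each of the two places is an assumption made here for illustration.

```python
import re

# Assumption: three spaces after "quick" and after "jumps".
sentence = "The quick   brown,fox.jumps   over the lazy dog at 11:59PM."
pattern = "[ ,:.]"  # each single delimiter character causes its own split

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', '', '', 'brown', 'fox', 'jumps', '', '', 'over', 'the', 'lazy', 'dog', 'at', '11', '59PM', '']
```

Each run of three spaces produces two empty strings, one for every gap between adjacent delimiters.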
In other words, the splitting works but there are extra empty strings created because of the multiple occurrences of the delimiter characters. One way to overcome this is to do:
Here we have added a “+” outside the square brackets which means “1 or more occurrences of”, i.e., 1 or more occurrences of any of the delimiters found inside the square brackets. Thus the delimiter can be a single space, multiple spaces, or a space with a colon, etc. The output now has no empty strings between the words, as we expected.
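A sketch of the “+” version, using the same assumed sentence with three spaces in two places:

```python
import re

# Assumption: three spaces after "quick" and after "jumps".
sentence = "The quick   brown,fox.jumps   over the lazy dog at 11:59PM."
pattern = "[ ,:.]+"  # one or more delimiter characters in a row

words = re.split(pattern, sentence)
print(words)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'at', '11', '59PM', '']
```

The trailing empty string is still present because the final period is followed by the end of the string. If it is unwanted, one option is to filter it out, for example with `[w for w in words if w]`.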
In this article, we have taken a look at how we can use Python's built-in re module to split strings using the re.split() function. We have also looked at some examples of how the pattern can be incrementally adjusted to split based on different ideas of what the delimiters should be.
To become proficient in re.split() you should have a good understanding of how to match different types of characters and the symbols used in delimiting patterns to represent them.
If you liked this blogpost, see our blogpost on Python regex groups which shows another way to extract fields by isolating specific parts of your pattern rather than by specifying delimiters.
Kodeclik is an online coding academy for kids and teens to learn real-world programming. Kids are introduced to coding in a fun and exciting way and are challenged to higher levels with engaging, high-quality content.