Kodeclik Blog
How to clean paragraphs using Python
Text cleaning is one of the most common preprocessing steps in data analysis, natural language processing, and web scraping. When you extract text from files, APIs, or websites, the result often includes unwanted extra spaces, line breaks, or inconsistent formatting. These formatting errors can make text harder to analyze or store. In this guide, you’ll learn several Python techniques to quickly remove unnecessary whitespace and ensure your paragraphs are clean, readable, and consistent.
When will you need to clean paragraphs?
You may need to clean paragraphs when dealing with text data that includes excessive spacing or new lines. This often happens when processing text scraped from websites, reading data from poorly formatted files, cleaning user-generated content with uneven spacing, or preparing text for machine learning models or databases. Messy text can affect readability and break downstream processing, so cleaning it is a common preprocessing step in Python.
Method 1: Use split() and join() methods together
The simplest way to remove extra spaces is by splitting a string into words and then joining them back with a single space.
text = "This  is   an example paragraph.\n\nIt has    extra spaces."
cleaned_text = ' '.join(text.split())
print(cleaned_text)
The output will be:
This is an example paragraph. It has extra spaces.
This method works by separating the text into tokens wherever whitespace occurs and then recombining them with single spaces. When called with no arguments, Python’s split() method treats any run of whitespace (spaces, tabs, and line breaks) as a single separator and ignores leading and trailing whitespace, so no empty tokens are produced. When the words are joined again with ' '.join(), the result is a clean, single-spaced paragraph that is easy to read and ready for processing.
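To see why calling split() without arguments matters, here is a small sketch that compares it with split(' '). The sample string and the variable names are made up for illustration.

```python
text = "  This  is\tan example.\n\nIt has   extra spaces. "

# No separator: any run of whitespace counts as one delimiter,
# and leading/trailing whitespace produces no empty tokens.
tokens = text.split()
print(tokens)

# Explicit ' ' separator: every single space is a delimiter, so
# repeated spaces yield empty strings and tabs/newlines are kept.
naive_tokens = text.split(' ')
print(naive_tokens)

cleaned = ' '.join(tokens)
print(cleaned)  # This is an example. It has extra spaces.
```

Joining naive_tokens with spaces would simply rebuild the original messy string, which is why the argument-free form is the one to use here.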
Method 2: Use regular expressions (re module)
Regular expressions can clean spaces and new lines more flexibly.
import re
text = "This is\tan example paragraph.\n\nIt has \t extra spaces."
# Replace each run of whitespace (spaces, tabs, new lines) with a single space
cleaned_text = re.sub(r'\s+', ' ', text).strip()
print(cleaned_text)
The output will be:
This is an example paragraph. It has extra spaces.
In this approach, the regular expression \s+ matches every run of one or more whitespace characters, including spaces, tabs, and newlines, and re.sub() replaces each run with a single space. The .strip() method is then applied to remove the single leading or trailing space that can remain after substitution. This method provides a concise and powerful way to standardize whitespace throughout the text, and the pattern can be adjusted when you need different rules for different kinds of whitespace.
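As one example of that flexibility, the sketch below wraps the same pattern in a reusable function and adds a variant that keeps paragraph breaks instead of merging everything into one line. The function names and the rule that a blank line marks a paragraph break are assumptions made for this illustration.

```python
import re

# Compile once so the pattern can be reused across many strings.
WHITESPACE_RUN = re.compile(r'\s+')

def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim the ends."""
    return WHITESPACE_RUN.sub(' ', text).strip()

def clean_keep_paragraphs(text):
    """Clean each paragraph separately, keeping one blank line between them."""
    # Assumption: a paragraph break is a newline, optional whitespace,
    # and then another newline.
    paragraphs = re.split(r'\n\s*\n', text)
    cleaned = [collapse_whitespace(p) for p in paragraphs]
    # Skip paragraphs that were empty after cleaning.
    return '\n\n'.join(p for p in cleaned if p)

sample = "  First   paragraph,\nwrapped line.\n\n\n  Second\tparagraph.  "
print(collapse_whitespace(sample))
print(clean_keep_paragraphs(sample))
```

The first call prints everything on one line, while the second prints the two cleaned paragraphs separated by a single blank line.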
Method 3: Use .splitlines() and list comprehension
If paragraphs contain extra blank lines, splitting by lines first can help.
text = """This is    an example paragraph.


It has extra spaces and blank lines."""
lines = [line.strip() for line in text.splitlines() if line.strip()]
cleaned_text = ' '.join(lines)
print(cleaned_text)
This method begins by splitting the input into individual lines, trimming whitespace from both ends of each line, and discarding any lines that are empty. A list comprehension handles this filtering and cleanup in a single step. Finally, the cleaned lines are rejoined with a single space, producing one paragraph without stray blank lines or leading and trailing whitespace. This approach is especially useful when dealing with block-style text or documents that include multiple line breaks.
The output will be:
This is    an example paragraph. It has extra spaces and blank lines.
Note that the output still has extra spaces before "an example paragraph". This is because strip(), as used in line.strip(), removes whitespace only from the beginning and end of each line, not between the words inside it, so internal runs of spaces are left untouched. To collapse those as well, you can pass each line (or the joined result) through ' '.join(line.split()) from Method 1 or the re.sub() call from Method 2.
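One possible way to handle both problems at once is to replace line.strip() with the split-and-join idiom inside the list comprehension. This is only a sketch, and the sample text is invented for the example.

```python
text = """This is    an example paragraph.


It has extra   spaces and blank lines."""

# line.split() returns an empty list for blank or whitespace-only lines,
# so it filters those out, and ' '.join(...) collapses internal spacing.
lines = [' '.join(line.split()) for line in text.splitlines() if line.split()]

cleaned_text = ' '.join(lines)    # everything merged into one paragraph
cleaned_lines = '\n'.join(lines)  # or keep one cleaned line per input line

print(cleaned_text)  # This is an example paragraph. It has extra spaces and blank lines.
print(cleaned_lines)
```

Joining with '\n' instead of ' ' is handy when the original line structure carries meaning, for example in addresses or verse.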
Summary
All three methods offer effective ways to clean paragraphs in Python, each suited to a slightly different text preprocessing need. The split() and join() combination is the most straightforward solution for general cleanup tasks, while regular expressions provide more flexibility for irregular whitespace patterns. The line-based method using .splitlines() is useful for removing empty lines and consolidating paragraph structures. Together, these techniques can help ensure your textual data is standardized and ready for analysis or display.
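To compare the three methods side by side, the sketch below runs each of them on the same input. The wrapper function names and the sample string are hypothetical and exist only for this comparison.

```python
import re

def clean_split_join(text):
    # Method 1: split on any whitespace, rejoin with single spaces.
    return ' '.join(text.split())

def clean_regex(text):
    # Method 2: replace each whitespace run with one space, then trim.
    return re.sub(r'\s+', ' ', text).strip()

def clean_splitlines(text):
    # Method 3: trim each line, drop empty lines, rejoin with spaces.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return ' '.join(lines)

sample = "  Line one  has   gaps.\n\n\tLine two.\n"
for cleaner in (clean_split_join, clean_regex, clean_splitlines):
    print(f"{cleaner.__name__}: {cleaner(sample)!r}")
```

On this input the first two functions return the same single-spaced string, while the third removes the blank line but leaves the extra spaces inside "Line one  has   gaps." in place.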