Kodeclik Blog

Finding consecutive letters in a Python string

Consider the string “UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight.” We wish to find sequences of consecutive letters occurring in this string, such as “UV” and “ABC” (but there are more!). Here is how we can do it!

Method 1: Use a for loop to find consecutive alphabetical occurrences

The first idea you can try is to use a for loop to iterate through the string:

sentence = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

# Find sequences of consecutive characters based on ASCII values
consecutive_sequences = []
current_sequence = sentence[0]  # Start with the first character

for i in range(1, len(sentence)):
    # Check if the ASCII value of the current character is one more than the previous character
    if ord(sentence[i]) == ord(sentence[i - 1]) + 1:
        current_sequence += sentence[i]  # Append to current sequence
    else:
        # If the current sequence is longer than 1 character, save it
        if len(current_sequence) > 1:
            consecutive_sequences.append(current_sequence)
        current_sequence = sentence[i]  # Start a new sequence

# Check if the last sequence is valid
if len(current_sequence) > 1:
    consecutive_sequences.append(current_sequence)

print(consecutive_sequences)

This program analyzes a text string by examining each character in sequence to find runs of consecutive ASCII characters. It starts with the first character and compares it with the next one, using the ord() function to get their ASCII values. When it finds two adjacent characters where the second character's ASCII value is exactly one more than the previous character's value (like 'A' followed by 'B', or 'r' followed by 's'), it builds a sequence by adding these characters together.

The program keeps extending such sequences as long as it finds consecutive ASCII values. When it encounters a character that breaks this pattern (i.e., its ASCII value isn't one more than the previous character), it checks if the sequence it has built so far is longer than one character. If it is, the sequence is added to a list of consecutive sequences, and then it starts building a new potential sequence beginning with the current character.

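After checking all characters, it performs one final check on the last sequence built to see if it should be included in the results. This process identifies patterns like "UV", "ABC", or "rst", where each letter follows the previous one in the ASCII table. It works for uppercase and lowercase letters alike, although a run cannot cross from one case to the other ('Z' and 'a' are not adjacent in ASCII). Because the test uses ord() alone, it would also report runs of digits or punctuation, such as "12", if the string contained them.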

For the above string, the output will be:

['UV', 'gh', 'stu', 'ABC', 'gh']

As we hinted, there are more consecutive letter examples than UV and ABC! Also note that ‘gh’ appears twice and is thus printed twice. If you so desire, you can de-duplicate the results so that it is printed only once.
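As a minimal sketch of that de-duplication step, the loop above can be wrapped in a function and its result passed through dict.fromkeys(), which keeps the first occurrence of each item in order. The function names consecutive_runs and unique_in_order are our own choices for illustration, and the empty-string guard is an addition that the original loop does not have.

```python
def consecutive_runs(text):
    """Return runs (length >= 2) of characters whose ASCII values rise by 1."""
    if not text:  # added guard: text[0] would fail on an empty string
        return []
    runs = []
    current = text[0]
    for i in range(1, len(text)):
        if ord(text[i]) == ord(text[i - 1]) + 1:
            current += text[i]  # extend the current run
        else:
            if len(current) > 1:
                runs.append(current)
            current = text[i]  # start a new run
    if len(current) > 1:
        runs.append(current)
    return runs


def unique_in_order(items):
    """Drop duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


sentence = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

runs = consecutive_runs(sentence)
print(runs)                   # ['UV', 'gh', 'stu', 'ABC', 'gh']
print(unique_in_order(runs))  # ['UV', 'gh', 'stu', 'ABC']
```

Unlike set(), dict.fromkeys() preserves the order in which the runs were found, so the output is the same on every run.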

Method 2: Use a list comprehension to find consecutive alphabetical occurrences

Here's a more concise and Pythonic version using list comprehension:

sentence = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

# Find consecutive sequences using list comprehension
consecutive_chars = [sentence[i:i+j] for i in range(len(sentence)) 
                    for j in range(2, len(sentence)-i+1) 
                    if all(ord(sentence[i+k]) == ord(sentence[i+k-1]) + 1 
                          for k in range(1, j))]

# Get unique sequences and sort by length
consecutive_sequences = sorted(set(consecutive_chars), key=len, reverse=True)

print(consecutive_sequences)

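This version uses a nested list comprehension that generates every substring of length two or more and keeps those that consist of consecutive ASCII characters. The all() function checks that every pair of adjacent characters in the substring has consecutive ASCII values. Then set() removes duplicate sequences like 'gh' that appeared twice in the original code. The sorted() function with key=len and reverse=True orders sequences from longest to shortest. Because it examines every substring, this version does far more work than the single loop in Method 1, which matters for long strings.

One possible output is shown below. Sets are unordered, so the order of sequences that have the same length can differ from run to run: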

['stu', 'ABC', 'st', 'BC', 'AB', 'UV', 'gh', 'tu']
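If a reproducible order is wanted, one option is to give sorted() a tie-breaker. The sketch below sorts by descending length and then alphabetically; the helper name sort_longest_first is hypothetical and the input set simply repeats the sequences listed above.

```python
def sort_longest_first(sequences):
    """Sort by length (longest first), breaking ties alphabetically."""
    return sorted(sequences, key=lambda s: (-len(s), s))


found = {"tu", "gh", "UV", "AB", "BC", "st", "ABC", "stu"}
print(sort_longest_first(found))
# ['ABC', 'stu', 'AB', 'BC', 'UV', 'gh', 'st', 'tu']
```

Uppercase letters sort before lowercase ones because their ASCII values are smaller, which is why 'ABC' now comes before 'stu'.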

Note that this version prints every substring made of consecutive letters (for example 'st', 'tu', and 'stu'), rather than just the longest run in each place. If you wish to avoid this, here is some updated code:

sentence = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

# Find all consecutive sequences and filter out shorter ones that are part of longer sequences
consecutive_chars = [sentence[i:i+j] for i in range(len(sentence)) 
                    for j in range(2, len(sentence)-i+1) 
                    if all(ord(sentence[i+k]) == ord(sentence[i+k-1]) + 1 
                          for k in range(1, j))]

# Filter out subsequences
final_sequences = []
for seq in sorted(consecutive_chars, key=len, reverse=True):
    if not any(seq in other_seq for other_seq in final_sequences):
        final_sequences.append(seq)

print(final_sequences)

This improved version first finds all possible consecutive sequences just like before, then sorts them by length in descending order, and only adds a sequence to the final list if it's not contained within any sequence that's already in the final list.

For example, once 'ABC' is in our final sequences, we won't add 'AB' or 'BC' because they're already part of 'ABC'. This gives us only the maximal consecutive sequences, such as 'ABC' and 'UV', without their shorter substrings.

The output is:

['stu', 'ABC', 'UV', 'gh']
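The maximal runs can also be found directly, without generating every substring first. The sketch below is a different technique from the list comprehension above: it uses itertools.groupby() with the key ord(ch) - index, which stays constant exactly while the ASCII values rise by one per position. The function name maximal_runs is our own.

```python
from itertools import groupby


def maximal_runs(text):
    """Return maximal runs (length >= 2) of consecutive ASCII characters."""
    runs = []
    # Adjacent characters share the key ord(ch) - i only when the second
    # character's ASCII value is exactly one more than the first's.
    for _, group in groupby(enumerate(text), key=lambda pair: ord(pair[1]) - pair[0]):
        run = "".join(ch for _, ch in group)
        if len(run) > 1:
            runs.append(run)
    return runs


sentence = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

runs = maximal_runs(sentence)
print(runs)  # ['UV', 'gh', 'stu', 'ABC', 'gh']

# Remove duplicates in order of first appearance, then put the longest first
print(sorted(dict.fromkeys(runs), key=len, reverse=True))
# ['stu', 'ABC', 'UV', 'gh']
```

This makes a single pass over the string, so it should scale better than checking all substrings when the input is long.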

Method 3: Use zip() to find consecutive alphabetical occurrences

Here's a concise solution using zip() to find the longest consecutive character sequences.

sentence = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

# Create pairs and find consecutive sequences
sequences = []
current = ""

for a, b in zip(sentence, sentence[1:]):
    if ord(b) == ord(a) + 1:
        current = current + a if current else a
    else:
        if current:
            sequences.append(current + a)
            current = ""

# Add last sequence if exists
if current:
    sequences.append(current + sentence[-1])

# Sort by length and remove duplicates
final_sequences = sorted(set(sequences), key=len, reverse=True)

print(final_sequences)  

This version uses zip() to create pairs of adjacent characters, which many Python programmers find cleaner than manual indexing. The code processes the string in a single pass, building sequences as it goes. When it finds characters with consecutive ASCII values, it extends the current sequence, and when the consecutive pattern breaks, it saves the sequence if one exists. The final step removes duplicates with set() and sorts the unique sequences in descending order of length.

One possible output is shown below. As in Method 2, the order of sequences that have the same length can vary between runs because set() is unordered:

['stu', 'ABC', 'gh', 'UV']
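To see what zip() contributes here, it can help to look at the pairs on a tiny input. The short string "ABCx" below is just an illustrative example, not part of the original program.

```python
word = "ABCx"

# zip() stops at the shorter argument, so each character is paired
# with the one that follows it
pairs = list(zip(word, word[1:]))
print(pairs)  # [('A', 'B'), ('B', 'C'), ('C', 'x')]

# A difference of exactly 1 marks a consecutive pair
steps = [ord(b) - ord(a) for a, b in pairs]
print(steps)  # [1, 1, 53]
```

The first two pairs have a difference of 1 and so belong to the run "ABC", while the last pair breaks the run.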

Method 4: Use a lambda filter

Here’s a final approach:

string = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

# Create initial pairs
pairs = [(string[i], string[i+1]) for i in range(len(string)-1)]

# Filter for consecutive alphabetic characters
consecutive_pairs = filter(lambda pair: pair[0].isalpha() and 
                                     pair[1].isalpha() and 
                                     ord(pair[1]) - ord(pair[0]) == 1, pairs)

# Combine consecutive pairs into longer sequences
sequences = []
current_seq = []
for pair in consecutive_pairs:
    if not current_seq or pair[0] == current_seq[-1]:
        current_seq.extend(pair if not current_seq else [pair[1]])
    else:
        if len(current_seq) > 1:
            sequences.append(''.join(current_seq))
        current_seq = list(pair)

if len(current_seq) > 1:
    sequences.append(''.join(current_seq))

print(sequences)

This code, like some of the other approaches, begins by creating pairs of adjacent characters, here using a list comprehension that generates tuples of neighboring characters from the input string. Then it applies filter() with a lambda function that checks three conditions: that the first character is a letter, that the second character is a letter (both using isalpha()), and that the ASCII value of the second character is exactly one more than that of the first.

After filtering, the code combines these pairs into longer sequences by maintaining a current_seq list that builds up sequences of consecutive characters. When processing each pair, if the first character of the pair matches the last character in the current sequence, it extends the sequence with the second character; otherwise, it saves the current sequence (if it's longer than one character) and starts a new one. Note that this merge compares characters rather than positions, so two separate runs such as "ab" and "bc" in the string "ab bc" would be joined into "abc". The sample sentence contains no such case.

The code handles the final sequence separately to ensure it's not lost, and joins the characters of each sequence into a string before storing it in the sequences list. This approach identifies patterns like "UV" or "ABC" where each letter follows the previous one in the alphabet. It starts in a functional style (filter() with a lambda) and switches to an ordinary loop for the sequence-building phase. The output will be:

['UV', 'gh', 'stu', 'ABC', 'gh']

Note that the order differs from Methods 2 and 3 because the results are not sorted. You will also need to remove duplicates, but we will leave this as an exercise for you.
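One thing to be aware of with the pair-merging loop is that it joins pairs by comparing characters, so in a string like "ab bc" the separate runs "ab" and "bc" would be merged into "abc". A possible variant, sketched below, filters positions instead of character pairs and merges only pairs that really are adjacent in the string. The function name letter_runs and the variables run_start and run_end are our own.

```python
def letter_runs(text):
    """Return runs of consecutive letters, merging pairs by position."""
    # Keep each index i where text[i] and text[i + 1] are consecutive letters
    starts = filter(
        lambda i: text[i].isalpha()
        and text[i + 1].isalpha()
        and ord(text[i + 1]) - ord(text[i]) == 1,
        range(len(text) - 1),
    )

    runs = []
    run_start = run_end = None  # current run is text[run_start:run_end + 1]
    for i in starts:
        if run_end == i:
            run_end = i + 1  # this pair begins where the previous pair ended
        else:
            if run_start is not None:
                runs.append(text[run_start:run_end + 1])
            run_start, run_end = i, i + 1
    if run_start is not None:
        runs.append(text[run_start:run_end + 1])
    return runs


string = "UV light is essential for vitamin D production, and studies show that ABC proteins can be influenced by prolonged exposure to sunlight."

print(letter_runs(string))   # ['UV', 'gh', 'stu', 'ABC', 'gh']
print(letter_runs("ab bc"))  # ['ab', 'bc']
```

For the sample sentence the result is the same as before; the difference only shows up when one run ends with the letter that the next run starts with.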

Practical Applications

Where would you need to detect consecutive character sequences in strings? One common area is in security. Detecting sequential characters helps prevent weak passwords by identifying common patterns like 'abc123' or '12345' that make passwords vulnerable to dictionary attacks.

In general, we can use this idea to detect “sequential character spam”, where users might enter nonsense inputs in forms. Identifying and processing consecutive sequences helps in data cleaning by finding and removing unwanted character patterns, validating input data formats, and flagging potential data entry errors.
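As a rough illustration of the password idea, the sketch below flags any input containing a run of at least three consecutive letters or digits. The function name has_sequential_run and the threshold of three are assumptions made for this example, not an established rule, and min_length is assumed to be 2 or more.

```python
def has_sequential_run(text, min_length=3):
    """Return True if text contains min_length or more consecutive
    letters or digits, such as 'abc' or '123'."""
    run = 1  # length of the run ending at the current character
    for prev, ch in zip(text, text[1:]):
        if prev.isalnum() and ch.isalnum() and ord(ch) == ord(prev) + 1:
            run += 1
            if run >= min_length:
                return True
        else:
            run = 1
    return False


print(has_sequential_run("abc123"))     # True
print(has_sequential_run("12345"))      # True
print(has_sequential_run("blue42sky"))  # False
```

A real password checker would combine a test like this with other rules, such as minimum length and checks against lists of known weak passwords.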


Copyright @ Kodeclik 2024. All rights reserved.