
Regular Expressions in Python: Pattern Matching with re Module

2026-01-18 · 5 min

Regular expressions provide powerful pattern matching capabilities for searching, validating, and manipulating text through specialized syntax describing character sequences, positions, and repetitions. Python's re module implements regular expression operations enabling developers to match email addresses, validate phone numbers, extract data from logs, clean user input, parse structured text, and perform complex find-and-replace operations that would require extensive string manipulation code otherwise. Regular expressions use metacharacters with special meanings, quantifiers controlling repetition, character classes matching sets of characters, anchors specifying positions, and groups capturing matched portions, creating a compact domain-specific language for text processing that dramatically simplifies complex string operations.

This comprehensive guide explores the re module's core functions including re.search() finding first pattern matches, re.match() checking pattern presence at string starts, re.findall() extracting all matches, re.finditer() returning match objects for iteration, re.sub() performing pattern-based replacements, and re.split() splitting strings on pattern boundaries. Essential metacharacters including dot for any character, asterisk and plus for repetition, question mark for optionality, brackets for character classes, parentheses for groups, caret and dollar for anchors, and backslash for escaping provide building blocks for patterns. Character classes like \d for digits, \w for word characters, and \s for whitespace simplify common patterns. Groups and capturing with parentheses enable extracting matched portions and back-references. Practical applications span email validation, phone number extraction, log parsing, data cleaning, URL matching, and text processing. Best practices cover raw strings preventing escape conflicts, compiling frequently used patterns for performance, using verbose mode for complex patterns, and balancing regex power with code readability.

Core re Module Functions

The re module provides several functions for pattern matching, each serving specific use cases. The re.search() function scans the entire string finding the first match anywhere, re.match() checks only the string's beginning, re.findall() returns all matches as a list, re.finditer() returns an iterator of match objects, re.sub() replaces pattern matches with strings, and re.split() divides strings at pattern boundaries. Understanding when to use each function enables effective pattern matching for different requirements.

python · re_basic_functions.py

# Core re Module Functions

import re

text = "The quick brown fox jumps over the lazy dog"

# === re.search() - Find first match anywhere ===
# Returns match object or None
match = re.search(r'fox', text)
if match:
    print(f"Found: {match.group()}")  # Output: Found: fox
    print(f"Position: {match.start()}-{match.end()}")  # Position: 16-19

# No match returns None
match = re.search(r'cat', text)
print(match)  # Output: None

# === re.match() - Match only at string start ===
# Returns match object only if pattern at beginning
match = re.match(r'The', text)
print(match.group() if match else "No match")  # Output: The

# This won't match (fox not at start)
match = re.match(r'fox', text)
print(match)  # Output: None

# === re.findall() - Find all matches ===
# Returns list of all matches
text = "Contact: 123-456-7890 or 987-654-3210"
phones = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(phones)  # Output: ['123-456-7890', '987-654-3210']

# Find all words
text = "Python is awesome!"
words = re.findall(r'\w+', text)
print(words)  # Output: ['Python', 'is', 'awesome']

# === re.finditer() - Iterator of match objects ===
# Returns iterator, useful for large results
text = "Error at line 10, Warning at line 25, Error at line 40"
for match in re.finditer(r'line (\d+)', text):
    print(f"Found at position {match.start()}: {match.group()}, Line: {match.group(1)}")
# Output:
# Found at position 9: line 10, Line: 10
# Found at position 29: line 25, Line: 25
# Found at position 47: line 40, Line: 40

# === re.sub() - Replace matches ===
# Substitute pattern with replacement
text = "Hello World! Hello Python!"
result = re.sub(r'Hello', 'Hi', text)
print(result)  # Output: Hi World! Hi Python!

# Replace with limit
result = re.sub(r'Hello', 'Hi', text, count=1)
print(result)  # Output: Hi World! Hello Python!

# Replace with function
def uppercase_match(match):
    return match.group().upper()

text = "hello world"
result = re.sub(r'\w+', uppercase_match, text)
print(result)  # Output: HELLO WORLD

# === re.split() - Split on pattern ===
# Split string by pattern
text = "apple,banana;orange:grape"
fruits = re.split(r'[,;:]', text)
print(fruits)  # Output: ['apple', 'banana', 'orange', 'grape']

# Split on whitespace
text = "Python   is    awesome"
words = re.split(r'\s+', text)
print(words)  # Output: ['Python', 'is', 'awesome']

# Split with limit
text = "one,two,three,four,five"
parts = re.split(r',', text, maxsplit=2)
print(parts)  # Output: ['one', 'two', 'three,four,five']

# === re.compile() - Compile pattern for reuse ===
# Compile once, use multiple times (more efficient)
pattern = re.compile(r'\d+')  # Match digits

text1 = "I have 10 apples"
text2 = "You have 5 oranges"

print(pattern.findall(text1))  # Output: ['10']
print(pattern.findall(text2))  # Output: ['5']

# === Match object methods ===
match = re.search(r'(\w+) (\d+)', "Python 3")
if match:
    print(match.group())    # Full match: Python 3
    print(match.group(0))   # Same as group(): Python 3
    print(match.group(1))   # First group: Python
    print(match.group(2))   # Second group: 3
    print(match.groups())   # All groups: ('Python', '3')
    print(match.start())    # Start position: 0
    print(match.end())      # End position: 8
    print(match.span())     # Span: (0, 8)

search() vs match(): Use re.search() to find patterns anywhere in strings. Use re.match() only when verifying strings start with specific patterns. Most use cases need search().
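
The difference is easiest to see side by side. The sketch below is illustrative only: the sample string is made up, and it also uses re.fullmatch(), a standard re function not covered above, which succeeds only when the pattern matches the entire string.

```python
import re

text = "order 66 shipped"

# search() scans the whole string and returns the first match it finds.
found_anywhere = re.search(r'\d+', text)

# match() only succeeds if the pattern matches starting at index 0.
found_at_start = re.match(r'\d+', text)

# fullmatch() succeeds only if the pattern consumes the entire string.
whole_string = re.fullmatch(r'\d+', "66")
partial_string = re.fullmatch(r'\d+', "66 shipped")

print(found_anywhere.group())  # 66
print(found_at_start)          # None
print(whole_string.group())    # 66
print(partial_string)          # None
```

For validation tasks, fullmatch() is often a closer fit than match() because it removes the need for an explicit $ anchor.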

Metacharacters and Special Sequences

Metacharacters are characters with special meanings in regular expressions. The dot . matches any character except newline, asterisk * means zero or more repetitions, plus + means one or more, question mark ? means zero or one, brackets [] define character classes, parentheses () create groups, caret ^ anchors to start, dollar $ anchors to end, pipe | means alternation, and backslash \ escapes special characters. Understanding metacharacters enables building sophisticated patterns matching complex text structures.

python · metacharacters.py

# Metacharacters and Special Sequences

import re

# === Dot (.) - Matches any character except newline ===
pattern = r'c.t'
print(re.findall(pattern, "cat cot cut c@t c\nt"))
# Output: ['cat', 'cot', 'cut', 'c@t']
# Note: c\nt not matched (newline)

# === Asterisk (*) - Zero or more repetitions ===
pattern = r'ab*c'  # a followed by zero or more b's, then c
print(re.findall(pattern, "ac abc abbc abbbc"))
# Output: ['ac', 'abc', 'abbc', 'abbbc']

# === Plus (+) - One or more repetitions ===
pattern = r'ab+c'  # a followed by one or more b's, then c
print(re.findall(pattern, "ac abc abbc abbbc"))
# Output: ['abc', 'abbc', 'abbbc']
# Note: 'ac' not matched (needs at least one b)

# === Question mark (?) - Zero or one occurrence ===
pattern = r'colou?r'  # Optional 'u'
print(re.findall(pattern, "color colour"))
# Output: ['color', 'colour']

# === Curly braces {} - Specific repetition ===
pattern = r'\d{3}'  # Exactly 3 digits
print(re.findall(pattern, "123 45 6789"))
# Output: ['123', '678']

pattern = r'\d{2,4}'  # 2 to 4 digits
print(re.findall(pattern, "1 12 123 1234 12345"))
# Output: ['12', '123', '1234', '1234']

pattern = r'\d{3,}'  # 3 or more digits
print(re.findall(pattern, "12 123 1234"))
# Output: ['123', '1234']

# === Square brackets [] - Character class ===
pattern = r'[aeiou]'  # Match any vowel
print(re.findall(pattern, "hello world"))
# Output: ['e', 'o', 'o']

pattern = r'[a-z]'  # Match lowercase letters
print(re.findall(pattern, "Hello123"))
# Output: ['e', 'l', 'l', 'o']

pattern = r'[A-Z]'  # Match uppercase letters
print(re.findall(pattern, "Hello World"))
# Output: ['H', 'W']

pattern = r'[0-9]+'  # Match one or more digits
print(re.findall(pattern, "My age is 25 and zip is 12345"))
# Output: ['25', '12345']

pattern = r'[^0-9]'  # Match anything except digits (^ negates)
print(re.findall(pattern, "abc123"))
# Output: ['a', 'b', 'c']

# === Caret (^) - Start of string ===
pattern = r'^Hello'
print(re.search(pattern, "Hello World"))  # Match
print(re.search(pattern, "Say Hello"))    # None

# === Dollar ($) - End of string ===
pattern = r'World$'
print(re.search(pattern, "Hello World"))  # Match
print(re.search(pattern, "World Hello"))  # None

# === Pipe (|) - Alternation (OR) ===
pattern = r'cat|dog'
print(re.findall(pattern, "I have a cat and a dog"))
# Output: ['cat', 'dog']

pattern = r'(Mr|Ms|Dr)\. \w+'
print(re.findall(pattern, "Dr. Smith and Ms. Jones"))
# Output: ['Dr', 'Ms'] (with one capturing group, findall returns only that group)

# === Backslash (\) - Escape special characters ===
pattern = r'\.'  # Match literal dot
print(re.findall(pattern, "example.com"))
# Output: ['.']

pattern = r'\$\d+'  # Match dollar sign followed by digits
print(re.findall(pattern, "Price: $100"))
# Output: ['$100']

# === Special sequences ===

# \d - Digit (equivalent to [0-9])
print(re.findall(r'\d+', "abc123def456"))
# Output: ['123', '456']

# \D - Non-digit (equivalent to [^0-9])
print(re.findall(r'\D+', "abc123def456"))
# Output: ['abc', 'def']

# \w - Word character (alphanumeric + underscore)
print(re.findall(r'\w+', "hello_world 123!"))
# Output: ['hello_world', '123']

# \W - Non-word character
print(re.findall(r'\W+', "hello world!"))
# Output: [' ', '!']

# \s - Whitespace (space, tab, newline)
print(re.split(r'\s+', "hello   world\tpython\ncode"))
# Output: ['hello', 'world', 'python', 'code']

# \S - Non-whitespace
print(re.findall(r'\S+', "hello world"))
# Output: ['hello', 'world']

# \b - Word boundary
pattern = r'\bcat\b'  # Match 'cat' as whole word
print(re.findall(pattern, "cat concatenate cats"))
# Output: ['cat']

# \B - Non-word boundary
pattern = r'\Bcat\B'  # Match 'cat' not as whole word
print(re.findall(pattern, "cat concatenate cats"))
# Output: ['cat'] (from concatenate)

Always Use Raw Strings: Prefix regex patterns with r like r'\d+' to avoid Python interpreting backslashes. Without raw strings, you'd need double backslashes: '\\d+'.
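
A case where a missing r prefix fails silently is \b. The following is a small sketch with a made-up sample string: in an ordinary string literal, '\b' is the backspace character, so the regex engine never receives a word-boundary token.

```python
import re

text = "cat concatenate"

# Without the r prefix, Python turns '\b' into a backspace (\x08)
# before the re module ever sees the pattern.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain), len(raw))     # 5 7
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['cat']

# Doubling each backslash produces the same pattern as the raw string.
print('\\bcat\\b' == raw)       # True
```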

Groups and Capturing

Parentheses in regex create groups that capture matched portions for extraction or back-references. Groups enable extracting specific parts of matches like area codes from phone numbers, domains from URLs, or data fields from logs. Named groups using the (?P<name>...) syntax give captured data descriptive names, which improves code readability. Non-capturing groups (?:...) group part of a pattern without capturing it, for cases where you need grouping but not extraction.

python · groups_capturing.py

# Groups and Capturing

import re

# === Basic groups with parentheses ===
pattern = r'(\d{3})-(\d{3})-(\d{4})'  # Phone number pattern
text = "Call me at 555-123-4567"
match = re.search(pattern, text)

if match:
    print(match.group())    # Full match: 555-123-4567
    print(match.group(0))   # Same: 555-123-4567
    print(match.group(1))   # First group: 555
    print(match.group(2))   # Second group: 123
    print(match.group(3))   # Third group: 4567
    print(match.groups())   # All groups: ('555', '123', '4567')

# === Extract multiple parts ===
pattern = r'(\w+)@(\w+)\.(\w+)'  # Email pattern
email = "user@example.com"
match = re.search(pattern, email)

if match:
    username = match.group(1)  # user
    domain = match.group(2)    # example
    tld = match.group(3)       # com
    print(f"User: {username}, Domain: {domain}, TLD: {tld}")

# === Named groups (?P<name>...) ===
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
date = "2024-03-15"
match = re.search(pattern, date)

if match:
    print(match.group('year'))   # 2024
    print(match.group('month'))  # 03
    print(match.group('day'))    # 15
    print(match.groupdict())     # {'year': '2024', 'month': '03', 'day': '15'}

# === Non-capturing groups (?:...) ===
# Group without capturing (more efficient when capture not needed)
pattern = r'(?:Mr|Ms|Dr)\. (\w+)'  # Don't capture title
text = "Dr. Smith and Ms. Jones"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Smith', 'Jones'] (only names, not titles)

# With capturing (for comparison)
pattern = r'(Mr|Ms|Dr)\. (\w+)'
matches = re.findall(pattern, text)
print(matches)  # Output: [('Dr', 'Smith'), ('Ms', 'Jones')]

# === Back-references ===
# Reference captured groups within pattern
pattern = r'(\w+) \1'  # Match repeated words
text = "hello hello world world"
matches = re.findall(pattern, text)
print(matches)  # Output: ['hello', 'world']

# Detect duplicate words
pattern = r'\b(\w+)\s+\1\b'
text = "This is is a test test"
matches = re.findall(pattern, text)
print(matches)  # Output: ['is', 'test']

# === Groups in findall ===
# findall returns groups if pattern has groups
pattern = r'(\w+)@(\w+\.\w+)'  # Email with groups
text = "Contact: user@example.com or admin@test.org"
matches = re.findall(pattern, text)
print(matches)
# Output: [('user', 'example.com'), ('admin', 'test.org')]

# Without groups, returns full match
pattern = r'\w+@\w+\.\w+'
matches = re.findall(pattern, text)
print(matches)
# Output: ['user@example.com', 'admin@test.org']

# === Groups in sub() ===
# Reference groups in replacement
pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = "Phone: 555-123-4567"
result = re.sub(pattern, r'(\1) \2-\3', text)
print(result)  # Output: Phone: (555) 123-4567

# Named groups in replacement
pattern = r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})'
result = re.sub(pattern, r'(\g<area>) \g<prefix>-\g<line>', text)
print(result)  # Output: Phone: (555) 123-4567

# === Extract structured data ===
log = "2024-03-15 10:30:45 ERROR Database connection failed"
pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'
match = re.search(pattern, log)

if match:
    data = match.groupdict()
    print(data)
    # Output: {'date': '2024-03-15', 'time': '10:30:45', 
    #          'level': 'ERROR', 'message': 'Database connection failed'}

# === Optional groups ===
pattern = r'(?:(Mr|Ms|Dr)\. )?(\w+)'  # Optional title, together with its dot and space
text = "Dr. Smith, Ms. Jones, Alice"
matches = re.findall(pattern, text)
print(matches)
# Output: [('Dr', 'Smith'), ('Ms', 'Jones'), ('', 'Alice')]

Named Groups for Clarity: Use named groups (?P<name>...) for complex patterns. match.group('email') is much clearer than match.group(3) when patterns have many groups.
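
Names also work inside the pattern itself. As a minimal sketch, the duplicate-word detector from the back-reference example can be rewritten with (?P=word), the named form of \1. The helper name find_duplicates is hypothetical and only serves the illustration.

```python
import re

# (?P=word) must repeat exactly what (?P<word>...) captured earlier.
DUPLICATE_WORD = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')

def find_duplicates(text):
    """Return (word, start_index) for each immediately repeated word."""
    return [(m.group('word'), m.start()) for m in DUPLICATE_WORD.finditer(text)]

print(find_duplicates("This is is a test test"))
# [('is', 5), ('test', 13)]
```

Using finditer() instead of findall() keeps the match objects available, so positions and named groups can be read together.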

Practical Applications

Regular expressions solve real-world text processing problems including validating email addresses and phone numbers, extracting URLs from text, parsing log files for error patterns, cleaning and normalizing user input, finding and replacing text with patterns, extracting data from structured text, and validating input formats. These applications demonstrate regex power for tasks requiring sophisticated pattern matching beyond simple string methods, enabling robust text processing in production applications.

python · practical_applications.py

# Practical Applications

import re

# === Email validation ===
def validate_email(email):
    """Validate email format."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

print(validate_email("user@example.com"))     # True
print(validate_email("invalid.email"))        # False
print(validate_email("user@domain"))          # False

# === Phone number extraction ===
text = """
Contact us at 555-123-4567 or (555) 987-6543.
International: +1-555-456-7890
"""

patterns = [
    r'(?<![\d-])\d{3}-\d{3}-\d{4}',  # 555-123-4567 (lookbehind skips the tail of +1-555-...)
    r'\(\d{3}\)\s*\d{3}-\d{4}',    # (555) 123-4567
    r'\+\d{1,3}-\d{3}-\d{3}-\d{4}' # +1-555-123-4567
]

phones = []
for pattern in patterns:
    phones.extend(re.findall(pattern, text))

print("Phone numbers:", phones)
# Output: ['555-123-4567', '(555) 987-6543', '+1-555-456-7890']

# === URL extraction ===
text = "Visit https://example.com or http://test.org/page"
pattern = r'https?://[\w.-]+(?:/[\w.-]*)*'
urls = re.findall(pattern, text)
print("URLs:", urls)
# Output: ['https://example.com', 'http://test.org/page']

# === Log parsing ===
log_entries = """
2024-03-15 10:30:45 ERROR Database connection failed
2024-03-15 10:31:12 INFO User logged in: alice
2024-03-15 10:32:05 ERROR File not found: data.txt
2024-03-15 10:33:20 WARNING Memory usage high: 85%
"""

pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)'
errors = []

for line in log_entries.strip().split('\n'):
    match = re.search(pattern, line)
    if match and match.group(2) == 'ERROR':
        errors.append({
            'timestamp': match.group(1),
            'level': match.group(2),
            'message': match.group(3)
        })

print("Errors found:", len(errors))
for error in errors:
    print(f"{error['timestamp']}: {error['message']}")

# === Input sanitization ===
def sanitize_username(username):
    """Remove everything except letters, digits, and underscores."""
    return re.sub(r'[^a-zA-Z0-9_]', '', username)

print(sanitize_username("user@123!"))  # Output: user123
print(sanitize_username("john_doe"))   # Output: john_doe

# === Password validation ===
def validate_password(password):
    """
    Validate password:
    - At least 8 characters
    - At least one uppercase letter
    - At least one lowercase letter
    - At least one digit
    - At least one special character
    """
    if len(password) < 8:
        return False
    if not re.search(r'[A-Z]', password):
        return False
    if not re.search(r'[a-z]', password):
        return False
    if not re.search(r'\d', password):
        return False
    if not re.search(r'[!@#$%^&*(),.?":{}|<>]', password):
        return False
    return True

print(validate_password("Weak123"))        # False (only 7 characters, no special char)
print(validate_password("Strong@123"))     # True

# === Extract prices ===
text = "Items: $10.99, $25.50, and $100.00"
prices = re.findall(r'\$\d+\.\d{2}', text)
print("Prices:", prices)
# Output: ['$10.99', '$25.50', '$100.00']

# Convert to float
prices_float = [float(p.replace('$', '')) for p in prices]
print("Total:", round(sum(prices_float), 2))  # Output: Total: 136.49

# === Find hashtags ===
text = "Love #Python! #Coding is fun. #AI and #MachineLearning"
hashtags = re.findall(r'#\w+', text)
print("Hashtags:", hashtags)
# Output: ['#Python', '#Coding', '#AI', '#MachineLearning']

# === Replace sensitive data ===
def mask_credit_card(text):
    """Mask credit card numbers."""
    pattern = r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'
    return re.sub(pattern, 'XXXX-XXXX-XXXX-XXXX', text)

text = "Card: 1234-5678-9012-3456 or 1111222233334444"
print(mask_credit_card(text))
# Output: Card: XXXX-XXXX-XXXX-XXXX or XXXX-XXXX-XXXX-XXXX

# === Extract markdown links ===
markdown = "Check [Python](https://python.org) and [GitHub](https://github.com)"
pattern = r'\[([^\]]+)\]\(([^)]+)\)'
links = re.findall(pattern, markdown)
for text, url in links:
    print(f"{text}: {url}")
# Output:
# Python: https://python.org
# GitHub: https://github.com

# === Convert camelCase to snake_case ===
def camel_to_snake(name):
    """Convert camelCase to snake_case."""
    pattern = r'(?<!^)(?=[A-Z])'
    return re.sub(pattern, '_', name).lower()

print(camel_to_snake("getUserName"))    # get_user_name
print(camel_to_snake("parseHTMLString")) # parse_h_t_m_l_string
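
The last result splits the acronym HTML into single letters. If acronyms should stay together, one possible variant (a sketch, not the only approach) inserts an underscore only where a lowercase letter or digit meets an uppercase letter, or where an acronym ends and a capitalized word begins. The names CAMEL_SPLIT and camel_to_snake_acronyms are hypothetical.

```python
import re

# Two zero-width alternatives:
#   1. between a lowercase letter or digit and an uppercase letter (userName)
#   2. between the end of an acronym and the next capitalized word (HTMLString)
CAMEL_SPLIT = re.compile(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def camel_to_snake_acronyms(name):
    """Convert camelCase to snake_case, keeping acronyms together."""
    return CAMEL_SPLIT.sub('_', name).lower()

print(camel_to_snake_acronyms("getUserName"))      # get_user_name
print(camel_to_snake_acronyms("parseHTMLString"))  # parse_html_string
print(camel_to_snake_acronyms("HTTPServer"))       # http_server
```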

Best Practices

Always use raw strings: Prefix regex patterns with r like r'\d+' to avoid Python interpreting backslashes as escape sequences
Compile frequently used patterns: Use re.compile() for patterns used many times, such as inside loops. The re module caches recently used patterns, so the speedup is often modest, but a named compiled pattern also documents intent
Use verbose mode for complex patterns: The re.VERBOSE flag allows multi-line patterns with comments explaining each part
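Be specific with patterns: Use \d instead of . when you expect a digit, and \s instead of a literal space when any whitespace is acceptable. Specific patterns are more reliable and less prone to accidental matches
Use non-capturing groups when appropriate: Use (?:...) instead of (...) when you don't need the captured text. This keeps group numbering simple and stops findall() from returning groups you did not want
Be careful with greedy quantifiers: .* (greedy) matches as much as possible, while .*? (non-greedy) matches as little as possible. Prefer the non-greedy form when matching delimited text such as quoted strings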
Test patterns thoroughly: Use online regex testers like regex101.com to test and debug patterns before implementing
Consider simpler alternatives first: String methods like str.startswith() or in are simpler and faster for basic checks
Use named groups for clarity: Named groups (?P<name>...) make code more readable than numbered groups, especially in complex patterns
Document complex regex patterns: Add comments explaining what complex patterns match. Regex can be cryptic; documentation helps maintainability
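
Several of these practices can be combined in one place. The sketch below is illustrative: it reuses the log format from the earlier examples, compiles the pattern once with re.VERBOSE, and then contrasts greedy and non-greedy matching on a made-up string. LOG_LINE is a hypothetical name.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
LOG_LINE = re.compile(r"""
    (?P<date>\d{4}-\d{2}-\d{2})   # date, e.g. 2024-03-15
    \s+
    (?P<time>\d{2}:\d{2}:\d{2})   # time, e.g. 10:30:45
    \s+
    (?P<level>[A-Z]+)             # log level, e.g. ERROR
    \s+
    (?P<message>.*)               # rest of the line
""", re.VERBOSE)

entry = LOG_LINE.match("2024-03-15 10:30:45 ERROR Database connection failed")
print(entry.groupdict())
# {'date': '2024-03-15', 'time': '10:30:45', 'level': 'ERROR',
#  'message': 'Database connection failed'}

# Greedy vs. non-greedy on delimited text.
quoted = 'say "hello" and "goodbye"'
print(re.findall(r'".*"', quoted))   # ['"hello" and "goodbye"']
print(re.findall(r'".*?"', quoted))  # ['"hello"', '"goodbye"']
```

Note that in verbose mode a literal space in the pattern must be written as \s, a backslash-escaped space, or placed inside a character class, because unescaped whitespace is ignored.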

When NOT to Use Regex: Don't parse HTML/XML with regex—use BeautifulSoup or lxml. Don't over-engineer simple string operations. Regex is powerful but not always the best tool.
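
To make this concrete, here is a hedged sketch of both alternatives. The first part uses plain string methods for simple checks. The second collects links from HTML with the standard library's html.parser, used here as a dependency-free stand-in for BeautifulSoup or lxml. The LinkCollector class and the sample inputs are hypothetical.

```python
from html.parser import HTMLParser

# Simple checks: string methods are clearer than a regex.
filename = "report_2024.csv"
print(filename.startswith("report_"))  # True
print(filename.endswith(".csv"))       # True
print("2024" in filename)              # True


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p>See <a href="https://python.org">Python</a> '
               'and <a class="x" href="/docs">docs</a>.</p>')
print(collector.links)  # ['https://python.org', '/docs']
```

A real parser copes with attribute order, quoting styles, and nesting, which a regex would have to handle case by case.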

Conclusion

Regular expressions provide powerful pattern matching through Python's re module enabling sophisticated text processing for searching, validating, extracting, and manipulating strings with compact specialized syntax. Core functions include re.search() finding first matches anywhere in strings, re.match() checking pattern presence at string starts, re.findall() returning all matches as lists, re.finditer() providing match object iterators for large results, re.sub() performing pattern-based replacements, and re.split() dividing strings at pattern boundaries. Metacharacters form pattern building blocks with dot matching any character, asterisk and plus controlling repetition, question mark indicating optionality, brackets defining character classes, parentheses creating groups, caret and dollar anchoring to positions, pipe providing alternation, and backslash escaping special characters. Special sequences like \d for digits, \w for word characters, \s for whitespace, and \b for word boundaries simplify common patterns avoiding verbose character class definitions.

Groups created with parentheses capture matched portions, which you can read by number or by name through match.group() and match.groupdict(), while non-capturing groups (?:...) group without capturing. The practical examples applied these tools to email validation, phone number extraction across several formats, URL detection, log parsing, input sanitization, password validation, price extraction and summation, hashtag finding, credit card masking, and camelCase to snake_case conversion.

The best practices come down to a few habits: write patterns as raw strings, compile patterns you reuse, use verbose mode and comments for complex patterns, prefer specific character classes, use non-capturing and named groups where they help, be careful with greedy quantifiers, test patterns thoroughly, and reach for plain string methods when they are enough. With these functions, metacharacters, groups, and habits in hand, you can handle validation, extraction, transformation, and analysis tasks that go beyond simple string operations.