I l@ve RuBoard

18.4 Regular Expression Matching

Splitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks, Python provides regular expression matching utilities. Regular expressions (REs) are simply strings that define patterns to be matched against other strings. You supply a pattern and a string, and ask if the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, matches not only give a yes/no answer, but they can pick out substrings as well.

Regular expression pattern strings can be complicated (let's be honest -- they can be downright gross to look at). But once you get the hang of them, they can replace larger hand-coded string search routines. In Python, regular expressions are not part of the syntax of the Python language itself, but are supported by extension modules that you must import to use. The modules define functions for compiling pattern strings into pattern objects, matching these objects against strings, and fetching matched substrings after a match.

Beyond those generalities, Python's regular expression story is complicated a little by history:

The regex module (old): In earlier Python releases, a module called regex was the standard (and only) RE module. It was fast and supported patterns coded in awg, grep, and emacs style, but it is now somewhat deprecated (though it will likely still be available for some time to come).
The re module (new): Today, you should use re, a new RE module for Python, that was introduced sometime around Python release 1.5. This module provides a much richer RE pattern syntax that tries to be close to that used to code patterns in the Perl language (yes, REs are a feature of Perl worth emulating). For instance, re supports the notions of named groups, character classes, and non-greedy matches -- RE pattern operators that match as few characters as possible (other RE pattern operators always match the longest possible substring).

Up until very recently, re was generally slower than regex, so you had to choose between speed and Perl-like RE syntax. Today, though, re has been optimized with the sre implementation, to the extent that regex no longer offers any clear advantages. Moreover, re in Python 2.0 now supports matching Unicode strings (strings with 16-bit wide characters for representing large character sets).

Because of this migration, I've recoded RE examples in this text to use the new re module instead of regex. The old regex-based versions are still available on the book's CD (see http://examples.oreilly.com/python2), in directory PP2E\lang\old-regex. If you find yourself having to migrate old regex code, you can also find a document describing the translation steps needed at http://www.python.org. Both modules' interfaces are similar, but re introduces a match object and changes pattern syntax in minor ways.

Having said that, I also want to warn you that REs are a complex topic that cannot be covered in depth here. If this area sparks your interest, the text Mastering Regular Expressions from O'Reilly is a good next step to take.

18.4.1 Using the re Module

The Python re module comes with functions that can search for patterns right away or make compiled pattern objects for running matches later. Pattern objects (and module search calls) in turn generate match objects, which contain information about successful matches and matched substrings. The next few sections describe the module's interfaces and some of the operators you can use to code patterns.

18.4.1.1 Module functions

The top level of the module provides functions for matching, substitution, pre-compiling, and so on:

compile(pattern [, flags]): Compile a RE pattern string into a regular expression object, for later matching. See the reference manual for the flags argument's meaning.
match(pattern, string [, flags]): If zero or more characters at start of string match the pattern string, return a corresponding MatchObject instance, or None if no match is found.
search(pattern, string [, flags]): Scan through string for a location matching pattern, and return a corresponding MatchObject instance, or None if no match is found.
split(pattern, string [, maxsplit]): Split string by occurrences of pattern. If capturing ( ) are used in the pattern, then occurrences of patterns or subpatterns are also returned.
sub(pattern, repl, string [, count]): Return the string obtained by replacing the (first count) leftmost non-overlapping occurrences of pattern (a string or a RE object) in string by repl.
subn(pattern, repl, string [, count]): Same as sub, but returns a tuple: (new-string, number-of-changes-made).

18.4.1.2 Compiled pattern objects

At the next level, pattern objects provide similar attributes, but the pattern string is implied. The re.compile function in the previous section is useful to optimize patterns that may be matched more than once (compiled patterns match faster). Pattern objects returned by re.compile have these sorts of attributes:

match(string [, pos] [, endpos])
search(string [, pos] [, endpos])
split(string [, maxsplit])
sub(repl, string [, count])
subn(repl, string [, count]): Same as the re functions, but the pattern is implied, and pos and endpos give start/end string indexes for the match.

18.4.1.3 Match objects

Finally, when a match or search function or method is successful, you get back a match object (None comes back on failed matches). Match objects export a set of attributes of their own, including:

group([g1, g2, ...]): Returns the substring that matched a parenthesized groups in the pattern.
groups( ): Returns a tuple of all groups' substrings of the match.
start([group]), end([group]): Indices of the start and end of the substring matched by group (or the entire matched string, if no group).
span([group]): Returns the two-item tuple: (start(group),end(group)).

18.4.1.4 Regular expression patterns

Regular expression strings are built up by concatenating single-character regular expression forms, shown in Table 18-1. The longest-matching string is usually matched by each form, except for the non-greedy operators. In the table, R means any regular expression form, C is a character, and N denotes a digit.

Table 18-1. re Pattern Syntax

.

Matches any character (including newline if DOTALL flag specified)

^

Matches start of the string (of every line in MULTILINE mode)

$

Matches end of the string (of every line in MULTILINE mode)

C

Any non-special character matches itself

R*

Zero or more of preceding regular expression R (as many as possible)

R+

One or more of preceding regular expression R (as many as possible)

R?

Zero or one occurrence of preceding regular expression R

R{m,n}

Matches from m to n repetitions of preceding regular expression R

R*?, R+?, R??, R{m,n}?

Same as *, +, and ? but matches as few characters/times as possible; these are known as non-greedy match operators (unlike others, they match and consume as few characters as possible)

[ ]

Defines character set: e.g. [a-zA-Z] to match all letters

[^ ]

Defines complemented character set: matches if char is not in set

\

Escapes special chars (e.g., *?+|( )) and introduces special sequences

\\

Matches a literal \ (write as \\\\ in pattern, or r\\)

R|R

Alternative: matches left or right R

RR

Concatenation: match both Rs

(R)

Matches any RE inside ( ), and forms a group (retains matched substring)

(?: R)

Same but doesn't delimit a group

(?= R)

Matches if R matches next, but doesn't consume any of the string (e.g., X (?=Y) matches X only if followed by Y)

(?! R)

Matches if R doesn't match next. Negative of (?=R)

(?P<name>R)

Matches any RE inside ( ), and delimits a named group

(?P=name)

Matches whatever text was matched by the earlier group named name

(?#...)

A comment; ignored

(?letter)

Set mode flag; letter is one of i, L, m, s, x (see library manual)

Within patterns, ranges and selections can be combined. For instance, [a-zA-Z0-9_]+ matches the longest possible string of one or more letters, digits, or underscores. Special characters can be escaped as usual in Python strings: [\t ]* matches zero or more tabs and spaces (i.e., it skips whitespace).

The parenthesized grouping construct, (R), lets you extract matched substrings after a successful match. The portion of the string matched by the expression in parentheses is retained in a numbered register. It's available through the group method of a match object after a successful match.

In addition to the entries in this table, special sequences in Table 18-2 can be used in patterns, too. Due to Python string rules, you sometimes must double up on backslashes (\\) or use Python raw strings (r'...') to retain backslashes in the pattern.

Table 18-2. re Special Sequences

\num

Match text of group num (numbered from 1)

\A

Matches only at the start of the string

\b

Empty string at word boundaries

\B

Empty string not at word boundary

\d

Any decimal digit (like [0-9])

\D

Any nondecimal digit character (like [^O-9])

\s

Any whitespace character (like [ \t\n\r\f\v])

\S

Any nonwhitespace character (like [^ \t\n\r\f\v])

\w

Any alphanumeric character (uses LOCALE flag)

\W

Any nonalphanumeric character (uses LOCALE flag)

\Z

Matches only at the end of the string

The Python library manual gives additional details. But to demonstrate how the re interfaces are typically used, we'll turn to some short examples.

18.4.2 Basic Patterns

To illustrate how to combine RE operators, let's start with a few short test files that match simple pattern forms. Comments in Example 18-3 describe the operations exercised; check Table 18-1 to see which operators are used in these patterns.

Example 18-3. PP2E\lang\re-basics.py

# literals, sets, ranges   (all print 2 = offset where pattern found)

import re                                  # the one to use today

pattern, string = "A.C.", "xxABCDxx"       # nonspecial chars match themself
matchobj = re.search(pattern, string)      # '.' means any one char
if matchobj:                               # search returns match object or None
    print matchobj.start(  )                 # start is index where matched

pattobj  = re.compile("A.*C.*")            # 'R*' means zero or more Rs
matchobj = pattobj.search("xxABCDxx")      # compile returns pattern obj
if matchobj:                               # patt.search returns match obj
    print matchobj.start(  )                     

# selection sets
print re.search(" *A.C[DE][D-F][^G-ZE]G\t+ ?", "..ABCDEFG\t..").start(  )

# alternatives
print re.search("A|XB|YC|ZD", "..AYCD..").start(  )  # R1|R2 means R1 or R2

# word boundaries
print re.search(r"\bABCD", "..ABCD ").start(  )      # \b means word boundary
print re.search(r"ABCD\b", "..ABCD ").start(  )      # use r'...' to escape '\'

Notice that there are a variety of ways to kick off a match with re -- by calling module search functions and by making compiled pattern objects. In either event, you can hang on to the resulting match object or not. All the print statements in this script show a result of 2 -- the offset where the pattern was found in the string. In the first test, for example, "A.C." matches the "ABCD" at offset 2 in the search string (i.e., after the first "xx"):

C:\...\PP2E\Lang>python re-basic.py
2
2
2
2
2
2

In Example 18-4, parts of the pattern strings enclosed in parentheses delimit groups ; the parts of the string they matched are available after the match.

Example 18-4. PP2E\lang\re-groups.py

# groups (extract substrings matched by REs in '(  )' parts)

import re

patt = re.compile("A(.)B(.)C(.)")                  # saves 3 substrings
mobj = patt.match("A0B1C2")                        # each '(  )' is a group, 1..n
print mobj.group(1), mobj.group(2), mobj.group(3)  # group(  ) gives substring

patt = re.compile("A(.*)B(.*)C(.*)")               # saves 3 substrings
mobj = patt.match("A000B111C222")                  # groups(  ) gives all groups
print mobj.groups(  )

print re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups(  )

patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)") 
mobj = patt.search(" # define  spam  1 + 2 + 3")            # parts of C #define
print mobj.groups(  )                                         # \s is whitespace

In the first test here, for instance, the three (.) groups each match a single character, but retain the character matched; calling group pulls out the bits matched. The second test's (.*) groups match and retain any number of characters. The last test here matches C #define lines; more on this later.

C:\...\PP2E\Lang>python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
('spam', '1 + 2 + 3')

Finally, besides matches and substring extraction, re also includes tools for string replacement or substitution (see Example 18-5).

Example 18-5. PP2E\lang\re-subst.py

# substitutions (replace occurrences of patt with repl in string)

import re  
print re.sub('[ABC]', '*', 'XAXAXBXBXCXC')
print re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')

In the first test, all characters in the set are replaced; in the second, they must be followed by an underscore:

C:\...\PP2E\Lang>python re-subst.py
X*X*X*X*X*X*
XA-X*XB-X*XC-X*

18.4.3 Scanning C Header Files for Patterns

The script in Example 18-6 puts these pattern operators to more practical use. It uses regular expressions to find #define and #include lines in C header files and extract their components. The generality of the patterns makes them detect a variety of line formats; pattern groups (the parts in parentheses) are used to extract matched substrings from a line after a match.

Example 18-6. PP2E\Lang\cheader.py

#! /usr/local/bin/python
import sys, re
from string import strip

pattDefine = re.compile(                               # compile to pattobj
    '^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')   # "# define xxx yyy..."

pattInclude = re.compile(
    '^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')     # "# include <xxx>..."

def scan(file):
    count = 0
    while 1:                                     # scan line-by-line
        line = file.readline(  )
        if not line: break
        count = count + 1
        matchobj = pattDefine.match(line)        # None if match fails
        if matchobj:
            name = matchobj.group(1)             # substrings for (...) parts
            body = matchobj.group(2) 
            print count, 'defined', name, '=', strip(body)
            continue
        matchobj = pattInclude.match(line)
        if matchobj:
            start, stop = matchobj.span(1)       # start/stop indexes of (...) 
            filename = line[start:stop]          # slice out of line
            print count, 'include', filename     # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                    # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))       # arg: input file name

To test, let's run this script on the text file in Example 18-7.

Example 18-7. PP2E\Lang\test.h

#ifndef TEST_H
#define TEST_H

#include <stdio.h>
#include <lib/spam.h>
#  include   "Python.h"

#define DEBUG
#define HELLO 'hello regex world'
#  define SPAM    1234

#define EGGS sunny + side + up
#define  ADDER(arg) 123 + arg
#endif

Notice the spaces after # in some of these lines; regular expressions are flexible enough to account for such departures from the norm. Here is the script at work, picking out #include and #define lines and their parts; for each matched line, it prints the line number, the line type, and any matched substrings:

C:\...\PP2E\Lang>python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg

18.4.4 A File Pattern Search Utility

The next script searches for patterns in a set of files, much like the grep command-line program. We wrote file and directory searchers earlier, in Chapter 5. Here, the file searches look for patterns instead of simple strings (see Example 18-8). The patterns are typed interactively separated by a space, and the files to be searched are specified by an input pattern for Python's glob.glob filename expansion tool we studied earlier, too.

Example 18-8. PP2E\Lang\pygrep1.py

#!/usr/local/bin/python
import sys, re, glob
from string import split

help_string = """
Usage options.
interactive:  % pygrep1.py
"""

def getargs(  ):
    if len(sys.argv) == 1:
        return split(raw_input("patterns? >")), raw_input("files? >")
    else:
        try:
            return sys.argv[1], sys.argv[2]
        except:
            print help_string
            sys.exit(1)

def compile_patterns(patterns):
    res = []
    for pattstr in patterns:
        try:
            res.append(re.compile(pattstr))           # make re patt object 
        except:                                       # or use re.match 
            print 'pattern ignored:', pattstr 
    return res

def searcher(pattfile, srchfiles):
    patts = compile_patterns(pattfile)                  # compile for speed
    for file in glob.glob(srchfiles):                   # all matching files
        lineno = 1                                      # glob uses re too
        print '\n[%s]' % file
        for line in open(file, 'r').readlines(  ):        # all lines in file
            for patt in patts:
                if patt.search(line):                   # try all patterns
                    print '%04d)' % lineno, line,       # match if not None
                    break
            lineno = lineno+1

if __name__ == '__main__': 
    apply(searcher, getargs(  ))

Here's what a typical run of this script looks like; it searches all Python files in the current directory for two different patterns, compiled for speed. Notice that files are named by a pattern, too -- Python's glob module also uses reinternally:

C:\...\PP2E\Lang>python pygrep1.py
patterns? >import.*string spam
files? >*.py

[cheader.py]

[finder2.py]
0002) import string, glob, os, sys

[patterns.py]
0048) mobj = patt.search(" # define  spam  1 + 2 + 3")

[pygrep1.py]

[rules.py]

[summer.py]
0002) import string

[__init__.py]

I l@ve RuBoard