I l@ve RuBoard |
18.4 Regular Expression MatchingSplitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks, Python provides regular expression matching utilities. Regular expressions (REs) are simply strings that define patterns to be matched against other strings. You supply a pattern and a string, and ask if the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, matches not only give a yes/no answer, but they can pick out substrings as well. Regular expression pattern strings can be complicated (let's be honest -- they can be downright gross to look at). But once you get the hang of them, they can replace larger hand-coded string search routines. In Python, regular expressions are not part of the syntax of the Python language itself, but are supported by extension modules that you must import to use. The modules define functions for compiling pattern strings into pattern objects, matching these objects against strings, and fetching matched substrings after a match. Beyond those generalities, Python's regular expression story is complicated a little by history:
Up until very recently, re was generally slower than regex, so you had to choose between speed and Perl-like RE syntax. Today, though, re has been optimized with the sre implementation, to the extent that regex no longer offers any clear advantages. Moreover, re in Python 2.0 now supports matching Unicode strings (strings with 16-bit wide characters for representing large character sets). Because of this migration, I've recoded RE examples in this text to use the new re module instead of regex. The old regex-based versions are still available on the book's CD (see http://examples.oreilly.com/python2), in directory PP2E\lang\old-regex. If you find yourself having to migrate old regex code, you can also find a document describing the translation steps needed at http://www.python.org. Both modules' interfaces are similar, but re introduces a match object and changes pattern syntax in minor ways. Having said that, I also want to warn you that REs are a complex topic that cannot be covered in depth here. If this area sparks your interest, the text Mastering Regular Expressions from O'Reilly is a good next step to take. 18.4.1 Using the re ModuleThe Python re module comes with functions that can search for patterns right away or make compiled pattern objects for running matches later. Pattern objects (and module search calls) in turn generate match objects, which contain information about successful matches and matched substrings. The next few sections describe the module's interfaces and some of the operators you can use to code patterns. 18.4.1.1 Module functionsThe top level of the module provides functions for matching, substitution, pre-compiling, and so on:
18.4.1.2 Compiled pattern objectsAt the next level, pattern objects provide similar attributes, but the pattern string is implied. The re.compile function in the previous section is useful to optimize patterns that may be matched more than once (compiled patterns match faster). Pattern objects returned by re.compile have these sorts of attributes:
18.4.1.3 Match objectsFinally, when a match or search function or method is successful, you get back a match object (None comes back on failed matches). Match objects export a set of attributes of their own, including:
18.4.1.4 Regular expression patternsRegular expression strings are built up by concatenating single-character regular expression forms, shown in Table 18-1. The longest-matching string is usually matched by each form, except for the non-greedy operators. In the table, R means any regular expression form, C is a character, and N denotes a digit.
Within patterns, ranges and selections can be combined. For instance, [a-zA-Z0-9_]+ matches the longest possible string of one or more letters, digits, or underscores. Special characters can be escaped as usual in Python strings: [\t ]* matches zero or more tabs and spaces (i.e., it skips whitespace). The parenthesized grouping construct, (R), lets you extract matched substrings after a successful match. The portion of the string matched by the expression in parentheses is retained in a numbered register. It's available through the group method of a match object after a successful match. In addition to the entries in this table, special sequences in Table 18-2 can be used in patterns, too. Due to Python string rules, you sometimes must double up on backslashes (\\) or use Python raw strings (r'...') to retain backslashes in the pattern.
The Python library manual gives additional details. But to demonstrate how the re interfaces are typically used, we'll turn to some short examples. 18.4.2 Basic PatternsTo illustrate how to combine RE operators, let's start with a few short test files that match simple pattern forms. Comments in Example 18-3 describe the operations exercised; check Table 18-1 to see which operators are used in these patterns. Example 18-3. PP2E\lang\re-basics.py# literals, sets, ranges (all print 2 = offset where pattern found) import re # the one to use today pattern, string = "A.C.", "xxABCDxx" # nonspecial chars match themself matchobj = re.search(pattern, string) # '.' means any one char if matchobj: # search returns match object or None print matchobj.start( ) # start is index where matched pattobj = re.compile("A.*C.*") # 'R*' means zero or more Rs matchobj = pattobj.search("xxABCDxx") # compile returns pattern obj if matchobj: # patt.search returns match obj print matchobj.start( ) # selection sets print re.search(" *A.C[DE][D-F][^G-ZE]G\t+ ?", "..ABCDEFG\t..").start( ) # alternatives print re.search("A|XB|YC|ZD", "..AYCD..").start( ) # R1|R2 means R1 or R2 # word boundaries print re.search(r"\bABCD", "..ABCD ").start( ) # \b means word boundary print re.search(r"ABCD\b", "..ABCD ").start( ) # use r'...' to escape '\' Notice that there are a variety of ways to kick off a match with re -- by calling module search functions and by making compiled pattern objects. In either event, you can hang on to the resulting match object or not. All the print statements in this script show a result of 2 -- the offset where the pattern was found in the string. In the first test, for example, "A.C." matches the "ABCD" at offset 2 in the search string (i.e., after the first "xx"): C:\...\PP2E\Lang>python re-basic.py 2 2 2 2 2 2 In Example 18-4, parts of the pattern strings enclosed in parentheses delimit groups ; the parts of the string they matched are available after the match. Example 18-4. PP2E\lang\re-groups.py# groups (extract substrings matched by REs in '( )' parts) import re patt = re.compile("A(.)B(.)C(.)") # saves 3 substrings mobj = patt.match("A0B1C2") # each '( )' is a group, 1..n print mobj.group(1), mobj.group(2), mobj.group(3) # group( ) gives substring patt = re.compile("A(.*)B(.*)C(.*)") # saves 3 substrings mobj = patt.match("A000B111C222") # groups( ) gives all groups print mobj.groups( ) print re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups( ) patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)") mobj = patt.search(" # define spam 1 + 2 + 3") # parts of C #define print mobj.groups( ) # \s is whitespace In the first test here, for instance, the three (.) groups each match a single character, but retain the character matched; calling group pulls out the bits matched. The second test's (.*) groups match and retain any number of characters. The last test here matches C #define lines; more on this later. C:\...\PP2E\Lang>python re-groups.py 0 1 2 ('000', '111', '222') ('A', 'Y', 'C') ('spam', '1 + 2 + 3') Finally, besides matches and substring extraction, re also includes tools for string replacement or substitution (see Example 18-5). Example 18-5. PP2E\lang\re-subst.py# substitutions (replace occurrences of patt with repl in string) import re print re.sub('[ABC]', '*', 'XAXAXBXBXCXC') print re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_') In the first test, all characters in the set are replaced; in the second, they must be followed by an underscore: C:\...\PP2E\Lang>python re-subst.py X*X*X*X*X*X* XA-X*XB-X*XC-X* 18.4.3 Scanning C Header Files for PatternsThe script in Example 18-6 puts these pattern operators to more practical use. It uses regular expressions to find #define and #include lines in C header files and extract their components. The generality of the patterns makes them detect a variety of line formats; pattern groups (the parts in parentheses) are used to extract matched substrings from a line after a match. Example 18-6. PP2E\Lang\cheader.py#! /usr/local/bin/python import sys, re from string import strip pattDefine = re.compile( # compile to pattobj '^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)') # "# define xxx yyy..." pattInclude = re.compile( '^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)') # "# include <xxx>..." def scan(file): count = 0 while 1: # scan line-by-line line = file.readline( ) if not line: break count = count + 1 matchobj = pattDefine.match(line) # None if match fails if matchobj: name = matchobj.group(1) # substrings for (...) parts body = matchobj.group(2) print count, 'defined', name, '=', strip(body) continue matchobj = pattInclude.match(line) if matchobj: start, stop = matchobj.span(1) # start/stop indexes of (...) filename = line[start:stop] # slice out of line print count, 'include', filename # same as matchobj.group(1) if len(sys.argv) == 1: scan(sys.stdin) # no args: read stdin else: scan(open(sys.argv[1], 'r')) # arg: input file name To test, let's run this script on the text file in Example 18-7. Example 18-7. PP2E\Lang\test.h#ifndef TEST_H #define TEST_H #include <stdio.h> #include <lib/spam.h> # include "Python.h" #define DEBUG #define HELLO 'hello regex world' # define SPAM 1234 #define EGGS sunny + side + up #define ADDER(arg) 123 + arg #endif Notice the spaces after # in some of these lines; regular expressions are flexible enough to account for such departures from the norm. Here is the script at work, picking out #include and #define lines and their parts; for each matched line, it prints the line number, the line type, and any matched substrings: C:\...\PP2E\Lang>python cheader.py test.h 2 defined TEST_H = 4 include stdio.h 5 include lib/spam.h 6 include Python.h 8 defined DEBUG = 9 defined HELLO = 'hello regex world' 10 defined SPAM = 1234 12 defined EGGS = sunny + side + up 13 defined ADDER = (arg) 123 + arg 18.4.4 A File Pattern Search UtilityThe next script searches for patterns in a set of files, much like the grep command-line program. We wrote file and directory searchers earlier, in Chapter 5. Here, the file searches look for patterns instead of simple strings (see Example 18-8). The patterns are typed interactively separated by a space, and the files to be searched are specified by an input pattern for Python's glob.glob filename expansion tool we studied earlier, too. Example 18-8. PP2E\Lang\pygrep1.py#!/usr/local/bin/python import sys, re, glob from string import split help_string = """ Usage options. interactive: % pygrep1.py """ def getargs( ): if len(sys.argv) == 1: return split(raw_input("patterns? >")), raw_input("files? >") else: try: return sys.argv[1], sys.argv[2] except: print help_string sys.exit(1) def compile_patterns(patterns): res = [] for pattstr in patterns: try: res.append(re.compile(pattstr)) # make re patt object except: # or use re.match print 'pattern ignored:', pattstr return res def searcher(pattfile, srchfiles): patts = compile_patterns(pattfile) # compile for speed for file in glob.glob(srchfiles): # all matching files lineno = 1 # glob uses re too print '\n[%s]' % file for line in open(file, 'r').readlines( ): # all lines in file for patt in patts: if patt.search(line): # try all patterns print '%04d)' % lineno, line, # match if not None break lineno = lineno+1 if __name__ == '__main__': apply(searcher, getargs( )) Here's what a typical run of this script looks like; it searches all Python files in the current directory for two different patterns, compiled for speed. Notice that files are named by a pattern, too -- Python's glob module also uses reinternally: C:\...\PP2E\Lang>python pygrep1.py patterns? >import.*string spam files? >*.py [cheader.py] [finder2.py] 0002) import string, glob, os, sys [patterns.py] 0048) mobj = patt.search(" # define spam 1 + 2 + 3") [pygrep1.py] [rules.py] [summer.py] 0002) import string [__init__.py] |
I l@ve RuBoard |