5.4 Searching Directory Trees

Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree whenever I thought I'd stumbled onto a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things like program directory paths and module names were hardcoded all over the place -- in package import statements, program startup calls, text notes, configuration files, and more.

One way to repair these references, of course, is to edit every file in the directory by hand, searching each for information that has changed. That's so tedious as to be utterly impossible in this book's examples tree, though; as I wrote these words, the examples tree contained 118 directories and 1342 files! (To count for yourself, run the command line python PyTools/visitor.py 1 in the PP2E examples root directory.) Clearly, I needed a way to automate updates after changes.

5.4.1 Greps and Globs in Shells and Python

There is a standard way to search files for strings on Unix and Linux systems: the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.[7] Given that Unix shells expand (i.e., "glob") filename patterns automatically, a command such as grep popen *.py will search a single directory's Python files for the string "popen". Here's such a command in action on Windows (I installed a commercial Unix-like fgrep program on my Windows 98 laptop because I missed it too much there):

    C:\...\PP2E\System\Filetools>fgrep popen *.py
    diffall.py:# - we could also os.popen a diff (unix) or fc (dos)
    dirdiff.py:# - use os.popen('ls...') or glob.glob + os.path.split
    dirdiff6.py: files1 = os.popen('ls %s' % dir1).readlines( )
    dirdiff6.py: files2 = os.popen('ls %s' % dir2).readlines( )
    testdirdiff.py: expected = expected + os.popen(test % 'dirdiff').read( )
    testdirdiff.py: output = output + os.popen(test % script).read( )

DOS has a command for searching files too -- find, not to be confused with the Unix find directory walker command:

    C:\...\PP2E\System\Filetools>find /N "popen" testdirdiff.py

    ---------- testdirdiff.py
    [8] expected = expected + os.popen(test % 'dirdiff').read( )
    [15] output = output + os.popen(test % script).read( )

You can do the same within a Python script, by either running the previously mentioned shell command with os.system or os.popen, or combining the grep and glob built-in modules. We met the glob module in Chapter 2; it expands a filename pattern into a list of matching filename strings (much like a Unix shell). The standard library also includes a grep module, which acts like a Unix grep command: grep.grep prints lines containing a pattern string among a set of files. When used with glob, the effect is much like the fgrep command:

    >>> from grep import grep
    >>> from glob import glob
    >>> grep('popen', glob('*.py'))
    diffall.py: 16: # - we could also os.popen a diff (unix) or fc (dos)
    dirdiff.py: 12: # - use os.popen('ls...') or glob.glob + os.path.split
    dirdiff6.py: 19: files1 = os.popen('ls %s' % dir1).readlines( )
    dirdiff6.py: 20: files2 = os.popen('ls %s' % dir2).readlines( )
    testdirdiff.py: 8: expected = expected + os.popen(test % 'dirdiff')...
    testdirdiff.py: 15: output = output + os.popen(test % script).read( )

    >>> import glob, grep
    >>> grep.grep('system', glob.glob('*.py'))
    dirdiff.py: 16: # - on unix systems we could do something similar by
    regtest.py: 18: os.system('%s < %s > %s.out 2>&1' % (program, ...
    regtest.py: 23: os.system('%s < %s > %s.out 2>&1' % (program, ...
    regtest.py: 24: os.system('diff %s.out %s.out.bkp > %s.diffs' ...

The grep module is written in pure Python code (no shell commands are run), is completely portable, and accepts both simple strings and general regular expression patterns as the search key (regular expressions appear later in this text). Unfortunately, it is also limited in two major ways:
On Unix systems, we can work around the second of these limitations by running a grep shell command from within a find shell command. For instance, the following Unix command line:

    find . -name "*.py" -print -exec fgrep popen {} \;

would pinpoint lines and files at and below the current directory that mention "popen". If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories.

5.4.1.1 Cleaning up bytecode files

I used to run the script in Example 5-8 on some of my machines to remove all .pyc bytecode files in the examples tree before packaging or upgrading Pythons (old binary bytecode files may not be forward-compatible with newer Python releases).

Example 5-8. PP2E\PyTools\cleanpyc.py

    ###########################################################
    # find and delete all "*.pyc" bytecode files at and below
    # the directory where this script is run; this assumes a
    # Unix-like find command, and so is very non-portable; we
    # could instead use the Python find module, or just walk
    # the directory trees with portable Python code; the find
    # -exec option can apply a Python script to each file too;
    ###########################################################

    import os, sys

    if sys.platform[:3] == 'win':
        findcmd = r'c:\stuff\bin.mks\find . -name "*.pyc" -print'
    else:
        findcmd = 'find . -name "*.pyc" -print'
    print findcmd

    count = 0
    for file in os.popen(findcmd).readlines( ):     # for all file names
        count = count + 1                           # have \n at the end
        print str(file[:-1])
        os.remove(file[:-1])

    print 'Removed %d .pyc files' % count

This script uses os.popen to collect the output of a commercial package's find program installed on one of my Windows computers, or else the standard find tool on the Linux side. It's also completely nonportable to Windows machines that don't have the commercial find program installed, and that includes other computers in my house, and most of the world at large.
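The header comment in Example 5-8 mentions that the tree could instead be walked with portable Python code. What follows is only a minimal sketch of that idea, not the book's script: it assumes a present-day Python 3 interpreter and uses the standard os.walk and fnmatch modules, and the function names find_files and clean_bytecode and the dry_run flag are hypothetical names chosen for illustration.

```python
import fnmatch
import os


def find_files(pattern, startdir="."):
    """Yield paths at and below startdir whose base name matches pattern."""
    for dirpath, dirnames, filenames in os.walk(startdir):
        # fnmatch.filter applies shell-style wildcards to the base names
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def clean_bytecode(startdir=".", dry_run=True):
    """List *.pyc files under startdir; delete them unless dry_run is true.

    Returns the list of paths that were found (hypothetical helper).
    """
    found = []
    for path in find_files("*.pyc", startdir):
        found.append(path)
        if not dry_run:
            os.remove(path)
    return found
```

Calling clean_bytecode('.') only reports what would be removed; passing dry_run=False actually deletes the files. Defaulting to a dry run is a design choice made here because deletion cannot be undone.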
Python scripts can reuse underlying shell tools with os.popen, but by so doing they lose much of the portability advantage of the Python language. The Unix find command is not universally available, and it is a complex tool in its own right (in fact, too complex to cover in this book; see a Unix manpage for more details). As we saw in Chapter 2, spawning a shell command also incurs a performance hit, because it must start a new independent program on your computer.

To avoid some of the portability and performance costs of spawning an underlying find command, I eventually recoded this script to use the find utilities we met and wrote in Chapter 2. The new script is shown in Example 5-9.

Example 5-9. PP2E\PyTools\cleanpyc-py.py

    ###########################################################
    # find and delete all "*.pyc" bytecode files at and below
    # the directory where this script is run; this uses a
    # Python find call, and so is portable to most machines;
    # run this to delete .pyc's from an old Python release;
    # cd to the directory you want to clean before running;
    ###########################################################

    import os, sys, find          # here, gets PyTools find

    count = 0
    for file in find.find("*.pyc"):     # for all file names
        count = count + 1
        print file
        os.remove(file)

    print 'Removed %d .pyc files' % count

This works portably, and avoids external program startup costs. But find is really just a tree-searcher that doesn't let you hook into the tree search -- if you need to do something unique while traversing a directory tree, you may be better off using a more manual approach. Moreover, find must collect all names before it returns; in very large directory trees, this may introduce significant performance and memory penalties. It's not an issue for my trees, but your trees may vary.

5.4.2 A Python Tree Searcher

To help ease the task of performing global searches on all platforms I might ever use, I coded a Python script to do most of the work for me.
Example 5-10 employs standard Python tools we met in the preceding chapters:
Because it's pure Python code, though, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than using os.popen to spawn a find command that spawns many grep commands.

Example 5-10. PP2E\PyTools\search_all.py

    #########################################################
    # Use: "python ..\..\PyTools\search_all.py string".
    # search all files at and below current directory
    # for a string; uses the os.path.walk interface,
    # rather than doing a find to collect names first;
    #########################################################

    import os, sys, string
    listonly = 0
    skipexts = ['.gif', '.exe', '.pyc', '.o', '.a']        # ignore binary files

    def visitfile(fname, searchKey):                       # for each non-dir file
        global fcount, vcount                              # search for string
        print vcount+1, '=>', fname                        # skip protected files
        try:
            if not listonly:
                if os.path.splitext(fname)[1] in skipexts:
                    print 'Skipping', fname
                elif string.find(open(fname).read( ), searchKey) != -1:
                    raw_input('%s has %s' % (fname, searchKey))
                    fcount = fcount + 1
        except: pass
        vcount = vcount + 1

    def visitor(myData, directoryName, filesInDirectory):  # called for each dir
        for fname in filesInDirectory:                     # do non-dir files here
            fpath = os.path.join(directoryName, fname)     # fnames have no dirpath
            if not os.path.isdir(fpath):                   # myData is searchKey
                visitfile(fpath, myData)

    def searcher(startdir, searchkey):
        global fcount, vcount
        fcount = vcount = 0
        os.path.walk(startdir, visitor, searchkey)

    if __name__ == '__main__':
        searcher('.', sys.argv[1])
        print 'Found in %d files, visited %d' % (fcount, vcount)

This file also uses the sys.argv command-line list and the __name__ trick for running in two modes. When run standalone, the search key is passed on the command line; when imported, clients call this module's searcher function directly.
For example, to search (grep) for all appearances of directory name "Part2" in the examples tree (an old directory that really did go away!), run a command line like this in a DOS or Unix shell:

    C:\...\PP2E>python PyTools\search_all.py Part2
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    .\Launcher.py has Part2
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    7 => .\Launch_PyDemos.pyw
    8 => .\LaunchBrowser.out.txt
    .\LaunchBrowser.out.txt has Part2
    9 => .\LaunchBrowser.py
    .\LaunchBrowser.py has Part2
    ...
    ...more lines deleted
    ...
    1339 => .\old_Part2\Basics\unpack2b.py
    1340 => .\old_Part2\Basics\unpack3.py
    1341 => .\old_Part2\Basics\__init__.py
    Found in 74 files, visited 1341

The script lists each file it checks as it goes, tells you which files it is skipping (names that end in extensions listed in variable skipexts that imply binary data), and pauses for an Enter key press each time it announces a file containing the search string (bold lines). A solution based on find could not pause this way; although trivial in this example, find doesn't return until the entire tree traversal is finished. The search_all script works the same when imported instead of run, but there is no final statistics output line (fcount and vcount live in the module, and so would have to be imported to be inspected here):

    >>> from PP2E.PyTools.search_all import searcher
    >>> searcher('.', '-exec')            # find files with string '-exec'
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    7 => .\Launch_PyDemos.pyw
    8 => .\LaunchBrowser.out.txt
    9 => .\LaunchBrowser.py
    10 => .\Launch_PyGadgets_bar.pyw
    11 => .\makeall.csh
    12 => .\package.csh
    .\package.csh has -exec
    ...more lines deleted...
However launched, this script tracks down all references to a string in an entire directory tree -- a name of a changed book examples file, object, or directory, for instance.[9]
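The os.path.walk callback interface used by search_all.py was removed in Python 3, so readers on a current interpreter may find a rough equivalent useful. The sketch below is an assumption-laden adaptation rather than the book's code: it uses os.walk, replaces the module-level fcount and vcount counters with returned values, and yields results as it goes rather than printing and pausing. The names search_tree, count_matches, and SKIP_EXTS are hypothetical.

```python
import os

# same extensions as skipexts in search_all.py; treated as binary files
SKIP_EXTS = {".gif", ".exe", ".pyc", ".o", ".a"}


def search_tree(startdir, search_key, skip_exts=SKIP_EXTS):
    """Yield (path, found) for each non-directory file at and below startdir.

    Results are produced one at a time, so a caller can react to each
    match (for instance, by pausing) before the traversal has finished.
    """
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.splitext(name)[1] in skip_exts:
                yield path, False              # skipped: assumed binary
                continue
            try:
                # errors="replace" avoids decode failures on odd bytes
                with open(path, errors="replace") as f:
                    found = search_key in f.read()
            except OSError:                    # unreadable or protected file
                found = False
            yield path, found


def count_matches(startdir, search_key):
    """Return (files containing search_key, files visited)."""
    visited = matched = 0
    for path, found in search_tree(startdir, search_key):
        visited += 1
        matched += found                       # True counts as 1
    return matched, visited
```

Because the counts are returned rather than kept in module globals, an importing client gets the final statistics directly, for example with matched, visited = count_matches('.', 'Part2').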