5.2 Fixing DOS Line Ends

When I wrote the first edition of this book, I shipped two copies of every example file on the CD-ROM (view CD-ROM content online at http://examples.oreilly.com/python2) -- one with Unix line-end markers, and one with DOS markers. The idea was that this would make it easy to view and edit the files on either platform. Readers would simply copy the examples directory tree designed for their platform onto their hard drive, and ignore the other one.

If you read Chapter 2, you know the issue here: DOS (and by proxy, Windows) marks line ends in text files with the two characters \r\n (carriage-return, line-feed), but Unix uses just a single \n. Most modern text editors don't care -- they happily display text files encoded in either format. Some tools are less forgiving, though. I still occasionally see odd \r characters when viewing DOS files on Unix, or an entire file in a single line when looking at Unix files on DOS (the Notepad accessory does this on Windows, for example).

Because this is only an occasional annoyance, and because it's easy to forget to keep two distinct example trees in sync, I adopted a different policy for this second edition: we're shipping a single copy of the examples (in DOS format), along with a portable converter tool for changing to and from other line-end formats.

The main obstacle, of course, is how to go about providing a portable and easy to use converter -- one that runs "out of the box" on almost every computer, without changes or recompiles. Some Unix platforms have commands like fromdos and dos2unix, but they are not universally available even on Unix. DOS batch files and csh scripts could do the job on Windows and Unix, respectively, but neither solution works on both platforms. Fortunately, Python does.
The scripts presented in Example 5-1, Example 5-3, and Example 5-4 convert end-of-line markers between DOS and Unix formats; they convert a single file, a directory of files, and a directory tree of files. In this section, we briefly look at each of the three scripts, and contrast some of the system tools they apply. Each reuses the prior's code, and becomes progressively more powerful in the process.

The last of these three scripts, Example 5-4, is the portable converter tool I was looking for; it converts line ends in the entire examples tree, in a single step. Because it is pure Python, it also works on both DOS and Unix unchanged; as long as Python is installed, it is the only line converter you may ever need to remember.

5.2.1 Converting Line Ends in One File

These three scripts were developed in stages on purpose, so I could first focus on getting line-feed conversions right, before worrying about directories and tree walking logic. With that scheme in mind, Example 5-1 addresses just the task of converting lines in a single text file.

Example 5-1. PP2E\PyTools\fixeoln_one.py

    ###################################################################
    # Use: "python fixeoln_one.py [tounix|todos] filename".
    # Convert end-of-lines in the single text file whose name is passed
    # in on the command line, to the target format (tounix or todos).
    # The _one, _dir, and _all converters reuse the convert function
    # here. convertEndlines changes end-lines only if necessary:
    # lines that are already in the target format are left unchanged,
    # so it's okay to convert a file > once with any of the 3 fixeoln
    # scripts. Notes: must use binary file open modes for this to
    # work on Windows, else default text mode automatically deletes
    # the \r on reads, and adds an extra \r for each \n on writes;
    # Mac format not supported; PyTools\dumpfile.py shows raw bytes;
    ###################################################################

    import os
    listonly = 0   # 1=show file to be changed, don't rewrite

    def convertEndlines(format, fname):                  # convert one file
        if not os.path.isfile(fname):                    # todos:  \n   => \r\n
            print 'Not a text file', fname               # tounix: \r\n => \n
            return                                       # skip directory names

        newlines = []
        changed = 0
        for line in open(fname, 'rb').readlines():       # use binary i/o modes
            if format == 'todos':                        # else \r lost on Win
                if line[-1:] == '\n' and line[-2:-1] != '\r':
                    line = line[:-1] + '\r\n'
                    changed = 1
            elif format == 'tounix':                     # avoids IndexError
                if line[-2:] == '\r\n':                  # slices are scaled
                    line = line[:-2] + '\n'
                    changed = 1
            newlines.append(line)

        if changed:
            try:                                         # might be read-only
                print 'Changing', fname
                if not listonly: open(fname, 'wb').writelines(newlines)
            except IOError, why:
                print 'Error writing to file %s: skipped (%s)' % (fname, why)

    if __name__ == '__main__':
        import sys
        errmsg = 'Required arguments missing: ["todos"|"tounix"] filename'
        assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
        convertEndlines(sys.argv[1], sys.argv[2])
        print 'Converted', sys.argv[2]

This script is fairly straightforward as system utilities go; it relies primarily on the built-in file object's methods.
Given a target format flag and filename, it loads the file into a lines list using the readlines method, converts input lines to the target format if needed, and writes the result back to the file with the writelines method if any lines were changed:

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py tounix PyDemos.pyw
    Changing PyDemos.pyw
    Converted PyDemos.pyw

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
    Changing PyDemos.pyw
    Converted PyDemos.pyw

    C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
    Comparing files PyDemos.pyw and C:\PP2ndEd\examples\PP2E\PyDemos.pyw
    FC: no differences encountered

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
    Converted PyDemos.pyw

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py toother nonesuch.txt
    Traceback (innermost last):
      File "C:\PP2ndEd\examples\PP2E\PyTools\fixeoln_one.py", line 45, in ?
        assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
    AssertionError: Required arguments missing: ["todos"|"tounix"] filename

Here, the first command converts the file to Unix line-end format (tounix), and the second and fourth convert to the DOS convention -- all regardless of the platform on which this script is run. To make typical usage easier, converted text is written back to the file in place, instead of to a newly created output file.

Notice that this script's filename has a "_" in it, not a "-"; because it is meant to be both run as a script and imported as a library, its filename must translate to a legal Python variable name in importers (fixeoln-one.py won't work for both roles).
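For readers working in a current Python release, the same per-line logic can be sketched in Python 3, where files opened in binary mode yield bytes objects rather than strings. This is an illustrative port under that assumption, not the code shipped with the book; the names convert_line and convert_endlines are hypothetical.

```python
import os

def convert_line(line, fmt):
    """Return one line (a bytes object) converted to the target format."""
    if fmt == 'todos':
        # add a \r only when the line ends in a bare \n
        if line.endswith(b'\n') and not line.endswith(b'\r\n'):
            return line[:-1] + b'\r\n'
    elif fmt == 'tounix':
        # drop the \r only when the line ends in a full \r\n
        if line.endswith(b'\r\n'):
            return line[:-2] + b'\n'
    return line                      # already in the target format

def convert_endlines(fmt, fname):
    """Rewrite fname in place if any line changes; return True if rewritten."""
    if not os.path.isfile(fname):    # skip directory names
        return False
    with open(fname, 'rb') as f:     # binary mode: no newline translation
        old = f.readlines()
    new = [convert_line(line, fmt) for line in old]
    if new == old:                   # nothing to do: leave the file untouched
        return False
    with open(fname, 'wb') as f:
        f.writelines(new)
    return True
```

As in the original script, a second run in the same mode finds nothing to change and leaves the file alone.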
5.2.1.1 Slinging bytes and verifying results

The fc DOS file-compare command in the preceding interaction confirms the conversions, but to better verify the results of this Python script, I wrote another, shown in Example 5-2.

Example 5-2. PP2E\PyTools\dumpfile.py

    import sys
    bytes = open(sys.argv[1], 'rb').read()
    print '-'*40
    print repr(bytes)

    print '-'*40
    while bytes:
        bytes, chunk = bytes[4:], bytes[:4]          # show 4-bytes per line
        for c in chunk: print oct(ord(c)), '\t',     # show octal of binary value
        print

    print '-'*40
    for line in open(sys.argv[1], 'rb').readlines():
        print repr(line)

To give a clear picture of a file's contents, this script opens a file in binary mode (to suppress automatic line-feed conversions), prints its raw contents (bytes) all at once, displays the octal numeric ASCII codes of its contents four bytes per line, and shows its raw lines. Let's use this to trace conversions. First of all, use a simple text file to make wading through bytes a bit more humane:

    C:\temp>type test.txt
    a
    b
    c

    C:\temp>python %X%\PyTools\dumpfile.py test.txt
    ----------------------------------------
    'a\015\012b\015\012c\015\012'
    ----------------------------------------
    0141    015     012     0142
    015     012     0143    015
    012
    ----------------------------------------
    'a\015\012'
    'b\015\012'
    'c\015\012'

The test.txt file here is in DOS line-end format -- the escape sequence \015\012 displayed by the dumpfile script is simply the DOS \r\n line-end marker in octal character-code escapes format.
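The four-bytes-per-row octal display can also be sketched as a small function in Python 3. This is only an illustration: the name octal_rows is hypothetical, and because iterating over a bytes object in Python 3 yields integers, no ord call is needed. The octal digits are formatted without the leading zero that the older oct call prints.

```python
def octal_rows(data, width=4):
    """Return the bytes in data as rows of 3-digit octal strings."""
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]            # slice is clipped at the end
        rows.append(['%03o' % byte for byte in chunk])
    return rows

# Show a DOS-format sample the same way dumpfile.py lays out its rows
for row in octal_rows(b'a\r\nb\r\n'):
    print('\t'.join(row))
```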
Now, converting to Unix format changes all the DOS \r\n markers to a single \n (\012) as advertised:

    C:\temp>python %X%\PyTools\fixeoln_one.py tounix test.txt
    Changing test.txt
    Converted test.txt

    C:\temp>python %X%\PyTools\dumpfile.py test.txt
    ----------------------------------------
    'a\012b\012c\012'
    ----------------------------------------
    0141    012     0142    012
    0143    012
    ----------------------------------------
    'a\012'
    'b\012'
    'c\012'

And converting back to DOS restores the original file format:

    C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt
    Changing test.txt
    Converted test.txt

    C:\temp>python %X%\PyTools\dumpfile.py test.txt
    ----------------------------------------
    'a\015\012b\015\012c\015\012'
    ----------------------------------------
    0141    015     012     0142
    015     012     0143    015
    012
    ----------------------------------------
    'a\015\012'
    'b\015\012'
    'c\015\012'

    C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt     # makes no changes
    Converted test.txt

5.2.1.2 Nonintrusive conversions

Notice that no "Changing" message is emitted for the last command just run, because no changes were actually made to the file (it was already in DOS format). Because this program is smart enough to avoid converting a line that is already in the target format, it is safe to rerun on a file even if you can't recall what format the file already uses. More naive conversion logic might be simpler, but may not be repeatable.
For instance, a string.replace call can be used to expand a Unix \n to a DOS \r\n (\015\012), but only once:

    >>> import string
    >>> lines = 'aaa\nbbb\nccc\n'
    >>> lines = string.replace(lines, '\n', '\r\n')      # okay: \r added
    >>> lines
    'aaa\015\012bbb\015\012ccc\015\012'
    >>> lines = string.replace(lines, '\n', '\r\n')      # bad: double \r
    >>> lines
    'aaa\015\015\012bbb\015\015\012ccc\015\015\012'

Such logic could easily trash a file if applied to it twice.[1] To really understand how the script gets around this problem, though, we need to take a closer look at its use of slices and binary file modes.
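As an aside, a whole-string replacement can be made repeatable too, by collapsing any existing \r\n markers before expanding. The sketch below is written for Python 3 bytes and is not the approach the book's script takes; to_dos and to_unix are hypothetical names, and unlike the line-by-line script this version makes no attempt to report whether anything changed.

```python
def to_unix(data):
    """Collapse every \\r\\n pair in a bytes object to a single \\n."""
    return data.replace(b'\r\n', b'\n')

def to_dos(data):
    """Expand line ends to \\r\\n; safe to apply more than once."""
    # Normalize first, so a \n that already follows a \r is not expanded again
    return to_unix(data).replace(b'\n', b'\r\n')

once = to_dos(b'aaa\nbbb\nccc\n')
twice = to_dos(once)
print(once == twice)      # a second pass leaves the data unchanged
```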
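5.2.1.3 Slicing strings out-of-bounds

This script relies on subtle aspects of string slicing behavior to inspect parts of each line without size checks. For instance, line[-2:] returns the last two characters of the line (or fewer, if the line is shorter than two characters), line[-2:-1] returns the second-to-last character (or an empty string, if there is none), and line[:-2] returns everything except the last two characters (or an empty string, if the line has two characters or fewer).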
Because out-of-bounds slices scale slice limits to be in-bounds, the script doesn't need to add explicit tests to guarantee that the line is big enough to have end-line characters at the end. For example:

    >>> 'aaaXY'[-2:], 'XY'[-2:], 'Y'[-2:], ''[-2:]
    ('XY', 'XY', 'Y', '')

    >>> 'aaaXY'[-2:-1], 'XY'[-2:-1], 'Y'[-2:-1], ''[-2:-1]
    ('X', 'X', '', '')

    >>> 'aaaXY'[:-2], 'aaaY'[:-1], 'XY'[:-2], 'Y'[:-1]
    ('aaa', 'aaa', '', '')

If you imagine characters like \r and \n instead of the X and Y here, you'll understand how the script exploits slice scaling to good effect.

5.2.1.4 Binary file mode revisited

Because this script aims to be portable to Windows, it also takes care to open files in binary mode, even though they contain text data. As we've seen, when files are opened in text mode on Windows, \r is stripped from \r\n markers on input, and \r is added before \n markers on output. This automatic conversion allows scripts to represent the end-of-line marker as \n on all platforms. Here, though, it would also mean that the script would never see the \r it's looking for to detect a DOS-encoded line -- the \r would be dropped before it ever reached the script:

    >>> open('temp.txt', 'w').writelines(['aaa\n', 'bbb\n'])
    >>> open('temp.txt', 'rb').read()
    'aaa\015\012bbb\015\012'
    >>> open('temp.txt', 'r').read()
    'aaa\012bbb\012'

Without binary open mode, this can lead to fairly subtle and incorrect behavior on Windows. For example, if files are opened in text mode, converting in "todos" mode on Windows would actually produce double \r characters: the script might convert the stripped \n to \r\n, which is then expanded on output to \r\r\n!

    >>> open('temp.txt', 'w').writelines(['aaa\r\n', 'bbb\r\n'])
    >>> open('temp.txt', 'rb').read()
    'aaa\015\015\012bbb\015\015\012'

With binary mode, the script inputs a full \r\n, so no conversion is performed.
Binary mode is also required for output on Windows, to suppress the insertion of \r characters; without it, the "tounix" conversion would fail on that platform.[2]
If all that is too subtle to bear, just remember to use the "b" in file open mode strings if your scripts might be run on Windows, and you mean to process either true binary data or text data as it is actually stored in the file.
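The same advice carries over to Python 3, where the default text mode translates \r\n to \n on reads on every platform, not just Windows. The sketch below assumes Python 3 and uses a hypothetical helper, read_both_ways, to show that only a binary-mode read sees the \r characters actually stored in the file.

```python
import os
import tempfile

def read_both_ways(raw):
    """Store raw bytes in a temp file; return (binary read, text read)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:               # write bytes untranslated
            f.write(raw)
        with open(path, 'rb') as f:                  # binary: bytes as stored
            as_bytes = f.read()
        with open(path, 'r', encoding='ascii') as f: # text: newlines translated
            as_text = f.read()
    finally:
        os.remove(path)
    return as_bytes, as_text

as_bytes, as_text = read_both_ways(b'aaa\r\nbbb\r\n')
print(repr(as_bytes))     # the \r characters are still present
print(repr(as_text))      # the \r characters were dropped by text mode
```

In Python 3, passing newline='' to open is another way to turn off this translation while still reading text rather than bytes.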
5.2.2 Converting Line Ends in One Directory

Armed with a fully debugged single file converter, it's an easy step to add support for converting all files in a single directory. Simply call the single file converter on every filename returned by a directory listing tool. The script in Example 5-3 uses the glob module we met in Chapter 2 to grab a list of files to convert.

Example 5-3. PP2E\PyTools\fixeoln_dir.py

    #########################################################
    # Use: "python fixeoln_dir.py [tounix|todos] patterns?".
    # convert end-lines in all the text files in the current
    # directory (only: does not recurse to subdirectories).
    # Reuses converter in the single-file _one version.
    #########################################################

    import sys, glob
    from fixeoln_one import convertEndlines
    listonly = 0
    patts = ['*.py', '*.pyw', '*.txt', '*.cgi', '*.html',    # text file names
             '*.c', '*.cxx', '*.h', '*.i', '*.out',          # in this package
             'README*', 'makefile*', 'output*', '*.note']

    if __name__ == '__main__':
        errmsg = 'Required first argument missing: "todos" or "tounix"'
        assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg

        if len(sys.argv) > 2:                 # glob anyhow: '*' not applied on dos
            patts = sys.argv[2:]              # though not really needed on linux
        filelists = map(glob.glob, patts)     # name matches in this dir only

        count = 0
        for list in filelists:
            for fname in list:
                if listonly:
                    print count+1, '=>', fname
                else:
                    convertEndlines(sys.argv[1], fname)
                count = count + 1

        print 'Visited %d files' % count

This module defines a list, patts, containing filename patterns that match all the kinds of text files that appear in the book examples tree; each pattern is passed to the built-in glob.glob call by map, to be separately expanded into a list of matching files. That's why there are nested for loops near the end -- the outer loop steps through each glob result list, and the inner steps through each name within each list.
Try the map call interactively if this doesn't make sense:

    >>> import glob
    >>> map(glob.glob, ['*.py', '*.html'])
    [['helloshell.py'], ['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']]

This script requires a convert mode flag on the command line, and assumes that it is run in the directory where files to be converted live; cd to the directory to be converted before running this script (or change it to accept a directory name argument too):

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix
    Changing Launcher.py
    Changing Launch_PyGadgets.py
    Changing LaunchBrowser.py
    ...lines deleted...
    Changing PyDemos.pyw
    Changing PyGadgets_bar.pyw
    Changing README-PP2E.txt
    Visited 21 files

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos
    Changing Launcher.py
    Changing Launch_PyGadgets.py
    Changing LaunchBrowser.py
    ...lines deleted...
    Changing PyDemos.pyw
    Changing PyGadgets_bar.pyw
    Changing README-PP2E.txt
    Visited 21 files

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos     # makes no changes
    Visited 21 files

    C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
    Comparing files PyDemos.pyw and C:\PP2ndEd\examples\PP2E\PyDemos.pyw
    FC: no differences encountered

Notice that the third command generated no "Changing" messages again. Because the convertEndlines function of the single-file module is reused here to perform the actual updates, this script inherits that function's repeatability: it's okay to rerun this script on the same directory any number of times. Only lines that require conversion will be converted.
This script also accepts an optional list of filename patterns on the command line, to override the default patts list of files to be changed:

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
    Changing echoEnvironment.pyw
    Changing Launch_PyDemos.pyw
    Changing Launch_PyGadgets_bar.pyw
    Changing PyDemos.pyw
    Changing PyGadgets_bar.pyw
    Changing cleanall.csh
    Changing makeall.csh
    Changing package.csh
    Changing setup-pp.csh
    Changing setup-pp-embed.csh
    Changing xferall.linux.csh
    Visited 11 files

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
    Visited 11 files

Also notice that the single-file script's convertEndlines function performs an initial os.path.isfile test to make sure the passed-in filename represents a file, not a directory; when we start globbing with patterns to collect files to convert, it's not impossible that a pattern's expansion might include the name of a directory along with the desired files.
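The directory-name argument suggested earlier in this section, together with the isfile check, might be sketched as follows in Python 3. The function name match_files and its dirname parameter are assumptions made for illustration; they are not part of the book's scripts. Unlike the map call, this version returns one flat list, so a single loop suffices.

```python
import glob
import os

def match_files(patterns, dirname='.'):
    """Return a sorted, flat list of regular files in dirname that match
    any of the patterns; directories and duplicate names are skipped."""
    names = []
    for patt in patterns:
        for path in glob.glob(os.path.join(dirname, patt)):
            if os.path.isfile(path) and path not in names:
                names.append(path)
    return sorted(names)

# List matches in the current directory without changing anything
for fname in match_files(['*.py', '*.txt']):
    print(fname)
```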
5.2.3 Converting Line Ends in an Entire Tree

Finally, Example 5-4 applies what we've already learned to an entire directory tree. It simply runs the file-converter function on every filename produced by tree-walking logic. In fact, this script really just orchestrates calls to the original and already debugged convertEndlines function.

Example 5-4. PP2E\PyTools\fixeoln_all.py

    #########################################################
    # Use: "python fixeoln_all.py [tounix|todos] patterns?".
    # find and convert end-of-lines in all text files at and
    # below the directory where this script is run (the dir
    # you are in when you type 'python'). If needed, tries to
    # use the Python find.py library module, else reads the
    # output of a unix-style find command; uses a default
    # filename patterns list if patterns argument is absent.
    # This script only changes files that need to be changed,
    # so it's safe to run brute-force from a root-level dir.
    #########################################################

    import os, sys, string
    debug    = 0
    pyfind   = 0      # force py find
    listonly = 0      # 1=show find results only

    def findFiles(patts, debug=debug, pyfind=pyfind):
        try:
            if sys.platform[:3] == 'win' or pyfind:
                print 'Using Python find'
                try:
                    import find                        # use python-code find.py
                except ImportError:                    # use mine if deprecated!
                    from PP2E.PyTools import find      # may get from my dir anyhow
                matches = map(find.find, patts)        # startdir default = '.'
            else:
                print 'Using find executable'
                matches = []
                for patt in patts:
                    findcmd = 'find . -name "%s" -print' % patt  # run find command
                    lines = os.popen(findcmd).readlines()        # remove endlines
                    matches.append(map(string.strip, lines))     # lambda x: x[:-1]
        except:
            assert 0, 'Sorry - cannot find files'
        if debug: print matches
        return matches

    if __name__ == '__main__':
        from fixeoln_dir import patts
        from fixeoln_one import convertEndlines

        errmsg = 'Required first argument missing: "todos" or "tounix"'
        assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg

        if len(sys.argv) > 2:                  # quote in unix shell
            patts = sys.argv[2:]               # else tries to expand
        matches = findFiles(patts)

        count = 0
        for matchlist in matches:              # a list of lists
            for fname in matchlist:            # one per pattern
                if listonly:
                    print count+1, '=>', fname
                else:
                    convertEndlines(sys.argv[1], fname)
                count = count + 1
        print 'Visited %d files' % count

On Windows, the script uses the portable find.find built-in tool we met in Chapter 2 (either Python's or the hand-rolled equivalent)[3] to generate a list of all matching file and directory names in the tree; on other platforms, it resorts to spawning a less portable and probably slower find shell command just for illustration purposes.
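Because the find module is no longer part of the standard library in current Python releases, a reader today might collect the same names with os.walk and fnmatch instead. The following is a minimal Python 3 sketch under that assumption; find_files is a hypothetical name, and unlike find.find as described above it returns files only, not directory names.

```python
import fnmatch
import os

def find_files(pattern, startdir='.'):
    """Return sorted paths of files at and below startdir whose
    base names match the shell-style pattern."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# One result list per pattern, mirroring the list-of-lists shape used above
matches = [find_files(patt) for patt in ['*.py', '*.txt']]
```

This runs the same way on Windows and Unix, so no platform test or spawned shell command is needed.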
Once the file pathname lists are compiled, this script simply converts each found file in turn using the single-file converter module's tools. Here is the collection of scripts at work converting the book examples tree on Windows; notice that this script also processes the current working directory (CWD; cd to the directory to be converted before typing the command line), and that Python treats forward and backward slashes the same in the program filename:

    C:\temp\examples>python %X%/PyTools/fixeoln_all.py tounix
    Using Python find
    Changing .\LaunchBrowser.py
    Changing .\Launch_PyGadgets.py
    Changing .\Launcher.py
    Changing .\Other\cgimail.py
    ...lots of lines deleted...
    Changing .\EmbExt\Exports\ClassAndMod\output.prog1
    Changing .\EmbExt\Exports\output.prog1
    Changing .\EmbExt\Regist\output
    Visited 1051 files

    C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
    Using Python find
    Changing .\LaunchBrowser.py
    Changing .\Launch_PyGadgets.py
    Changing .\Launcher.py
    Changing .\Other\cgimail.py
    ...lots of lines deleted...
    Changing .\EmbExt\Exports\ClassAndMod\output.prog1
    Changing .\EmbExt\Exports\output.prog1
    Changing .\EmbExt\Regist\output
    Visited 1051 files

    C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
    Using Python find
    Not a text file .\Embed\Inventory\Output
    Not a text file .\Embed\Inventory\WithDbase\Output
    Visited 1051 files

The first two commands convert over 1000 files, and usually take some eight seconds of real-world time to finish on my 650 MHz Windows 98 machine; the third takes only six seconds, because no files have to be updated (and fewer messages have to be scrolled on the screen). Don't take these figures too seriously, though; they can vary by system load, and much of this time is probably spent scrolling the script's output to the screen.

5.2.3.1 The view from the top

This script and its ancestors are shipped on the book's CD, as that portable converter tool I was looking for.
To convert all examples files in the tree to Unix line-terminator format, simply copy the entire PP2E examples tree to some "examples" directory on your hard drive, and type these two commands in a shell:

    cd examples/PP2E
    python PyTools/fixeoln_all.py tounix

Of course, this assumes Python is already installed (see the CD's README file for details; see http://examples.oreilly.com/python2), but will work on almost every platform in use today.[4] To convert back to DOS, just replace "tounix" with "todos" and rerun. I ship this tool with a training CD for Python classes I teach too; to convert those files, we simply type:
    cd Html\Examples
    python ..\..\Tools\fixeoln_all.py tounix

Once you get accustomed to the command lines, you can use this in all sorts of contexts. Finally, to make the conversion easier for beginners to run, the top-level examples directory includes tounix.py and todos.py scripts that can be simply double-clicked in a file explorer GUI; Example 5-5 shows the "tounix" converter.

Example 5-5. PP2E\tounix.py

    #!/usr/local/bin/python
    ######################################################################
    # Run me to convert all text files to UNIX/Linux line-feed format.
    # You only need to do this if you see odd '\r' characters at the end
    # of lines in text files in this distribution, when they are viewed
    # with your text editor (e.g., vi). This script converts all files
    # at and below the examples root, and only converts files that have
    # not already been converted (it's okay to run this multiple times).
    #
    # Since this is a Python script which runs another Python script,
    # you must install Python first to run this program; then from your
    # system command-line (e.g., a xterm window), cd to the directory
    # where this script lives, and then type "python tounix.py". You
    # may also be able to simply click on this file's icon in your file
    # system explorer, if it knows what '.py' files are.
    ######################################################################

    import os
    prompt = """
    This program converts all text files in the book
    examples distribution to UNIX line-feed format.
    Are you sure you want to do this (y=yes)? """

    answer = raw_input(prompt)
    if answer not in ['y', 'Y', 'yes']:
        print 'Cancelled'
    else:
        os.system('python PyTools/fixeoln_all.py tounix')

This script addresses the end user's perception of usability, but other factors impact programmer usability -- just as important to systems that will be read or changed by others.
For example, the file, directory, and tree converters are coded in separate script files, but there is no law against combining them into a single program that relies on a command-line arguments pattern to know which of the three modes to run. The first argument could be a mode flag, tested by such a program:

    if mode == '-one':
        ...
    elif mode == '-dir':
        ...
    elif mode == '-all':
        ...

That seems more confusing than separate files per mode, though; it's usually much easier to botch a complex command line than to type a specific program file's name. It will also make for a confusing mix of global names, and one very big piece of code at the bottom of the file. As always, simpler is usually better.