11.3. Berkeley DB Interfacing

Python comes with the bsddb package, which wraps the Berkeley Database (also known as BSD DB) library if that library is installed on your system and your Python installation is built to support it. With the BSD DB library, you can create hash, binary-tree, or record-based files that generally behave like persistent dictionaries. On Windows, Python includes a port of the BSD DB library, thus ensuring that module bsddb is always usable. To download BSD DB sources, binaries for other platforms, and detailed documentation on BSD DB itself, see http://www.sleepycat.com.

11.3.1. Simplified and Complete BSD DB Python Interfaces

Module bsddb itself provides a simplified, backward-compatible interface to a subset of BSD DB's functionality, as covered by the Python online documentation at http://www.python.org/doc/2.4/lib/module-bsddb.html. However, the standard Python library also comes with many modules in package bsddb, starting with bsddb.db. This set of modules closely mimics BSD DB's current rich, complex functionality and interfaces, and is documented at http://pybsddb.sourceforge.net/bsddb3.html. At this URL, you'll see the package documented under the slightly different name bsddb3, which is the name of a package you can separately download and install even on very old versions of Python. However, to use the version of this package that comes as part of the Python standard library, you need to import modules named bsddb.db and the like, not bsddb3.db and the like. Apart from this naming detail, the SourceForge documentation fully applies to the modules in package bsddb in the Python standard library (db, dbshelve, dbtables, dbutil, dbobj, dbrecio).

Entire books can be (and have been) written about the full interface to BSD DB and its functionality, so I do not cover this rich, complete, and complex interface in this book. (If you need to exploit BSD DB's complete functionality, I suggest, in addition to studying the URLs mentioned above, the book Berkeley DB, by Sleepycat Software [New Riders].) However, in Python you can also access a small but important subset of BSD DB's functionality in a much simpler way, through the simplified interface provided by module bsddb and covered in the following sections.

11.3.2. Module bsddb

Module bsddb supplies three factory functions: btopen, hashopen, and rnopen.

btopen, hashopen, rnopen
btopen(filename,flag='r',*many_other_optional_arguments)
hashopen(filename,flag='r',*many_other_optional_arguments)
rnopen(filename,flag='r',*many_other_optional_arguments)

btopen opens or creates the binary tree file named by filename (a string that is any path to a file, not just a name), and returns a BTree object to access and manipulate the file. Argument flag has the same values and meaning as for anydbm.open. Other arguments indicate options that allow fine-grained control, but are rarely used.

hashopen and rnopen work the same way, but open or create hash format and record format files, respectively, returning objects of type Hash and Record. The hash format is generally the fastest and makes sense when you are using keys to look up records. However, if you also need to access records in sorted order, use btopen; if you need to access records in the same order in which you originally wrote them, use rnopen. A file opened with hashopen does not keep records in any particular order.

An object b of any of the types BTree, Hash, and Record can be indexed as a mapping, as long as keys and values are strings. Further, b also supports sequential access through the concept of a current record. b supplies the following methods.

close
b.close( )

Closes b. Call no other method on b after b.close( ).

first
b.first( )

Sets b's current record to the first record and returns a pair (key,value) for the first record. The order of records is arbitrary, except for BTree objects, which ensure records are sorted in alphabetical order of key. b.first( ) raises KeyError if b is empty.

has_key
b.has_key(key)

Returns True if string key is a key in b; otherwise, returns False.

keys
b.keys( )

Returns the list of b's key strings. The order is arbitrary, except for BTree objects, which return keys in alphabetical order.

last
b.last( )

Sets b's current record to the last record and returns a pair (key,value) for the last record. Type Hash does not supply method last.

next
b.next( )

Sets b's current record to the next record and returns a pair (key,value) for the next record. b.next( ) raises KeyError if b has no next record.

previous
b.previous( )

Sets b's current record to the previous record and returns a pair (key,value) for the previous record. Type Hash does not supply method previous.

set_location
b.set_location(key)

Sets b's current record to the item with string key key and returns a pair (key,value). If key is not a key in b, and b is of type BTree, b.set_location(key) sets b's current record to the item whose key is the smallest key larger than key and returns that key/value pair. For other object types, set_location raises KeyError if key is not a key in b.
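
The "current record" methods described above can be tried without a BSD DB installation. The following is only a sketch: SortedRecords is a hypothetical in-memory stand-in (it is not part of bsddb) that imitates the behavior documented here for BTree objects, and walk is an illustrative helper that visits every record using nothing but first and next, stopping at the KeyError that signals the end.

```python
import bisect


class SortedRecords(object):
    """Hypothetical in-memory stand-in for a BTree-style object (not part of bsddb)."""

    def __init__(self, items):
        self._data = dict(items)
        self._keys = sorted(self._data)   # BTree objects keep keys sorted
        self._pos = None                  # index of the current record, if any

    def _record(self, index):
        # Make index the current record, or raise KeyError if it is out of range.
        if index < 0 or index >= len(self._keys):
            raise KeyError('no such record')
        self._pos = index
        key = self._keys[index]
        return key, self._data[key]

    def first(self):
        return self._record(0)

    def next(self):
        return self._record(0 if self._pos is None else self._pos + 1)

    def set_location(self, key):
        # Exact match if present, otherwise the smallest key larger than key.
        return self._record(bisect.bisect_left(self._keys, key))


def walk(b):
    """Return all (key, value) pairs of b, using only first() and next()."""
    records = []
    try:
        record = b.first()
        while True:
            records.append(record)
            record = b.next()
    except KeyError:
        pass    # b was empty, or we moved past the last record
    return records


b = SortedRecords({'pear': '3', 'apple': '1', 'fig': '2'})
print(walk(b))
print(b.set_location('b'))
```

The same walk function should work unchanged on any object that follows the first/next protocol described in this section, since it relies only on those two methods and on KeyError.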

11.3.3. Examples of Berkeley DB Use

The Berkeley DB is suited to tasks similar to those for which DBM-like files are appropriate. Indeed, anydbm uses dbhash, the DBM-like interface to BSD DB, to create new DBM-like files. In addition, BSD DB allows other file formats when you use module bsddb directly. The binary tree format is not as fast as the hashed format for keyed access, but it is excellent when you also need to access keys in alphabetical order.

The following example handles the same task as the DBM example shown earlier, but uses bsddb rather than anydbm:

import fileinput, os, bsddb
wordPos = {  }
sep = os.pathsep
for line in fileinput.input( ):
    pos = '%s%s%s'%(fileinput.filename( ), sep, fileinput.filelineno( ))
    for word in line.split( ):
        wordPos.setdefault(word,[  ]).append(pos)
btOut = bsddb.btopen('btindex','n')
sep2 = sep * 2
for word in wordPos:
    btOut[word] = sep2.join(wordPos[word])
btOut.close( )

The differences between this example and the DBM one are minimal: writing a new binary tree format file with bsddb is basically the same task as writing a new DBM-like file with anydbm. Reading back the data using bsddb.btopen('btindex') rather than anydbm.open('indexfile') is also similar. To illustrate the extra features of binary trees regarding access to keys in alphabetical order, let's tackle a slightly more general task. The following example treats its command-line arguments as specifying the beginning of words, and prints the lines in which any word with such a beginning appears:

import sys, os, bsddb, linecache
btIn = bsddb.btopen('btindex')
sep = os.pathsep
sep2 = sep * 2

for word in sys.argv[1:]:
    key, pos = btIn.set_location(word)
    if not key.startswith(word):
        sys.stderr.write('Word-start %r not found in index file\n' % word)
    while key.startswith(word):
        places = pos.split(sep2)
        for place in places:
            fname, lineno = place.split(sep)
            print "%r occurs in line %s of file %s:" % (word,lineno,fname)
            print linecache.getline(fname, int(lineno)),
        try: key, pos = btIn.next( )
        except KeyError: break    # next raises KeyError past the last record

This example exploits the fact that btIn.set_location sets btIn's current position to the smallest key larger than word, when word itself is not a key in btIn. When word is the start of a word, and the keys are words, this means that set_location sets the current position to the first word, in alphabetical order, that begins with word. The tests with key.startswith(word) check that we're still scanning words with that beginning, and terminate the while loop when that is no longer the case. We perform the first such test in an if statement, right before the while, because we want to single out the case where no word at all starts with the desired beginning, and output an error message in that specific case.
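
The prefix-scanning logic can also be exercised without an index file. The sketch below is an emulation under stated assumptions, not the bsddb API: it uses a plain dictionary with made-up contents in place of the btindex file, and bisect.bisect_left on the sorted keys to play the role of set_location. The names decode_positions and prefix_scan are illustrative helpers introduced here.

```python
import bisect
import os

SEP = os.pathsep    # separates file name from line number within one position
SEP2 = SEP * 2      # separates positions within one stored value


def decode_positions(value):
    """Split a stored value back into (filename, line number) pairs."""
    pairs = []
    for place in value.split(SEP2):
        fname, lineno = place.split(SEP)
        pairs.append((fname, int(lineno)))
    return pairs


def prefix_scan(index, prefix):
    """Return (word, positions) for every key in index that starts with prefix."""
    keys = sorted(index)                    # a BTree file already keeps keys sorted
    i = bisect.bisect_left(keys, prefix)    # first key >= prefix, like set_location
    found = []
    while i < len(keys) and keys[i].startswith(prefix):
        found.append((keys[i], decode_positions(index[keys[i]])))
        i += 1
    return found


# Hypothetical index contents, in the same format the indexing script writes.
index = {
    'band': SEP2.join(['a.txt' + SEP + '3', 'b.txt' + SEP + '7']),
    'bandit': 'a.txt' + SEP + '9',
    'cat': 'b.txt' + SEP + '1',
}

for word, places in prefix_scan(index, 'ban'):
    print(word, places)
```

Note that this value format, like the one in the examples above, assumes that file names never contain the os.pathsep character; a file name that does would not split back into exactly two parts.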