16.5 Shelve Files

Pickling allows you to store arbitrary objects on files and file-like objects, but it's still a fairly unstructured medium; it doesn't directly support easy access to members of collections of pickled objects. Higher-level structures can be added, but they are not inherent to the pickling model itself.
Shelves provide some structure to collections of pickled objects. They are a type of file that stores arbitrary Python objects by key for later retrieval, and they are a standard part of the Python system. Really, they are not much of a new topic -- shelves are simply a combination of DBM files and object pickling.
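To make that combination concrete, here is a rough sketch of what a shelve does for you behind the scenes, done by hand with the pickle and anydbm modules; the file name manual-dbase and the record stored are made up for illustration.

    # roughly what shelve automates: pickle turns an object into a string,
    # and a DBM file stores that string under a key for later retrieval
    # (the file name 'manual-dbase' and the record here are illustrative only)
    import pickle, anydbm

    record = {'name': 'Brian', 'age': 33}        # any pickleable object
    dbfile = anydbm.open('manual-dbase', 'c')    # create or open a DBM file
    dbfile['brian'] = pickle.dumps(record)       # store object as a pickled string
    copy = pickle.loads(dbfile['brian'])         # fetch string and rebuild object
    print copy['age']
    dbfile.close()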
Because shelve uses pickle internally, it can store any object that pickle can: strings, numbers, lists, dictionaries, cyclic objects, class instances, and more.

16.5.1 Using Shelves

In other words, shelve is just a go-between; it serializes and deserializes objects so that they can be placed in DBM files. The net effect is that shelves let you store nearly arbitrary Python objects on a file by key and fetch them back later with the same key.

Your scripts never see all this interfacing, though. Like DBM files, shelves provide an interface that looks like a dictionary that must be opened. To gain access to a shelve, import the module and open your file:

    import shelve
    dbase = shelve.open("mydbase")

Internally, Python opens a DBM file with the name mydbase, or creates it if it does not yet exist. Assigning to a shelve key stores an object:

    dbase['key'] = object

Internally, this assignment converts the object to a serialized byte stream and stores it by key on a DBM file. Indexing a shelve fetches a stored object:

    value = dbase['key']

Internally, this index operation loads a string by key from a DBM file and unpickles it into an in-memory object that is the same as the object originally stored. Most dictionary operations are supported here, too:

    len(dbase)          # number of items stored
    dbase.keys()        # stored item key index

And except for a few fine points, that's really all there is to using a shelve. Shelves are processed with normal Python dictionary syntax, so there is no new database API to learn. Moreover, objects stored and fetched from shelves are normal Python objects; they do not need to be instances of special classes or types to be stored away. That is, Python's persistence system is external to the persistent objects themselves. Table 16-2 summarizes these and other commonly used shelve operations.
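For quick reference, the short sketch below runs through the dictionary-style operations most commonly applied to shelves (in the spirit of the operations Table 16-2 summarizes); the file name mydbase echoes the example above, and the records stored are placeholders.

    import shelve

    dbase = shelve.open("mydbase")                        # create or open shelve
    dbase['brian']  = {'name': 'Brian', 'age': 33}        # store: pickle + DBM write
    dbase['knight'] = {'name': 'Knight', 'motto': 'Ni!'}

    print len(dbase)                                      # number of items stored
    print dbase.keys()                                    # list of stored keys
    print dbase.has_key('brian')                          # key membership test
    print dbase['knight']['motto']                        # fetch: DBM read + unpickle

    del dbase['knight']                                   # delete an entry by key
    dbase.close()                                         # close explicitly to be safe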
Because shelves export a dictionary-like interface, too, this table is almost identical to the DBM operation table. Here, though, the module name anydbm is replaced by shelve, open calls do not require a second 'c' argument, and stored values can be nearly arbitrary kinds of objects, not just strings. You still should close shelves explicitly after making changes to be safe, though; shelves use anydbm internally, and some underlying DBMs require closes to avoid data loss or damage.

16.5.2 Storing Built-in Object Types

Let's run an interactive session to experiment with shelve interfaces:

    % python
    >>> import shelve
    >>> dbase = shelve.open("mydbase")
    >>> object1 = ['The', 'bright', ('side', 'of'), ['life']]
    >>> object2 = {'name': 'Brian', 'age': 33, 'motto': object1}
    >>> dbase['brian']  = object2
    >>> dbase['knight'] = {'name': 'Knight', 'motto': 'Ni!'}
    >>> dbase.close()

Here, we open a shelve and store two fairly complex dictionary and list data structures away permanently by simply assigning them to shelve keys. Because shelve uses pickle internally, almost anything goes here -- the trees of nested objects are automatically serialized into strings for storage. To fetch them back, just reopen the shelve and index:

    % python
    >>> import shelve
    >>> dbase = shelve.open("mydbase")
    >>> len(dbase)                     # entries
    2
    >>> dbase.keys()                   # index
    ['knight', 'brian']
    >>> dbase['knight']                # fetch
    {'motto': 'Ni!', 'name': 'Knight'}
    >>> for row in dbase.keys():
    ...     print row, '=>'
    ...     for field in dbase[row].keys():
    ...         print ' ', field, '=', dbase[row][field]
    ...
    knight =>
      motto = Ni!
      name = Knight
    brian =>
      motto = ['The', 'bright', ('side', 'of'), ['life']]
      age = 33
      name = Brian

The nested loops at the end of this session step through nested dictionaries -- the outer scans the shelve, and the inner scans the objects stored in the shelve. The crucial point to notice is that we're using normal Python syntax both to store and to fetch these persistent objects, as well as to process them after loading.

16.5.3 Storing Class Instances

One of the more useful kinds of objects to store in a shelve is a class instance. Because its attributes record state and its inherited methods define behavior, persistent class objects effectively serve the roles of both database records and database-processing programs. For instance, consider the simple class shown in Example 16-2, which is used to model people.
Example 16-2. PP2E\Dbase\person.py (version 1)

    # a person object: fields + behavior

    class Person:
        def __init__(self, name, job, pay=0):
            self.name = name
            self.job  = job
            self.pay  = pay                    # real instance data
        def tax(self):
            return self.pay * 0.25             # computed on call
        def info(self):
            return self.name, self.job, self.pay, self.tax()

We can make some persistent objects from this class by simply creating instances as usual, and storing them by key on an opened shelve:

    C:\...\PP2E\Dbase>python
    >>> from person import Person
    >>> bob   = Person('bob',   'psychologist', 70000)
    >>> emily = Person('emily', 'teacher',      40000)
    >>>
    >>> import shelve
    >>> dbase = shelve.open('cast')            # make new shelve
    >>> for obj in (bob, emily):               # store objects
    ...     dbase[obj.name] = obj              # use name for key
    ...
    >>> dbase.close()                          # need for bsddb

When we come back and fetch these objects in a later Python session or script, they are recreated in memory as they were when they were stored:

    C:\...\PP2E\Dbase>python
    >>> import shelve
    >>> dbase = shelve.open('cast')            # reopen shelve
    >>>
    >>> dbase.keys()                           # both objects are here
    ['emily', 'bob']
    >>> print dbase['emily']
    <person.Person instance at 799940>
    >>>
    >>> print dbase['bob'].tax()               # call: bob's tax
    17500.0

Notice that calling Bob's tax method works even though we didn't import the Person class here. Python is smart enough to link this object back to its original class when unpickled, such that all the original methods are available through fetched objects.

16.5.4 Changing Classes of Stored Objects

Technically, Python reimports a class to recreate its stored instances as they are fetched and unpickled. Here's how this works:

Store: When an instance object is pickled, only its attribute dictionary is saved, along with the names of its class and the module the class lives in; the class's code itself is not written to the shelve.

Fetch: When the instance is later fetched, Python imports the class's module again, looks up the class by name, and builds a new instance that carries the saved attributes and gains the (possibly updated) class's methods.
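Because shelve defers to pickle for all of this, the same mechanism can be watched with the pickle module by itself. The short sketch below is illustrative only; it assumes the person.py module from Example 16-2 is importable, and it shows that the pickled string carries the instance's attribute dictionary plus the class and module names, not the class's code.

    # a minimal sketch of the store/fetch mechanism using pickle directly
    # (assumes person.py from Example 16-2 is on the module search path)
    import pickle
    from person import Person

    bob  = Person('bob', 'psychologist', 70000)
    data = pickle.dumps(bob)       # records module name, class name, bob.__dict__
    new  = pickle.loads(data)      # imports person if needed, rebuilds an instance
    print new.name, new.tax()      # methods come from the (re)imported class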
The key point in this is that the class itself is not stored with its instances, but is instead reimported later when instances are fetched. The upshot is that by modifying external classes in module files, we can change the way stored objects' data is interpreted and used without actually having to change those stored objects. It's as if the class is a program that processes stored records.

To illustrate, suppose the Person class from the previous section was changed to the source code in Example 16-3.

Example 16-3. PP2E\Dbase\person.py (version 2)

    # a person object: fields + behavior
    # change: the tax method is now a computed attribute

    class Person:
        def __init__(self, name, job, pay=0):
            self.name = name
            self.job  = job
            self.pay  = pay                    # real instance data
        def __getattr__(self, attr):           # on person.attr
            if attr == 'tax':
                return self.pay * 0.30         # computed on access
            else:
                raise AttributeError           # other unknown names
        def info(self):
            return self.name, self.job, self.pay, self.tax

This revision has a new tax rate (30%), introduces a __getattr__ qualification overload method, and deletes the original tax method. Tax attribute references are intercepted and computed when accessed:

    C:\...\PP2E\Dbase>python
    >>> import shelve
    >>> dbase = shelve.open('cast')            # reopen shelve
    >>>
    >>> print dbase.keys()                     # both objects are here
    ['emily', 'bob']
    >>> print dbase['emily']
    <person.Person instance at 79aea0>
    >>>
    >>> print dbase['bob'].tax                 # no need to call tax()
    21000.0

Because the class has changed, tax is now simply qualified, not called. In addition, because the tax rate was changed in the class, Bob pays more this time around. Of course, this example is artificial, but when used well, this separation of classes and persistent instances can eliminate many traditional database update programs -- in most cases, you can simply change the class, not each stored instance, to get new behavior.

16.5.5 Shelve Constraints

Although shelves are generally straightforward to use, there are a few rough edges worth knowing about.

16.5.5.1 Keys must be strings

First of all, although they can store arbitrary objects, keys must still be strings. The following fails, unless you convert the integer 42 to the string "42" manually first:

    dbase[42] = value        # fails, but str(42) will work

This is different from in-memory dictionaries, which allow any immutable object to be used as a key, and derives from the shelve's use of DBM files internally.

16.5.5.2 Objects are only unique within a key

Although the shelve module is smart enough to detect multiple occurrences of a nested object and recreate only one copy when fetched, this holds true only within a given slot:

    dbase[key] = [object, object]    # okay: only one copy stored and fetched

    dbase[key1] = object
    dbase[key2] = object             # bad?: two copies of object in the shelve

When key1 and key2 are fetched, they reference independent copies of the original shared object; if that object is mutable, changes made through one copy won't be reflected in the other. This really stems from the fact that each key assignment runs an independent pickle operation -- the pickler detects repeated objects, but only within each pickle call. This may or may not be a concern in practice, and it can be avoided with extra support logic, but be aware that an object can be duplicated if it spans keys.
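A short interactive sketch makes the duplication visible; the shelve name dupes and the stored list below are hypothetical, but the effect follows directly from each key assignment running its own pickle operation.

    >>> import shelve
    >>> db = shelve.open('dupes')      # hypothetical scratch shelve
    >>> shared = ['tofu', 'spam']
    >>> db['a'] = shared               # each assignment pickles independently,
    >>> db['b'] = shared               # so two copies of the list are stored
    >>> rec = db['a']                  # fetch one copy into memory
    >>> rec.append('eggs')             # changing it does not affect the other
    >>> db['a'] = rec                  # store the change back under 'a' only
    >>> db['b']                        # 'b' still holds the original list
    ['tofu', 'spam']
    >>> db.close()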
16.5.5.3 Updates must treat shelves as fetch-modify-store mappings

Because objects fetched from a shelve don't know that they came from a shelve, operations that change components of a fetched object change only the in-memory copy, not the data on the shelve:

    dbase[key].attr = value   # shelve unchanged

To really change an object stored on a shelve, fetch it into memory, change its parts, and then write it back to the shelve as a whole by key assignment (a sketch combining this pattern with stored class instances appears after the pickler constraints below):

    object = dbase[key]       # fetch it
    object.attr = value       # modify it
    dbase[key] = object       # store back -- shelve changed

16.5.5.4 Concurrent updates not allowed

As we learned near the end of Chapter 14, the shelve module does not currently support simultaneous updates. Simultaneous readers are okay, but writers must be given exclusive access to the shelve. You can trash a shelve if multiple processes write to it at the same time, and this is a common possibility in contexts such as server-side CGI scripts. If your shelves may be hit by multiple processes, be sure to wrap updates in calls to the fcntl.flock built-in we explored in Chapter 14.

16.5.5.5 Pickler class constraints

In addition to these shelve constraints, storing class instances in a shelve adds a set of additional rules you need to be aware of. Really, these are imposed by the pickle module, not shelve, so be sure to follow them if you store class objects with pickle directly, too. Most importantly, the class of a stored instance must be importable when the instance is later fetched: because only the instance's attributes plus the class and module names are saved, the class itself must live at the top level of a module that can be found on the module search path at unpickling time, under the same name it had when the instance was stored.
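Putting the fetch-modify-store rule and the importability rule together, a small update script might look like the following sketch. The 10 percent raise is made up for illustration; the shelve name cast matches the earlier sessions, and person.py must be importable when the script runs so the fetches can rebuild the Person instances.

    # give every stored Person a raise: fetch, modify, store back by key
    # (assumes the 'cast' shelve built earlier, and that person.py is on
    #  the module search path so unpickling can reimport the Person class)
    import shelve

    dbase = shelve.open('cast')
    for key in dbase.keys():
        obj = dbase[key]              # fetch: unpickles and reimports Person
        obj.pay = obj.pay * 1.10      # changes only the in-memory copy
        dbase[key] = obj              # store back, or the shelve is unchanged
    dbase.close()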
In a prior Python release, persistent object classes also had to either use constructors with no arguments, or they had to provide defaults for all constructor arguments (much like the notion of a C++ copy constructor). This constraint was dropped as of Python 1.5.2 -- classes with non-defaulted constructor arguments now work fine in the pickling system.[2]
16.5.5.6 Other persistence limitations

In addition to the above constraints, keep in mind that files created by an underlying DBM system are not necessarily compatible with all possible DBM implementations. For instance, a file generated by gdbm may not be readable by a Python with another DBM module installed, unless you explicitly import gdbm instead of anydbm (assuming it is installed at all). If DBM file portability is a concern, make sure that all the Pythons that will read your data use compatible DBM modules (one way to pin down the format explicitly is sketched at the end of this section).

Finally, although shelves store objects persistently, they are not really object-oriented database systems (OODBs). Such systems also implement features like object decomposition and delayed ("lazy") component fetches, based on generated object IDs: parts of larger objects are loaded into memory only as they are accessed. It's possible to extend shelves to support such features, but you don't need to -- the Zope system described in Chapter 15 includes an implementation of a more complete OODB system. It is constructed on top of Python's built-in persistence support, but offers additional features for advanced data stores. See the previous chapter for information and links.
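As a closing aside on DBM portability, the sketch below shows one way to pin the underlying format explicitly instead of letting anydbm choose it. It is a minimal sketch under the assumption that the gdbm module is installed and that the file name mydbase-gdbm is ours to invent; shelve.Shelf wraps any DBM-style mapping you hand it, so every Python that reads the file must then also use gdbm.

    # pin the DBM format by wrapping a gdbm file in a Shelf explicitly
    # (assumes the gdbm module is installed; 'mydbase-gdbm' is a made-up name)
    import gdbm, shelve

    dbase = shelve.Shelf(gdbm.open('mydbase-gdbm', 'c'))   # not anydbm's choice
    dbase['key'] = ['any', 'pickleable', 'object']         # values still pickled
    print dbase['key']
    dbase.close()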