Hack 42. Spider the Web with Firefox
Save lots and lots of web pages to your local disk without hassle.

If a web page is precious, a simple bookmark might not be enough. You might want to keep a copy of the page locally. This hack explains how to save lots of things at once with Firefox.

Usually this kind of job is done by a web spider: any program that poses as a user and navigates through pages, following links. For heavy-duty web site spidering done separately from Firefox, Free Download Manager (http://www.freedownloadmanager.org) for Windows and wget(1) for Unix/Linux (usually preinstalled) are recommended.

4.11.1. Save One Complete Page

The days of HTML-only page capture are long gone. It's easy to capture a whole web page now.

4.11.1.1 Saving using Web Page Complete

To save a whole web page, choose File → Save Page As... and pick "Web Page, complete" as the save type. Firefox saves the page's HTML plus a folder of supporting files (images, stylesheets), so the copy displays properly offline.
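Under the hood, a "complete" save amounts to fetching the page and then fetching every asset its HTML refers to. As a rough illustration of that first step (this is a sketch of the general technique, not Firefox's actual code; the helper names are my own), here is how embedded asset URLs can be collected with the Python standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetFinder(HTMLParser):
    """Collect the URLs of images, scripts, and stylesheets in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            # src attributes may be relative; resolve against the page URL
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.assets.append(urljoin(self.base_url, attrs["href"]))

def find_assets(html, base_url):
    """Return the absolute URLs of all assets a saved copy would need."""
    finder = AssetFinder(base_url)
    finder.feed(html)
    return finder.assets

if __name__ == "__main__":
    page = ('<html><head><link rel="stylesheet" href="style.css"></head>'
            '<body><img src="/logo.png"></body></html>')
    print(find_assets(page, "http://www.example.com/section1/page3.html"))
    # → ['http://www.example.com/section1/style.css',
    #    'http://www.example.com/logo.png']
```

A complete-page saver would then download each of these URLs into the supporting-files folder and rewrite the HTML to point at the local copies.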
4.11.1.2 Saving using Print

One problem with saved web pages is that the copy is just a snapshot in time. It's difficult to tell from a plain HTML document when it was captured. A common technique that solves this problem and keeps all the HTML content together is to use Acrobat Distiller, which comes with the commercial (nonfree) version of Acrobat Reader. When Distiller is installed, it also installs two printer drivers. The important one is called Acrobat PDFWriter. It can convert an HTML page to a single date-stamped PDF file. Although such PDF files can be quite large compared to HTML, and are occasionally imperfect, the process of capturing web pages this way is addictive in its simplicity, and the files are easy to view later with the free (or full) Reader. To save web pages as PDF files, choose File → Print and select Acrobat PDFWriter as the target printer.

4.11.2. Save Lots of Pages

To save lots of web pages, use an extension. The Download Tools category at http://update.mozilla.org lists a number of likely candidates. Here are a few of them.

4.11.2.1 Down Them All

The Down Them All extension (http://downthemall.mozdev.org), invoked from the context menu, skims the current page for foreign information and saves everything it finds to local disk. It effectively acts as a two-tier spider: it saves all images linked from the current page, as well as all pages linked to from the current page. It doesn't save stylesheets or images embedded in linked-to pages. Two advantages of Down Them All are that it can be stopped partway through and that download progress is obvious while it is underway.

4.11.2.2 Magpie

The Magpie extension (http://www.bengoodger.com/software/tabloader/) provides a minimal interface that takes a little getting used to. For spidering purposes, the context menu items that Magpie adds are not so useful. The special keystroke Ctrl-Shift-S, special URLs, and the Magpie configuration dialog box are the key spidering features.
To find the Magpie configuration system, choose the Magpie item in the Tools menu. The configuration window appears (Figure 4-21).

Figure 4-21. Magpie configuration window

Using this dialog box, you can set one of two options for Ctrl-Shift-S (detailed in the radio group at the top). Everything else in this window has to do with folder names to be used on local disk. The first time you press Ctrl-Shift-S, Firefox asks you for the name of an existing folder in which to put all the Magpie downloads. After that, it never asks again.

By default, Ctrl-Shift-S saves all tabs to the right of the current one and then closes those tabs. That is one-tier spidering of one or more web pages, plus two-tier spidering for any linked images in the displayed pages. If the "Linked from the current page..." option is selected instead, Magpie acts like Down Them All, scraping all images (or other specified content) linked from the current page. In both cases, Magpie creates a folder named YYYY-MM-DD HH-MM-SS (a datestamp) in the target directory and stuffs all the spidered content in there.

The other use of Magpie is to download collections of URLs that have similar names. This is like specifying a keyword bookmark, except that only numbers can be used as parameters and they must be hand-specified as ranges. For example, suppose these URLs are required:

http://www.example.com/section1/page3.html
http://www.example.com/section1/page4.html
http://www.example.com/section2/page3.html
http://www.example.com/section2/page4.html

Using the special bkstr: URL scheme (an unofficial convenience implemented by Magpie), these four URLs can be condensed down to a single URL that indicates the ranges required:

bkstr://www.example.com/section{1-2}/page{3-4}.html

Retrieving this URL fetches the four pages listed directly to disk, with no display. This is one-tier spidering, so the retrieved pages will not include any images to which they refer.
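The range notation itself is easy to reproduce outside Magpie. Here is a small Python sketch (my own illustration, not Magpie's code) that expands a bkstr:-style URL into the plain URLs it denotes:

```python
import itertools
import re

def expand_ranges(url):
    """Expand {N-M} numeric range tokens in a URL into a list of plain URLs."""
    # Split into literal text and captured {start-end} tokens.
    # re.split with two capturing groups yields:
    #   [literal, start, end, literal, start, end, ..., literal]
    parts = re.split(r"\{(\d+)-(\d+)\}", url)
    literals = parts[0::3]                        # text between range tokens
    bounds = list(zip(parts[1::3], parts[2::3]))  # (start, end) string pairs
    choices = [range(int(a), int(b) + 1) for a, b in bounds]
    urls = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for number, literal in zip(combo, literals[1:]):
            pieces.append(str(number))
            pieces.append(literal)
        urls.append("".join(pieces))
    return urls

if __name__ == "__main__":
    for u in expand_ranges("bkstr://www.example.com/section{1-2}/page{3-4}.html"):
        print(u)
    # → bkstr://www.example.com/section1/page3.html
    #   bkstr://www.example.com/section1/page4.html
    #   bkstr://www.example.com/section2/page3.html
    #   bkstr://www.example.com/section2/page4.html
```

Feeding each expanded URL (with bkstr:// swapped back to http://) to any downloader reproduces Magpie's batch-retrieval trick.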
This technique is most useful for retrieving a set of images from a photo album or a set of documents (chapters, minutes, diary entries) from an index page.

4.11.2.3 Slogger

Rather than saving page content on demand, the Slogger extension (http://www.kenschutte.com/firefoxext/) saves every page you ever display. After the initial install, the extension does nothing immediately. Only when you highlight it in the Extensions Manager, click the Options box, and choose a default folder for the logged content does it start to fill the disk. The configuration options are numerous, and Perl-like syntax options make both the names of the logged files and the content of the log audit trail highly customizable. Since Slogger saves only what you see, how well it spiders depends on how deeply you navigate through a web site's hierarchy.

Note that Mozilla's history mechanism works the same way as Slogger, except that it stores downloaded web pages unreadably in the disk cache (if that's turned on), and that disk cache can be flushed or overwritten if it fills up.

4.11.3. Learning from the Master

Bob Clary's CSpider JavaScript library and XUL Spider application are the best free tools available for automating web page navigation from inside web pages. You can read about them here: http://www.bclary.com/2004/07/10/mozilla-spiders.

These tools are aimed at web programmers with a systematic mindset. They are the basis of a suite of web page compatibility and correctness tests. These tools won't let you save anything to disk; instead, they represent a useful starting point for any spidering code that you might want to create yourself.
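Whatever the language, the skeleton of such spidering code is the same: fetch a page, extract its links, then visit each link in turn up to some depth. Here is a minimal sketch of that loop in Python (an assumed standalone design for illustration only; CSpider itself is JavaScript and runs inside the browser):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkFinder(HTMLParser):
    """Collect absolute URLs from the <a href> tags in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    """Return the absolute URLs of all pages linked from the given HTML."""
    finder = LinkFinder(base_url)
    finder.feed(html)
    return finder.links

def spider(start_url, depth=1):
    """Breadth-first visit of start_url and the pages it links to.

    depth=1 gives one-tier spidering of the start page's links,
    like Magpie; a real tool would save each page to disk.
    """
    seen, queue = set(), [(start_url, 0)]
    while queue:
        url, tier = queue.pop(0)
        if url in seen or tier > depth:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", "replace")
        # Saving `html` to a datestamped folder would go here.
        for link in extract_links(html, url):
            queue.append((link, tier + 1))
    return seen
```

Everything else in a production spider (politeness delays, robots.txt handling, retry logic, on-disk layout) is elaboration of this loop.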