9.8. A Note on Spidering and Scraping

A small share of the hacks in this book involve spidering, or meandering through sites and scraping data from their web pages to be used outside of their intended context. Given that we have the Google API at our disposal, why do we at times resort to spidering and scraping?

The main reason is simply that you can't gain access to everything Google through the API. While it nicely serves the purposes of searching the Web programmatically, the API (at the time of this writing) doesn't go any further than Google's main web search index. And it's even limited in what you can pull from the index. You can't do a phonebook search, trawl Google News, leaf through Google Catalogs, or interact in any way with any of Google's other specialty search properties.

So, while Google provides a good start in its API, there are more often than not situations in which you can't get to the Google data that you're most interested in. Nor does the API help when you want to combine what you can get through it with data from other sites that offer no such convenience. That's where spidering and scraping come in.

That said, there are a few things that you need to keep in mind when resorting to scraping:

Scrapers are brittle: The shelf life of a scraper lasts only as long as the page it scrapes keeps roughly the same format. When the page changes, your scraper can, and most likely will, break.
Tread lightly: Take only as much as you need and no more. If all you need is the data from the page that you already have open in your browser, save the source and scrape that.
Maximize your effectiveness: Make the most out of every page you scrape. Rather than hitting Google again and again for the next 10 results and the next 10, set your preferences ["Setting Preferences" in Chapter 1] so that you get all you can on a single page. For instance, set your preferred number of results to 100 rather than the default 10.
Mind the terms of service: It might be tempting to go one step further and create programs that automate retrieving and scraping, but doing so makes you more likely to tread on the toes of the site owner (Google or otherwise) and to be asked to leave or simply be locked out.
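
To make the "save the source and scrape that" advice concrete, here is a minimal sketch in Python that pulls links out of a page you have already saved from your browser, so nothing is fetched over the network. The names LinkExtractor, extract_links, and extract_links_from_file are hypothetical and chosen for illustration only; the sketch assumes nothing about how any particular site lays out its pages, which is exactly why a scraper of this kind is brittle.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs for every <a href="..."> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_links_from_file(path):
    """Scrape a page saved to disk (hypothetical helper; no network access)."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return extract_links(f.read())
```

Calling extract_links_from_file on a saved page returns every link and its text; picking out only the entries you care about still depends on the page's current layout and will need revisiting whenever that layout changes.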

So use the API whenever you can, scrape only when you absolutely must, and mind your p's and q's when fiddling about with other people's data.
