9.8. A Note on Spidering and Scraping

Some small share of the hacks in this book involve spidering, or meandering through sites, and scraping data from their web pages to be used outside of their intended context. Given that we have the Google API at our disposal, why do we resort at times to spidering and scraping?

The main reason is simply that you can't gain access to everything Google offers through the API. While it nicely serves the purposes of searching the Web programmatically, the API (at the time of this writing) doesn't go any further than Google's main web search index, and it's limited even in what you can pull from that index. You can't do a phonebook search, trawl Google News, leaf through Google Catalogs, or interact in any way with any of Google's other specialty search properties.

So, while Google provides a good start in its API, there are often situations in which you can't get to the Google data that you're most interested in, to say nothing of combining what you can get through the Google API with data from other sites that offer no such convenience. That's where spidering and scraping come in.

That said, there are a few things that you need to keep in mind when resorting to scraping:
So use the API whenever you can, scrape only when you absolutely must, and mind your p's and q's when fiddling about with other people's data.