9.7. Understanding the Google API Response
While the Google API grants you programmatic access to
Google's Web index, it doesn't
provide all the functionality available through the Google.com web
site's search interface.
9.7.1. Can Do
The Google API, in addition to simple keyword queries, supports the
following ["Special Syntaxes" in
Chapter 1]:
site:
daterange:
intitle:
inurl:
allintext:
allinlinks:
filetype:
info:
link:
related:
cache:
9.7.2. Can't Do
The Google API does not support these special syntaxes:
phonebook:
rphonebook:
bphonebook:
stocks:
While queries of this sort provide no individual results, aggregate
result data is sometimes returned and can prove rather useful.
googly.php [Hack #96], for
instance, displays the number of results
(estimatedTotalResultsCount).
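Because the API hands back no individual results for these syntaxes, a script may want to spot them before spending a query. What follows is a minimal sketch, not part of the API itself: the unsupported_syntaxes function and the UNSUPPORTED tuple are hypothetical names, and the check is a simple prefix match on whitespace-separated query terms.

```python
# Special syntaxes the text lists as unsupported by the API.
UNSUPPORTED = ("phonebook:", "rphonebook:", "bphonebook:", "stocks:")


def unsupported_syntaxes(query):
    """Return the unsupported syntaxes used in query, in order of appearance.

    Hypothetical helper: it only looks at the start of each
    whitespace-separated term, which is an assumption about how
    queries are written.
    """
    found = []
    for term in query.lower().split():
        for prefix in UNSUPPORTED:
            if term.startswith(prefix) and prefix not in found:
                found.append(prefix)
    return found


if __name__ == "__main__":
    print(unsupported_syntaxes("stocks:goog site:example.com"))
    print(unsupported_syntaxes("intitle:hacks"))
```

An empty list means the query uses only keywords and syntaxes from the Can Do list.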
9.7.3. The 10-Result Limit
While searches through the standard Google.com home page can be tuned
["Setting Preferences" in Chapter 1] to return 10, 20, 30, 50, or 100 results
per page, the Google Web API limits the number to 10 per query. This
doesn't mean, mind you, that the rest are not
available to you, but it takes a wee bit of creative programming:
looping through the results, 10 at a time [Hack #95].
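The looping idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the hack referenced above: fetch_results is a hypothetical helper, and the search argument stands in for whatever function actually calls the API, taking a query, a zero-based start offset, and a maximum result count. The fake_search function exists only so the sketch runs without a network connection or license key.

```python
from typing import Callable, Dict, List

MAX_PER_QUERY = 10  # the per-query ceiling described in the text


def fetch_results(search: Callable[[str, int, int], List[Dict]],
                  query: str, wanted: int) -> List[Dict]:
    """Collect up to `wanted` results by asking for at most 10 at a time."""
    results: List[Dict] = []
    start = 0  # zero-based offset passed to each query
    while len(results) < wanted:
        batch_size = min(MAX_PER_QUERY, wanted - len(results))
        batch = search(query, start, batch_size)
        if not batch:
            break  # nothing more to fetch
        results.extend(batch)
        start += len(batch)
        if len(batch) < batch_size:
            break  # a short batch means the results ran out
    return results


def fake_search(query: str, start: int, max_results: int) -> List[Dict]:
    """Stand-in for a real API call: pretends the index holds 25 results."""
    total = 25
    stop = min(start + max_results, total)
    return [{"title": f"{query} #{i + 1}"} for i in range(start, stop)]


if __name__ == "__main__":
    hits = fetch_results(fake_search, "hacks", 30)
    print(len(hits), hits[0]["title"], hits[-1]["title"])
```

Each pass through the loop costs one query, so asking for 30 results here spends three.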
9.7.4. What's in the Results
The Google API provides both aggregate and per-result data in its
result set.
9.7.4.1. Aggregate data
The aggregate data, information on the query itself and on the kinds
and number of results that query turned up, consists of:
- <documentFiltering>
-
A Boolean (true/false) value
specifying whether or not results were filtered for very similar
results or those that come from the same web host.
- <searchComments>
-
Any commentary (e.g., a note about stop words being removed) Google
might throw in that would usually be displayed just beneath the
search box on a typical Google results page.
- <estimatedTotalResultsCount>
-
An estimate of how many results might be found for your search in the
Google index. This number may vary from invocation to invocation,
moment to moment—thus the
"estimated" proviso.
- <estimateIsExact>
-
Google may sometimes be sure of its
estimatedTotalResultsCount, in which case
estimateIsExact will be set to
true.
- <resultElements>
-
The individual results themselves, returned as an array.
- <searchQuery>
-
Your Google query, right back at you.
- <startIndex>
-
The index of the first result in the current array of results.
Assuming your query asked for a start of
0, the first result will have a
startIndex of 1. If you asked
for a start of 25,
startIndex would be 26. Yes, I
know it's confusing that start is
zero-based, while startIndex is one-based, but
that's the way the cookie crumbles,
I'm afraid.
- <endIndex>
-
The index of the last result in the current array of results. This is
always the start you set in your query plus maxResults, unless that
sum is greater than estimatedTotalResultsCount, in which case it is
simply estimatedTotalResultsCount.
- <searchTips>
-
May provide suggestions on better using Google, suitable for
displaying to the end user.
- <directoryCategories>
-
A list of directory categories, if any, associated with the query.
- <searchTime>
-
The time spent by the Google server (in seconds) on your search.
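The relationship between the zero-based start and the one-based startIndex and endIndex is easier to see worked through. The following is a small sketch of the arithmetic as described above. The result_window function is a hypothetical name, and it assumes start is smaller than the estimated total.

```python
def result_window(start, max_results, estimated_total):
    """Return (startIndex, endIndex) as the text describes them.

    start is the zero-based offset from the query; both returned
    values are one-based. Assumes start < estimated_total.
    """
    start_index = start + 1
    end_index = min(start + max_results, estimated_total)
    return start_index, end_index


if __name__ == "__main__":
    print(result_window(0, 10, 1000))   # first page of a large result set
    print(result_window(25, 10, 1000))  # a start of 25
    print(result_window(25, 10, 30))    # fewer results than requested
```

With a start of 25 and only 30 estimated results, the window stops at 30 rather than 35.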
9.7.4.2. Individual search result data
The "guts" of a search
result—the URLs, page titles, and snippets—are returned
in a <resultElements> list. Each result
consists of the following elements:
- <summary>
-
The Google Directory summary, if available.
- <URL>
-
The search result's URL, which consistently starts with
http://.
- <snippet>
-
A brief excerpt of the page with query terms highlighted in bold
(HTML <b> </b> tags).
- <title>
-
The page title in HTML.
- <cachedSize>
-
The size in kilobytes (K) of the Google-cached version of the page,
if available.
- <relatedInformationPresent>
-
If set to 1, a related:
search on the current result's URL will turn up
something of use.
- <hostName>
-
When you set filter to true in
your query, only two results from the same hostname are included in
your set of results. In the second of these results,
hostName is set to the host from which the result
came.
- <directoryTitle>
-
The title under which this result appears in the Google Directory
(http://directory.google.com,
a.k.a. the Open Directory Project) if it is in the directory at all.
- <directoryCategory>
-
The Google Directory category, if any, in which
you'll find this result.
<directoryCategory> consists of
<fullViewableName>, the name given to the
category itself, and <specialEncoding>, any
special encoding assigned to the directory category at hand.
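Since the title and snippet arrive as HTML, a script printing to a terminal or a text file will usually want to strip the tags first. Here is a minimal sketch, assuming a single result has already been unpacked into a plain dictionary keyed by the element names above. The plain_text and describe functions, and the sample values, are hypothetical and for illustration only.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude match for tags such as <b> and </b>


def plain_text(fragment):
    """Strip HTML tags and decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", fragment))


def describe(result):
    """Format one result as plain text, skipping elements that are empty."""
    url = result.get("URL", "")
    title = plain_text(result.get("title", "")) or url
    lines = [title, url]
    snippet = plain_text(result.get("snippet", ""))
    if snippet:
        lines.append(snippet)
    if result.get("cachedSize"):
        lines.append("Cached: " + result["cachedSize"])
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up values, not real API output.
    sample = {
        "title": "<b>Google</b> Hacks",
        "URL": "http://example.com/",
        "snippet": "Tips for <b>Google</b> &amp; friends",
        "cachedSize": "12k",
    }
    print(describe(sample))
```

A regular expression is good enough for the bold tags mentioned above, though a real HTML parser would be the safer choice for anything more elaborate.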
You will no doubt notice the conspicuous absence of PageRank. Google does
not make PageRank available through anything but the official Google
Toolbar [Hack #60].
You can get a general idea of a page's popularity by
looking over the popularity bars in the Google
Directory.