9.6. Understanding the Google API Query
The core of a Google application is the query. Without the query,
there's no Google data, and without that, you
don't have much of an application. Because of its
importance, it's worth taking a little time to look
into the anatomy of a typical query.
9.6.1. Query Essentials
The command in a typical Perl-based Google API application that sends
a query to Google looks like this:
my $results = $google_search ->
doGoogleSearch(
key, query, start, maxResults,
filter, restrict, safeSearch, lr,
ie, oe
);
Usually, the items within the parentheses are variables, numbers, or
Boolean values (true or false).
In the previous example, I've included the names of
the arguments themselves rather than sample values so that you can
see their definitions here:
- key
-
This is where you put your Google API developer's
key. Without a key, the query won't go very far.
- query
-
This is your query, composed of keywords, phrases, and special
syntaxes.
- start
-
Also known as the offset, this integer value
specifies at what result to start counting when determining which 10
results to return. If this number were 16, the
Google API would return results 16-25; if 300,
results 300-309 (assuming, of course, that your query found that many
results). This is known as a zero-based index,
since counting starts at 0, not 1. The first result is result 0, and
the 999th, 998. It's a little odd, admittedly, but
you get used to it quickly—especially if you go on to do a lot
of programming. Acceptable values are 0 to
999 because Google only returns up to a thousand
results for a query.
- maxResults
-
This integer specifies the number of results that you would like the
API to return. The API returns results in batches of up to ten, so
acceptable values are 1 through
10.
- filter
-
You might think that the filter option concerns
the SafeSearch filter for adult content. It doesn't.
This Boolean value (true or
false) specifies whether your results go through
automatic query filtering, removing near-duplicate content (titles
and snippets that are very similar) and multiple (more than two)
results from the same host or site. With filtering enabled, only the
first two results from each host are included in the result set.
- restrict
-
No, restrict doesn't have
anything to do with SafeSearch either. It allows for restricting your
search to one of Google's topical searches or to a
specific country. Google has four topic restricts: U.S. Government
(unclesam), Linux (linux),
Macintosh (mac), and FreeBSD
(bsd). You'll find the complete
country list in the Google Web API documentation. To leave your
search unrestricted, leave this option blank (usually signified by
empty quotation marks, "").
- safeSearch
-
Now here's the SafeSearch filtering option. This
Boolean (true or false)
specifies whether results returned will be filtered for questionable
(read: adult) content.
- lr
-
This stands for language restrict and
it's a bit tricky. Google has a list of languages in
its API documentation to which you can restrict search results, or
you can simply leave this option blank and have no language
restrictions.
There are several ways that you can restrict to language. First, you
can simply include a language code. If you want to restrict results
to English, for example, use lang_en. But you can
also restrict results to more than one language, separating each
language code with a | (pipe), signifying
OR. lang_en|lang_de, then,
constrains results to only those "in English or
German."
You can omit languages from results by prepending them with a
- (minus sign). -lang_en
returns all results but those in English.
- ie
-
This stands for input encoding, allowing you to
specify the character encoding used in the query that
you're feeding the API. Google's
documentation says, "Clients should encode all
request data in UTF-8 and should expect results to be in
UTF-8." In the first iteration of
Google's API program, the Google API documentation
offered a table of encoding options (latin1,
cyrillic, etc.) but now everything is
UTF-8. In fact, any other encoding that you specify is summarily
ignored.
- oe
-
This stands for output encoding. As with input
encoding, everything's UTF-8.
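Because the ten arguments are positional, it is easy to put one in the wrong slot. As a minimal sketch (not part of the Google API or any Perl module), a small wrapper can accept named options, check the ranges described above, and return the arguments in the order doGoogleSearch expects. The name search_args and its option keys are hypothetical, and the "utf8" encoding values simply mirror the samples in this section.

```perl
use strict;
use warnings;

# Hypothetical helper: turn named options into the ten positional
# arguments, in the order doGoogleSearch expects them.
# Pass filter and safeSearch as Perl true/false values (1 or 0).
sub search_args {
    my (%opt) = @_;

    die "key is required\n"   unless defined $opt{key}   && length $opt{key};
    die "query is required\n" unless defined $opt{query} && length $opt{query};

    my $start = defined $opt{start}      ? $opt{start}      : 0;
    my $max   = defined $opt{maxResults} ? $opt{maxResults} : 10;

    # start is a zero-based offset; Google returns at most 1,000 results.
    die "start must be an integer from 0 to 999\n"
        unless $start =~ /\A\d+\z/ && $start <= 999;
    # Results come back in batches of up to ten.
    die "maxResults must be an integer from 1 to 10\n"
        unless $max =~ /\A\d+\z/ && $max >= 1 && $max <= 10;

    return (
        $opt{key}, $opt{query}, $start, $max,
        ($opt{filter}     ? 'true' : 'false'),
        (defined $opt{restrict} ? $opt{restrict} : ''),
        ($opt{safeSearch} ? 'true' : 'false'),
        (defined $opt{lr} ? $opt{lr} : ''),
        'utf8', 'utf8',    # ie and oe
    );
}

my @args = search_args(
    key   => 'your-key-here', query  => 'perl hacks',
    start => 100,             filter => 1, safeSearch => 1,
);
print join(', ', map { "'$_'" } @args), "\n";
```

The returned list could then be handed straight to the call, as in $google_search->doGoogleSearch(@args).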
9.6.2. A Sample
Enough with the placeholders; what does an actual query look like?
Take, for example, a query that uses variables for the key and the
query, requests 10 results starting at result number 100 (actually
the 101st result), and specifies that both filtering and SafeSearch be
turned on. That query in Perl would look like this:
my $results = $google_search ->
doGoogleSearch(
$google_key, $query, 100, 10,
"true", "", "true", "",
"utf8", "utf8"
);
Note that the key and query could just as easily have been passed
along as quote-delimited strings:
my $results = $google_search ->
doGoogleSearch(
"12BuCK13mY5h0E/34KN0cK@ttH3Do0R", "+paleontology +dentistry", 100, 10,
"true", "", "true", "",
"utf8", "utf8"
);
While things appear a little more complex when you start fiddling
with the language and topic restrictions, the core query remains
mostly unchanged; only the values of the options change.
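Since only the values change from call to call, stepping through a larger result set comes down to varying start and maxResults. The sketch below works out which pairs to request, assuming the limits described earlier (at most 10 results per call, and offsets 0 through 999). The helper name page_requests is hypothetical, and the actual search call is left commented out so the example runs without a key.

```perl
use strict;
use warnings;

# Hypothetical helper: list the [start, maxResults] pairs needed to
# collect the first $wanted results, ten at a time.
sub page_requests {
    my ($wanted) = @_;
    $wanted = 1000 if $wanted > 1000;    # Google returns at most 1,000 results

    my @pages;
    for (my $start = 0; $start < $wanted; $start += 10) {
        my $remaining = $wanted - $start;
        my $max       = $remaining < 10 ? $remaining : 10;
        push @pages, [ $start, $max ];
    }
    return @pages;
}

for my $page (page_requests(25)) {
    my ($start, $max) = @$page;
    print "start=$start maxResults=$max\n";
    # my $results = $google_search->doGoogleSearch(
    #     $google_key, $query, $start, $max,
    #     "true", "", "true", "", "utf8", "utf8"
    # );
}
```

Collecting 25 results this way takes three calls (offsets 0, 10, and 20), and each call counts against the key's daily query allowance.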
9.6.3. Intersecting Country, Language, and Topic Restrictions
Sometimes you might want to restrict your results to a particular
language in a particular country, or a particular language,
particular country, and particular topic. Now here's
where things start looking a little on the odd side.
The rules are as follows:
- Omit something by prepending it with a - (minus sign).
- Separate restrictions with a . (period, or full stop); spaces are not allowed.
- Specify an OR relationship between two restrictions with a | (pipe).
- Group restrictions with parentheses. You can have parentheses within
parentheses (nested parentheses) for fine-grained control over
grouping in your queries.
Let's say you want a query to return results in
French, draw only from Canadian sites, and focus only within the
Linux topic. Your query would look something like this:
my $results = $google_search ->
doGoogleSearch(
$google_key, $query, 100, 10,
"true", "linux.countryCA", "true", "lang_fr",
"utf8", "utf8"
);
For results from Canada or from France, you would use:
"linux.(countryCA|countryFR)"
Or maybe you want results in French, but from anywhere but France:
"linux.(-countryFR)"
For a comprehensive list of restricts, see Section 2.4,
"Restricts," of
APIs_Reference.html, part of the Google API
documentation.
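The rules above are simple enough to wrap in a few string helpers, which can make longer restrictions easier to read and harder to mistype. This is only a sketch: the function names (any_of, all_of, exclude, group) are hypothetical, and the helpers do nothing more than assemble strings in the shapes shown in this section, for either the restrict or the lr argument. They do not check that a given topic, country, or language code actually exists.

```perl
use strict;
use warnings;

# Spaces are not allowed anywhere in a restriction.
sub _no_spaces {
    for my $piece (@_) {
        die "restrictions may not contain spaces: '$piece'\n"
            if $piece =~ /\s/;
    }
    return @_;
}

# Hypothetical helpers, one per rule.
sub any_of  { return join '|', _no_spaces(@_) }   # OR:  a|b
sub all_of  { return join '.', _no_spaces(@_) }   # AND: a.b
sub exclude { return '-' . $_[0] }                # omit: -a
sub group   { return '(' . $_[0] . ')' }          # parentheses: (a)

my $canada_or_france = all_of('linux', group(any_of('countryCA', 'countryFR')));
my $not_france       = all_of('linux', group(exclude('countryFR')));
my $english_or_german = any_of('lang_en', 'lang_de');

print "$canada_or_france\n";
print "$not_france\n";
print "$english_or_german\n";
```

The first two strings would go in the restrict slot of the query and the third in the lr slot.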
9.6.4. Putting Query Elements to Use
You might use the different elements of the query as follows:
- Using SafeSearch
-
If you're building a program that's
for family-friendly use, you'll probably want to
have SafeSearch turned on as a matter of course. But you can also use
it to compare safe and unsafe results. [Hack #35] does just
that. You could create a program that takes a word from a web form
and checks its counts in filtered and unfiltered searches, providing
a naughty rating for the word based on the
counts.
- Setting search result numbers
-
Whether you request 1 or 10 results, you're still
using one of your developer key's daily dose of a
thousand Google Web API queries. Wouldn't it then
make sense to always request 10? Not necessarily; if
you're using only the top result—to bounce the
browser to another page, generate a random query string for a
password, or whatever—you might as well add even the minutest
amount of speed to your application by not requesting results that
you're just going to throw out or ignore.
- Searching different topics
-
With four different specialty topics available for searching through
the Google API, dozens of different languages, and dozens of
different countries, there are thousands of combinations of
topic/language/country restrictions that you could work through.
Consider an application that tracks open source by country. You
could create a list of keywords very specific to open source (such as
linux, perl, etc.) and create a
program that cycles through a series of queries, each restricting the
search to an open source topic (such as linux) and
a particular country. So you might discover that
perl was mentioned in France in the
linux topic 15 times, in Germany 20 times, and so
on.
You could also concentrate less on the program itself and more on an
interface to access these variables. How about a form with pull-down
menus that allows you to restrict your searches by continent (instead
of country)? You could specify which continent in a variable
that's passed to the query. Or how about an
interface that lets the user specify a topic and cycles through a
list of countries and languages, pulling result counts for each one?
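The country-cycling idea can be sketched without committing to any particular interface. In the example below, counts_by_country is a hypothetical helper, and $count_for is a caller-supplied code reference that stands in for one real search returning a result count. The stub and its numbers are placeholder data echoing the France and Germany figures mentioned above, not real Google results. The code countryDE for Germany is an assumption; check the country list in the API documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: run one query per country within a topic and
# collect the counts. $count_for->($keyword, $restrict) must return a
# number; in a real program it would wrap a doGoogleSearch call.
sub counts_by_country {
    my ($count_for, $keyword, $topic, @countries) = @_;
    my %counts;
    for my $country (@countries) {
        my $restrict = "$topic.$country";    # e.g. linux.countryFR
        $counts{$country} = $count_for->($keyword, $restrict);
    }
    return %counts;
}

# Stand-in data only, so the sketch runs without a key or a network.
my %fake = ('linux.countryFR' => 15, 'linux.countryDE' => 20);
my $stub = sub { my ($keyword, $restrict) = @_; return $fake{$restrict} || 0 };

my %counts = counts_by_country($stub, 'perl', 'linux', 'countryFR', 'countryDE');
for my $country (sort keys %counts) {
    print "$country: $counts{$country}\n";
}
```

Keep in mind that every country in the list costs one query from the daily allowance, so a long country list multiplied by a long keyword list adds up quickly.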