Hack 45. Glean Weblog-Free Google Results

With so many weblogs being indexed by Google, you might worry about too much emphasis on the hot topic of the moment. In this hack, we'll show you how to remove the weblog factor from your Google results.

Weblogs—those frequently updated, link-heavy personal pages—are quite the fashionable thing these days. There are at least 4,000,000 active weblogs across the Internet, covering almost every possible subject and interest. For humans, they're good reading, but for search engines, they're heavenly bundles of fresh content and links galore.

Some people think that search engines' delight in weblogs slants search results, placing too much emphasis on a small group of recent pages at the expense of evergreen content. As I write, for example, I am the twelfth most important Ben on the Internet, according to Google. This rank comes solely from my weblog's popularity.

This hack searches Google, discarding any results coming from weblogs. It uses the Google Web Services API (http://api.google.com) and the API of Technorati (http://www.technorati.com/members), an excellent interface to David Sifry's weblog data-tracking tool. Both APIs require keys, available from the URLs mentioned.

Finally, you'll need a simple HTML page with a form that passes a text query to the parameter q (the query that will run on Google), something like this:

<form action="googletech.cgi" method="POST">
Your query: <input type="text" name="q">
<input type="submit" name="Search!" value="Search!">
</form>

Save the form as googletech.html.

2.27.1. The Code

Save the following code ["How to Run the Hacks" in the Preface] to a file called googletech.cgi.

You'll need the XML::Simple and SOAP::Lite Perl modules to run this hack.

#!/usr/bin/perl -w
# googletech.cgi
# Getting Google results
# without getting weblog results.

use strict;
use SOAP::Lite;
use XML::Simple;
use CGI qw(:standard);
use HTML::Entities ( );
use LWP::Simple qw(!head);

my $technoratikey = "insert technorati key here";
my $googlekey = "insert google key here";

# Set up the query term
# from the CGI input.
my $query = param("q");

# Initialize the SOAP interface and run the Google search.
my $google_wsdl = "http://api.google.com/GoogleSearch.wsdl";
my $service = SOAP::Lite->service($google_wsdl);
my $result = $service->doGoogleSearch(
    $googlekey, $query,      # your Google key and the search terms
    0, 10,                   # offset of results, number of results
    "false", "", "false",    # filter, restrict, safeSearch
    "", "latin1", "latin1"   # lr, ie, oe
);

# Start returning the results page;
# do this now to prevent timeouts.
my $cgi = new CGI;

print $cgi->header( );
print $cgi->start_html(-title=>'Blog Free Google Results');
print $cgi->h1('Blog Free Results for ' . HTML::Entities::encode($query));
print $cgi->start_ul( );

# Go through each of the results.
foreach my $element (@{$result->{'resultElements'}}) {

    my $url = HTML::Entities::encode($element->{'URL'});

    # Request the Technorati information for each result.
    my $technorati_result = get("http://api.technorati.com/bloginfo?".
                                "url=$url&key=$technoratikey");

    # Parse this information.
    my $parser = new XML::Simple;
    my $parsed_feed = $parser->XMLin($technorati_result);

    # If Technorati considers this site to be a weblog,
    # go on to the next result. If not, display it, and then go on.
    if ($parsed_feed->{document}{result}{weblog}{name}) { next; }
    else {
        print $cgi->li('<a href="'.$url.'">'.$element->{title}.'</a>');
        print $cgi->li("$element->{snippet}");
    }
}

print $cgi->end_ul( );
print $cgi->end_html;

Let's step through the meaningful bits of this code. First comes pulling in the query from the form and running it against Google. Notice the 10 in the doGoogleSearch call; this is the number of search results requested from Google. You should set this as high as Google will allow whenever you run the script; otherwise, you might find that searching for terms that are extremely popular in the weblogging world returns no results at all, every one of them having been rejected as originating from a blog.
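To make the position of that 10 concrete, here is a minimal sketch of the ten positional arguments doGoogleSearch takes. The helper name google_search_args is hypothetical and exists only for illustration, and the "latin1" encodings are an assumed choice rather than something the hack requires.

```perl
use strict;
use warnings;

# Hypothetical helper: build the ten positional arguments for doGoogleSearch.
sub google_search_args {
    my ($key, $query, $max_results) = @_;
    return (
        $key,            # key: your Google API key
        $query,          # q: the search terms
        0,               # start: offset of the first result
        $max_results,    # maxResults: the "10" discussed above
        "false",         # filter: do not collapse similar results
        "",              # restrict: no country or topic restriction
        "false",         # safeSearch: off
        "",              # lr: no language restriction
        "latin1",        # ie: input encoding (assumed)
        "latin1",        # oe: output encoding (assumed)
    );
}

my @args = google_search_args("insert google key here", "perl hacks", 10);
print scalar(@args), " arguments, maxResults = $args[3]\n";

# With a SOAP::Lite service object, the list would be passed as:
#   my $result = $service->doGoogleSearch(@args);
```

Keeping the arguments in one place makes it easy to change the number of results without hunting through the call.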

Since we're about to make a web services call for every one of the returned results, which might take a while, we want to start returning the results page now; this helps prevent connection timeouts. As such, we spit out a header using the CGI module, and then jump into our loop.
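One detail worth checking here: Perl normally buffers STDOUT when it is not attached to a terminal, so an early header may sit in the buffer until much later. The sketch below assumes that turning on autoflush is acceptable for your setup; whether the web server forwards the partial page straight away depends on the server. It prints a bare header rather than using the CGI module, purely to stay self-contained.

```perl
use strict;
use warnings;

# Turn on autoflush for STDOUT so each print is sent immediately
# instead of waiting for the buffer to fill.
$| = 1;

# A bare CGI header followed by the top of the page.
print "Content-Type: text/html\n\n";
print "<html><head><title>Blog Free Google Results</title></head><body>\n";
print "<ul>\n";

# ... the slow per-result lookups would happen here ...

print "</ul>\n</body></html>\n";
```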

We then get to the final part of our code: actually looping through the search results returned by Google and passing each HTML-encoded URL to the Technorati API in a GET request. Technorati then returns its results as an XML document.
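A result URL that itself contains characters such as & or ? can be misread as part of the Technorati query string. As a possible variation, not what the hack itself does, the request URL could be built with URI escaping instead. The helper name technorati_url is hypothetical; uri_escape comes from the URI::Escape module.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Hypothetical helper: build the bloginfo request URL, percent-encoding
# the result URL so its own & and ? do not split the query string.
sub technorati_url {
    my ($url, $key) = @_;
    return "http://api.technorati.com/bloginfo?"
         . "url=" . uri_escape($url)
         . "&key=" . uri_escape($key);
}

print technorati_url("http://example.com/?a=1&b=2", "KEY"), "\n";
```

The HTML-encoded form of the URL is still the right one to use inside the href attribute of the link you print.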

Be careful that you do not run out of Technorati requests. As I write this, Technorati is offering 500 free requests a day, which, with this script, is around 50 searches. If you make this script available to your web site audience, you will soon run out of Technorati requests. One possible workaround is forcing the user to enter her own Technorati key. You can get the user's key from the same form that accepts the query. See the "Hacking the Hack" section for a means of doing this.

Parsing this result is a matter of passing it through XML::Simple. Since Technorati returns only an XML construct containing name when the site is thought to be a weblog, we can use the presence of this construct as a marker. If the program sees the construct, it skips to the next result. If it doesn't, the site is not thought to be a weblog by Technorati and we display a link to it, along with the title and snippet (when available) returned by Google.
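The check described above can be sketched as a small function. The name is_weblog is hypothetical, and the two sample documents are made up to match the path the hack reads (document, result, weblog, name); they are not captured Technorati responses. The sketch also assumes that a failed or unparsable fetch should count as "not a weblog" so that the result is still shown.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical helper: true when the parsed response carries a weblog name.
sub is_weblog {
    my ($xml) = @_;

    # A failed fetch returns undef; treat that as "not a weblog" (assumption).
    return 0 unless defined($xml) && length($xml);

    # XMLin dies on malformed XML, so trap that as well.
    my $parsed = eval { XMLin($xml) };
    return 0 unless ref($parsed) eq 'HASH';

    my $doc = $parsed->{document};
    return 0 unless ref($doc) eq 'HASH';

    my $result = $doc->{result};
    return 0 unless ref($result) eq 'HASH';

    my $weblog = $result->{weblog};
    return (ref($weblog) eq 'HASH' && defined $weblog->{name}) ? 1 : 0;
}

# Made-up sample responses, shaped only to exercise the check.
my $blog_xml  = '<tapi><document><result><url>http://example.org</url>'
              . '<weblog><name>Example Blog</name></weblog>'
              . '</result></document></tapi>';
my $plain_xml = '<tapi><document><result><url>http://example.com</url>'
              . '</result></document></tapi>';

print "blog:  ", is_weblog($blog_xml),  "\n";
print "plain: ", is_weblog($plain_xml), "\n";
```

Checking each level with ref avoids a fatal error under use strict if Technorati returns something other than the expected nesting.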

2.27.2. Running the Hack

Point your browser at the form googletech.html.

2.27.3. Hacking the Hack

As mentioned previously, this script can burn through your Technorati allowances rather quickly under heavy use. The simplest way of solving this is to force the end user to supply his own Technorati key. First, add a new input to your HTML form for the user's key:

Your Technorati key: <input type="text" name="key">

Then, suck in the user's key as a replacement for your own:

# Set up the query term
# from the CGI input.
my $query = param("q");
$technoratikey = param("key");
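If the visitor leaves the key field blank, every Technorati lookup will fail, so it may be worth checking the value before running the search. This is a minimal sketch; the helper name user_supplied_key is hypothetical, and how you report a missing key to the visitor is left open.

```perl
use strict;
use warnings;

# Hypothetical helper: return the trimmed key, or undef if the field
# was missing or contained only whitespace.
sub user_supplied_key {
    my ($raw) = @_;
    return undef unless defined $raw;
    $raw =~ s/^\s+|\s+$//g;
    return length($raw) ? $raw : undef;
}

# In the CGI script this might be used as:
#   my $technoratikey = user_supplied_key(scalar param("key"));
#   unless (defined $technoratikey) { ...ask the visitor for a key... }

print defined(user_supplied_key("  abc123  ")) ? "key ok\n" : "no key\n";
print defined(user_supplied_key("   "))        ? "key ok\n" : "no key\n";
```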

Ben Hammersley
