Hack 30. Restrict Searches to Top-Level Results

Separate out search results by the depth at which they appear in a site.

Google's a mighty big haystack in which to find the needle you seek. And there's more, so much more: some experts believe that Google and its ilk index only a bare fraction of the pages available on the Web. Because the Web's growing all the time, researchers have to come up with lots of different tricks to narrow down search results. Tricks and, thanks to the Google API, tools. This hack separates out search results appearing at the top level of a domain from those beneath. Why would you want to do this?
2.12.1. The Code

Save the code as a CGI script ["How to Run the Hacks" in the Preface] named gootop.cgi:

    #!/usr/local/bin/perl
    # gootop.cgi
    # Separates out top-level and sub-level results.
    # gootop.cgi is called as a CGI with form input.

    # Your Google API developer's key.
    my $google_key='insert key here';

    # Location of the GoogleSearch WSDL file.
    my $google_wdsl = "./GoogleSearch.wsdl";

    # Number of times to loop, retrieving 10 results at a time.
    my $loops = 10;

    use strict;

    use SOAP::Lite;
    use CGI qw/:standard *table/;

    # Print the page header and the query form.
    print
      header( ),
      start_html("GooTop"),
      h1("GooTop"),
      start_form(-method=>'GET'),
      'Query: ', textfield(-name=>'query'),
      ' ',
      submit(-name=>'submit', -value=>'Search'),
      end_form( ), p( );

    my $google_search = SOAP::Lite->service("file:$google_wdsl");

    if (param('query')) {
      my $list = { 'toplevel' => [], 'sublevel' => [] };

      # One query per loop: offsets 0, 10, 20, ... ($loops queries in all).
      for (my $offset = 0; $offset < $loops*10; $offset += 10) {
        my $results = $google_search ->
          doGoogleSearch(
            $google_key, param('query'), $offset,
            10, "false", "", "false", "", "latin1", "latin1"
          );

        foreach (@{$results->{'resultElements'}}) {
          # File each result under 'toplevel' or 'sublevel' by its URL.
          push @{
            $list->{ $_->{URL} =~ m!://[^/]+/?$! ? 'toplevel' : 'sublevel' }
          },
            p(
              b($_->{title}||'no title'), br( ),
              a({href=>$_->{URL}}, $_->{URL}), br( ),
              i($_->{snippet}||'no snippet')
            );
        }
      }

      print
        h2('Top-Level Results'),
        join("\n", @{$list->{toplevel}}),
        h2('Sub-Level Results'),
        join("\n", @{$list->{sublevel}});
    }

    print end_html;

Gleaning a decent number of top-level results means throwing out quite a few others. For this reason, the script runs the specified query a number of times, as set by my $loops = 10;, with each loop picking up 10 results, only some of which are top-level. To alter the number of loops per query, simply change the value of $loops. Realize that each invocation of the script burns through $loops queries, so be sparing and don't bump that number up to anything ridiculous; even 100 will eat through a daily allotment in just 10 invocations.
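To see how quickly a larger $loops value uses up a key, the arithmetic above can be sketched in a few lines of Perl. This is only a rough illustration: the 1,000-query daily allotment is an assumption inferred from the "100 loops, 10 invocations" figure, and invocations_per_day is a hypothetical helper, not part of gootop.cgi.

```perl
#!/usr/bin/perl
# Rough query-budget arithmetic for a script that issues $loops queries per run.
use strict;
use warnings;

# ASSUMPTION: a daily allotment of 1,000 queries per developer key.
my $daily_allotment = 1000;

# Hypothetical helper: how many complete runs fit into the daily allotment
# when each run issues $loops queries.
sub invocations_per_day {
    my ($loops, $allotment) = @_;
    die "loops must be a positive integer\n"
        unless defined $loops && $loops =~ /^[1-9]\d*$/;
    return int($allotment / $loops);
}

# Show the trade-off for a few settings of $loops.
for my $loops (1, 10, 100) {
    printf "%3d loops: up to %4d results per run, %4d runs per day\n",
        $loops, $loops * 10, invocations_per_day($loops, $daily_allotment);
}
```

Each run can return at most $loops * 10 results, so the table this prints is simply the trade-off between results per run and runs per day under the assumed allotment.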
The heart of the script, and what differentiates it from your average Google API Perl script [Hack #92], lies in the code that follows.

    push @{
      $list->{ $_->{URL} =~ m!://[^/]+/?$! ? 'toplevel' : 'sublevel' }
    }

What that jumble of characters scans for is :// (as in http://) followed by one or more characters other than a / (slash), right up to the end of the URL, thereby sifting top-level finds (e.g., http://www.berkeley.edu) from sublevel results (e.g., http://www.berkeley.edu/students/john_doe/my_dog.html). If you're Perl savvy, you may have noticed the trailing /?$; this allows for the eventuality that a top-level URL ends with a slash (e.g., http://www.berkeley.edu/), as is often true.

2.12.2. Running the Hack

This hack runs as a CGI script. Figure 2-6 shows the results of a search for non-gmo (Genetically Modified Organisms, that is).

Figure 2-6. GooTop search for non-gmo

2.12.3. Hacking the Hack

There are a couple of ways to hack this hack.

2.12.3.1 More depth

Perhaps your interests lie in just how deep results are within a site or sites. A minor adjustment or two to the code and you have results grouped by depth:

    #!/usr/bin/perl
    # gootop.cgi
    # Groups results by their depth within a site.
    # gootop.cgi is called as a CGI with form input.

    # Your Google API developer's key.
    my $google_key='insert key here';

    # Location of the GoogleSearch WSDL file.
    my $google_wdsl = "./GoogleSearch.wsdl";

    # Number of times to loop, retrieving 10 results at a time.
    my $loops = 1;

    use strict;

    use lib qw!/home/rael/lib/perl!; #FIXME
    use SOAP::Lite;
    use CGI qw/:standard *table/;

    # Print the page header and the query form.
    print
      header( ),
      start_html("GooTop"),
      h1("GooTop"),
      start_form(-method=>'GET'),
      'Query: ', textfield(-name=>'query'),
      ' ',
      submit(-name=>'submit', -value=>'Search'),
      end_form( ), p( );

    my $google_search = SOAP::Lite->service("file:$google_wdsl");

    if (param('query')) {
      my @list = ( );

      # One query per loop: offsets 0, 10, 20, ... ($loops queries in all).
      for (my $offset = 0; $offset < $loops*10; $offset += 10) {
        my $results = $google_search ->
          doGoogleSearch(
            $google_key, param('query'), $offset,
            10, "false", "", "false", "", "latin1", "latin1"
          );

        foreach (@{$results->{'resultElements'}}) {
          # Depth is the number of slash-separated fields minus the three
          # taken up by "http:", the empty field, and the hostname. The
          # appended space keeps split from dropping a trailing empty field.
          push @{ $list[scalar ( split(/\//, $_->{URL} . ' ') - 3 ) ] },
            p(
              b($_->{title}||'no title'), br( ),
              a({href=>$_->{URL}}, $_->{URL}), br( ),
              i($_->{snippet}||'no snippet')
            );
        }
      }

      for my $level (1..$#list) {
        print h2("Level: $level");
        ref $list[$level] eq 'ARRAY' and print join "\n", @{$list[$level]};
      }
    }

    print end_html;

Figure 2-7 shows that non-gmo search again using the depth hack.

Figure 2-7. GooTop non-gmo search using depth hack

2.12.3.2 Query tips

Along with the aforementioned code hacking, here are a few query tips to use with this hack: