Hack 49. Search Yesterday's Index

Monitor a set of queries for new finds added to the Google index yesterday.

[Hack #48] is a simple web form-driven CGI script for building date range Google queries. A simple web-based interface is fine when you want to search for only one or two items at a time. But what of performing multiple searches over time, saving the results to your computer for comparative analysis? A better fit for this task is a client-side application that you run from the comfort of your own computer's desktop.

This Perl script feeds specified queries to Google via the Google Web API, limiting results to those indexed yesterday. New finds are appended to a comma-delimited text file per query, suitable for import into Excel or your average database application.

2.31.1. The Queries

First, you'll need to prepare a few queries to feed the script. Try these out via the Google search interface itself first to make sure you're receiving the kind of results you're expecting. Your queries can be anything that you'd be interested in tracking over time: topics of long-lasting or current interest, searches for new directories of information [Hack #1] coming online, or unique quotes from articles or other sources that you want to monitor for signs of plagiarism.

Use whatever special syntaxes you like except for link:. As you might remember, link: can't be used in concert with any other special syntax such as daterange:, upon which this hack relies. If you insist on trying anyway (e.g., link:www.yahoo.com daterange:2452421-2452521), Google will simply treat link as yet another query word (e.g., link www.yahoo.com), yielding some unexpected and useless results.

Put each query on its own line. A sample query file will look something like this:

    "digital archives"
    intitle:"state library of"
    intitle:directory intitle:resources
    "now * * time for all good men * come * * aid * * party"

Save the text file somewhere memorable; alongside the script you're about to write is as good a place as any.

2.31.2. The Code

Save the following code as goonow.pl. Be sure to replace insert key here with your Google API key along the way.

    #!/usr/local/bin/perl -w
    # goonow.pl
    # Feeds queries specified in a text file to Google, querying
    # for recent additions to the Google index. The script appends
    # to CSV files, one per query, creating them if they don't exist.
    # usage: perl goonow.pl [query_filename]

    # My Google API developer's key.
    my $google_key='insert key here';

    # Location of the GoogleSearch WSDL file.
    my $google_wsdl = "./GoogleSearch.wsdl";

    use strict;

    use SOAP::Lite;
    use Time::JulianDay;

    $ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";

    # Julian date of the day before yesterday (see the note after the listing).
    my $julian_date = int(local_julian_day(time)) - 2;

    my $google_search = SOAP::Lite->service("file:$google_wsdl");

    open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!";

    while (my $query = <QUERIES>) {
      chomp $query;
      warn "Searching Google for $query\n";

      # Name the output file after the bare query, before the date range
      # is appended, so each query keeps a single CSV file across runs.
      (my $outfile = $query) =~ s/\W+/_/g;
      $query .= " daterange:$julian_date-$julian_date";

      open (OUT, ">> $outfile.csv")
        or die "Couldn't open $outfile.csv: $!\n";

      my $results = $google_search ->
        doGoogleSearch(
          $google_key, $query, 0, 10, "false", "", "false",
          "", "latin1", "latin1"
        );

      foreach (@{$results->{'resultElements'}}) {
        print OUT '"' . join('","', (
          map {
            s!\n!!g;     # drop spurious newlines
            s!<.+?>!!g;  # drop all HTML tags
            s!"!""!g;    # double escape " marks
            $_;
          } @$_{'title','URL','snippet'}
        ) ) . "\"\n";
      }

      close OUT;
    }

    close QUERIES;

You'll notice that GooNow checks the day before yesterday's additions rather than yesterday's (my $julian_date = int(local_julian_day(time)) - 2;). Google indexes some pages very frequently; these show up in yesterday's additions and really bulk up your search results. So if you search for yesterday's additions, you'll get a lot of noise (pages that Google indexes every day) rather than the fresh content that you're after. Skipping back one more day is a nice hack to get around the noise.

2.31.3. Running the Hack

This script is invoked on the command line ["Running the Hacks" in Preface] like so:

    $ perl goonow.pl query_filename

where query_filename is the name of the text file holding all the queries to be fed to the script. The file can be located either in the local directory or elsewhere; if the latter, be sure to include the entire path (e.g., /mydocu~1/hacks/queries.txt).

Bear in mind that all output is directed to CSV files, one per query, so don't expect any fascinating output on the screen.

2.31.4. The Results

Here's a quick look at one of the CSV output files created, intitle_state_library_of_.csv:

    "State Library of Louisiana","http://www.state.lib.la.us/"," ... Click here if you have any questions or comments. Copyright © 1998-2001 State Library of Louisiana Last modified: August 07, 2002. "
    "STATE LIBRARY OF NEW SOUTH WALES, SYDNEY AUSTRALIA","http://www.slnsw.gov.au/"," ... State Library of New South Wales Macquarie St, Sydney NSW Australia 2000 Phone: +61 2 9273 1414 Fax: +61 2 9273 1255. Your comments You could win a prize! ... "
    "State Library of Victoria","http://www.slv.vic.gov.au/"," ... clicking on our logo. State Library of Victoria Logo with link to homepage State Library of Victoria. A world class cultural resource ... "
    ...

2.31.5. Hacking the Hack

The script keeps appending new finds to the appropriate CSV output file. If you wish to reset the CSV files associated with particular queries, simply delete them, and the script will create them anew.

Or you can make one slight adjustment to have the script create the CSV files anew each time, overwriting the previous version, like so:

    ...
      $query .= " daterange:$julian_date-$julian_date";

      open (OUT, "> $outfile.csv")
        or die "Couldn't open $outfile.csv: $!\n";

      my $results = $google_search ->
        doGoogleSearch(
          $google_key, $query, 0, 10, "false", "", "false",
          "", "latin1", "latin1"
        );
    ...

Notice the only change in the code is the removal of one of the > characters when the output file is created; i.e., open (OUT, "> $outfile.csv") instead of open (OUT, ">> $outfile.csv").
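
If you plan to tinker further, it can help to try the two fiddliest parts of goonow.pl, the date arithmetic and the CSV quoting, in isolation before touching the script itself. The following is only a sketch under a few assumptions: the helper names julian_day_utc, daterange_query, and csv_row are made up for illustration and are not part of the original script, and the Julian day is computed from UTC using core Perl only, whereas goonow.pl relies on Time::JulianDay and local time, so the two can disagree by a day around midnight.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Julian Day Number of 1970-01-01, the date of the Unix epoch.
use constant EPOCH_JDN => 2_440_588;

# Hypothetical helper: Julian Day Number of the UTC date containing
# $epoch_seconds. (goonow.pl uses local time via Time::JulianDay instead.)
sub julian_day_utc {
    my ($epoch_seconds) = @_;
    return EPOCH_JDN + int($epoch_seconds / 86_400);
}

# Hypothetical helper: append a one-day daterange: restriction to a query,
# looking $days_back days before the date of $epoch_seconds.
sub daterange_query {
    my ($query, $epoch_seconds, $days_back) = @_;
    my $jd = julian_day_utc($epoch_seconds) - $days_back;
    return "$query daterange:$jd-$jd";
}

# Hypothetical helper: quote fields as one CSV row, applying the same
# three substitutions as the script. Works on copies of its arguments.
sub csv_row {
    my @fields = @_;
    for (@fields) {
        s/\n//g;       # drop newlines
        s/<.+?>//g;    # drop HTML tags
        s/"/""/g;      # double any embedded " marks
    }
    return '"' . join('","', @fields) . '"';
}

# Try both helpers: a query restricted to the day before yesterday,
# and one quoted CSV row built from made-up field values.
print daterange_query('intitle:directory', time, 2), "\n";
print csv_row('<b>State</b> Library', 'http://www.example.com/', 'a "fine" page'), "\n";
```

Passing the number of days back as an argument makes it easy to compare yesterday's additions (1) with the day before yesterday's (2) for the same query. Because csv_row works on copies, the caller's data is left untouched, unlike the map block in the script, which edits the result fields in place.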