Hack 41. Scrape Yahoo! Buzz for a Google Search

A proof-of-concept hack scrapes the buzziest items from Yahoo! Buzz and submits them to a Google search.

No web site is an island. Billions of hyperlinks link to billions of documents. Sometimes, however, you want to take information from one site and apply it to another site. Unless that site has a web service API like Google's, your best bet is scraping. Scraping means using an automated program to extract specific bits of information from a web page. Examples of the sorts of elements people scrape include stock quotes, news headlines, prices, and so forth. You name it and someone's probably scraped it.

There's some controversy about scraping. Some sites don't mind it, while others can't stand it. If you decide to scrape a site, do it gently; take the minimum amount of information you need and, whatever you do, don't hog the scrapee's bandwidth.

So, what are we scraping? Google has a query popularity page called Google Zeitgeist (http://www.google.com/press/zeitgeist.html). Unfortunately, the Zeitgeist is updated only once a week and contains only a limited amount of scrapable data. That's where Yahoo! Buzz (http://buzz.yahoo.com) comes in. The site is rich with constantly updated information. Its Buzz Index keeps tabs on what's hot in popular culture: celebs, games, movies, television shows, music, and more.

This hack grabs the buzziest of the buzz, the top of the Leaderboard, and searches Google for all it knows on the subject. And to keep things current, only pages indexed by Google within the past few days [Hack #16] are considered.
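The "past few days" restriction comes down to a little date arithmetic: the hack's query uses a daterange: term whose two endpoints are Julian day numbers. The following is a minimal sketch of that arithmetic, not the hack's own code. It assumes UTC rather than local time and uses only core Perl, whereas the hack itself relies on the Time::JulianDay module. The helper names julian_day_from_epoch and build_query are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(floor);

# Julian day number for the UTC calendar date containing $epoch.
# Assumption: dates are taken in UTC, not local time.
# The Unix epoch (1970-01-01) falls on Julian day number 2440588.
sub julian_day_from_epoch {
    my ($epoch) = @_;
    return floor($epoch / 86400) + 2440588;
}

# Build a quoted phrase query restricted to the last $days_back days.
sub build_query {
    my ($phrase, $today, $days_back) = @_;
    return sprintf '"%s" daterange:%d-%d',
        $phrase, $today - $days_back, $today;
}

my $today = julian_day_from_epoch(time);
print build_query('Maria Sharapova', $today, 3), "\n";
```

Calling build_query('Maria Sharapova', 2453295, 3) yields "Maria Sharapova" daterange:2453292-2453295. Keeping the date arithmetic in its own function makes it easy to check against known dates without touching the network.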
2.23.1. The Code

Save the following code to a plain text file named buzzgle.pl:

#!/usr/local/bin/perl
# buzzgle.pl
# Pull the top item from the Yahoo! Buzz Index and query the last
# three days' worth of Google's index for it.
# Usage: perl buzzgle.pl
# Your Google API developer's key.
my $google_key='insert key here';
# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";
# Number of days back to go in the Google index.
my $days_back = 3;
use strict;
use SOAP::Lite;
use LWP::Simple;
use Time::JulianDay;
# Scrape the top item from the Yahoo! Buzz Index.
# Grab a copy of http://buzz.yahoo.com.
my $buzz_content = get("http://buzz.yahoo.com/")
or die "Couldn't grab the Yahoo! Buzz page\n"; # LWP::Simple's get() does not set $!
# Find the first item on the Buzz Index list.
my($buzziest) = $buzz_content =~ m!http://search.yahoo.com/search\?p=.+">(.+?)<\/a>!i;
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;
# Figure out today's Julian date.
my $today = int local_julian_day(time);
# Build the Google query.
my $query = "\"$buzziest\" daterange:" . ($today - $days_back) . "-$today";
print
"The buzziest item on Yahoo Buzz today is: $buzziest\n",
"Querying Google for: $query\n",
"Results:\n\n";
# Create a new SOAP::Lite instance, feeding it GoogleSearch.wsdl.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
# Query Google.
my $results = $google_search ->
doGoogleSearch(
$google_key, $query, 0, 10, "false", "", "false",
"", "latin1", "latin1"
);
# No results?
@{$results->{resultElements}} or die "No results";
# Loop through the results.
foreach my $result (@{$results->{'resultElements'}}) {
my $output =
join "\n",
$result->{title} || "no title",
$result->{URL},
$result->{snippet} || 'no snippet',
"\n";
$output =~ s!<.+?>!!g; # drop all HTML tags
print $output;
}

2.23.2. Running the Hack

The script runs from the command line ["How to Run the Hacks" in the Preface] without need of arguments of any kind. Probably the best thing to do is to direct the output to a pager (a command-line application that allows you to page through long output, usually by hitting the spacebar), like so:

% perl buzzgle.pl | more

Or you can direct the output to a file for later perusal:

% perl buzzgle.pl > buzzgle.txt

As with all scraping applications, this code is fragile, subject to breakage if (read: when) HTML formatting of the Yahoo! Buzz page changes. If you find you have to adjust to match Yahoo!'s formatting, you'll have to alter the regular expression match as appropriate:

my($buzziest) = $buzz_content =~ m!http://search.yahoo.com/search\?p=.+">(.+?)<\/a>!i;
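When the pattern needs adjusting, it can help to test it against a saved snippet of markup rather than the live page. The sketch below is one way to do that, under stated assumptions: the sample HTML is invented for illustration and may not resemble the real Yahoo! Buzz page, and first_buzz_item is a hypothetical helper. Its pattern is a tightened variant of the hack's, with escaped dots and negated character classes in place of the greedy .+, so a match cannot run past the first link.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical markup for illustration only; the real page may differ.
my $sample_html = <<'HTML';
<td><a href="http://search.yahoo.com/search?p=Maria+Sharapova">Maria Sharapova</a></td>
<td><a href="http://search.yahoo.com/search?p=Halloween">Halloween</a></td>
HTML

# Return the text of the first Yahoo! search link, or undef if none matches.
sub first_buzz_item {
    my ($html) = @_;
    my ($item) = $html =~ m{
        http://search\.yahoo\.com/search\?p=   # literal search URL prefix
        [^"]+ "                                # rest of the URL and closing quote
        [^>]* >                                # any remaining attributes
        ([^<]+)                                # link text, captured
        </a>
    }xi;
    return $item;
}

my $buzziest = first_buzz_item($sample_html);
print defined $buzziest
    ? "Top item: $buzziest\n"
    : "No match; adjust the pattern\n";
```

Once the pattern works against a saved copy of the page, it can be pasted back into buzzgle.pl in place of the original match.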
2.23.3. The Results

At the time of this writing, Maria Sharapova, the Russian tennis star, is all the rage:

% perl buzzgle.pl | less
The buzziest item on Yahoo Buzz today is: Maria Sharapova
Querying Google for: "Maria Sharapova" daterange:2453292-2453295
Results:

Maria Sharapova
http://www.mariaworld.net/
everything about Maria Sharapova: photos, interviews, articles, statistics, results and much more! ... Maria Sharapova: 2004 Tokyo Champion! ...

Maria Sharapova
http://www.mariaworld.net/photos.htm
everything about Maria Sharapova: photos, interviews, articles, statistics, results and much more! HOME, BIOGRAPHY, PHOTOS, RESULTS, ...

Maria Sharapova Picture Page
http://milano.vinden.nl/
Maria Sharapova Picture Page. Country: Russia. Date of Birth: April 19, 1987. Place of Birth: Nyagan, Russia. Residence: Bradenton, Florida USA. Height: 1.83 metres ...

2.23.4. Hacking the Hack

Here are some ideas for hacking the hack: