Hack 54. Scrape Google News

Scrape Google News Search results to get at the latest from thousands of aggregated news sources.

Google News, with its thousands of news sources worldwide, is a veritable treasure trove for any news hound. However, because you can't access Google News through the Google API [Chapter 9], you'll have to scrape your results from the HTML of a Google News results page. This hack does just that, gathering up results into a comma-delimited file suitable for loading into a spreadsheet or database. For each news story, it extracts the title, URL, source (i.e., news agency), publication date or age of the news item, and an excerpted description.

Because Google's Terms of Service prohibit the automated access of their search engines except through the Google API, this hack does not actually connect to Google. Instead, it works on a page of results that you've saved from a Google News Search that you've run yourself. Simply save the results page as HTML source using your browser's File → Save As... command.

Make sure the results are listed by date instead of relevance. When results are listed by relevance, some of the descriptions are missing because similar stories are clumped together. You can sort results by date by choosing the "Sort by date" link on the results page or by adding &scoring=d to the end of the results URL. Also, make sure you're getting the maximum number of results by adding &num=100 to the end of the results URL. For example, Figure 4-3 shows results of a query for election 2004, the latest on the 2004 United States Presidential Election—something of great import at the time of this writing.

Figure 4-3. Google News results for election 2004, sorted by date
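If you'd rather not edit the results URL by hand, the two parameters can be tacked on with a few lines of Perl. This is only a sketch: add_params is a hypothetical helper, and the base URL below is a stand-in for whatever results URL your browser shows. It builds a string to paste back into the browser and does not contact Google.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Append key=value pairs to a URL's query string, skipping any
# parameter the URL already sets. (Hypothetical helper.)
sub add_params {
    my ($url, @pairs) = @_;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        next if $url =~ /[?&]\Q$key\E=/;    # already present
        $url .= ($url =~ /\?/ ? '&' : '?') . "$key=$value";
    }
    return $url;
}

# Assumed example URL; substitute the one from your own search.
my $url = 'http://news.google.com/news?q=election+2004';

# scoring=d sorts by date; num=100 asks for the maximum number of results.
print add_params($url, scoring => 'd', num => 100), "\n";
```

Parameters that are already in the URL are left alone, so running the helper twice on the same URL is harmless.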

Bear in mind that scraping web pages is a brittle occupation. A single change in the HTML code underlying the standard Google News results page and you're more than likely to end up with no results.

At the time of this writing, a typical Google News Search result looks a little something like this:

<a href="http://www.townhall.com/columnists/phyllisschlafly/ps20041018.shtml" id=r-1>
Show Me State debate spotlights conservative issues</a><br><font size=-1>
<font color=#6f6f6f>Town Hall,&nbsp;DC&nbsp;-</font> <nobr>8 
minutes ago</nobr></font><br><font size=-1><b>...</b> 
The <b>2004</b> <b>election</b> presents a stark choice to voters 
on many issues. The second debate highlighted those differences on taxes, sovereignty 
<b>...</b> </font><br>

While for most of you this is utter gobbledygook, it is probably of some use to those trying to understand the regular expression matching in the following script.
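To see how a regular expression can carve such a result into fields, here is a minimal, self-contained sketch that works on the sample above. It is an illustration rather than the hack itself: parse_result is a hypothetical helper, the pattern assumes the markup looks exactly like the sample, and only the &nbsp; entity is handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample result shown above, as it might appear in a saved page.
my $html = <<'HTML';
<a href="http://www.townhall.com/columnists/phyllisschlafly/ps20041018.shtml" id=r-1>
Show Me State debate spotlights conservative issues</a><br><font size=-1>
<font color=#6f6f6f>Town Hall,&nbsp;DC&nbsp;-</font> <nobr>8
minutes ago</nobr></font><br><font size=-1><b>...</b>
The <b>2004</b> <b>election</b> presents a stark choice to voters
on many issues. The second debate highlighted those differences on taxes, sovereignty
<b>...</b> </font><br>
HTML

# Hypothetical helper: return a hash ref of fields, or nothing on no match.
sub parse_result {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /g;    # the only entity this sample uses
    $text =~ s/\s+/ /g;       # fold line breaks and runs of spaces
    return unless $text =~
        m{<a href="([^"]+)"[^>]*>(.+?)</a><br>(.+?)<nobr>(.+?)</nobr>.*?<br>(.+?)<br>}s;
    my %field = (url => $1, title => $2, source => $3,
                 age => $4, description => $5);
    for (values %field) {
        s/<[^>]+>//g;         # drop leftover tags
        s/^\s+|\s+$//g;       # trim leading and trailing spaces
    }
    return \%field;
}

my $r = parse_result($html) or die "no match\n";
print "$_: $r->{$_}\n" for qw(title url source age description);
```

The five capture groups line up with the pieces of the sample: the link's href, the link text, everything up to the <nobr> tag (the source), the contents of <nobr> (the age), and the excerpt between the next two <br> tags.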

4.8.1. The Code

Save this code to a file called news2csv.pl:

#!/usr/bin/perl
# news2csv.pl
# Google News Results exported to CSV suitable for import into Excel.
# Usage: perl news2csv.pl < news.html > news.csv

print qq{"title","link","source","date or age", "description"\n};

my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&',
  '&quot;'=>'"', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

my $results = join '', <>;
$results =~ s/($unescape_re)/$unescape{$1}/migs; # unescape HTML
$results =~ s![\n\r]! !migs; # drop spurious newlines

# Keep the pattern on one line: without the /x flag, a line break inside
# it would have to match literally. [^>]* tolerates extra attributes on
# the link, such as the id=r-1 in the sample result.
while ( $results =~ m!<a href="([^"]+)"[^>]*>(.+?)</a><br>(.+?)<nobr>(.+?)</nobr>.*?<br>(.+?)<br>!migs ) {
  my($url, $title, $source, $date_age, $description) =
    ($1||'',$2||'',$3||'',$4||'', $5||'');
  $title =~ s!"!""!g; # double escape " marks
  $description =~ s!"!""!g;
  my $output =
    qq{"$title","$url","$source","$date_age","$description"\n};
  $output =~ s!<.+?>!!g; # drop all HTML tags
  print $output;
}
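The script doubles quotation marks in the title and description only, and strips tags from the assembled line at the end. One way to apply the same treatment to every field is a small quoting helper. The following is a sketch, not part of news2csv.pl: csv_field and csv_row are hypothetical names, and the sample values are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quote one value as a CSV field: strip HTML tags, double any embedded
# double quotes, then wrap the result in double quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/<[^>]+>//g;    # drop HTML tags
    $value =~ s/"/""/g;        # " becomes ""
    return qq{"$value"};
}

# Join any number of values into one CSV line.
sub csv_row {
    return join(',', map { csv_field($_) } @_) . "\n";
}

# Invented sample values: a title with quotes, a URL, a tagged excerpt.
print csv_row('Kerry: "I have a plan"',
              'http://www.example.com/story.html',
              '<b>2004</b> election');
```

Doubling the quotation marks is what lets a spreadsheet tell a quote inside a field from the quote that ends it.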

4.8.2. Running the Script

Run the script from the command line ["Running the Hacks" in the Preface], specifying the Google News results HTML filename and the name of the CSV file that you wish to create. To append additional results to an existing CSV file rather than overwrite it, use >> in place of >. For example, using news.html as our input and news.csv as our output:

$ perl news2csv.pl < news.html > news.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal.

4.8.3. The Results

The following are some of the 20,300 results returned by a Google News Search for election 2004, taken from the HTML page of results shown in Figure 4-3:

$ perl news2csv.pl < news.html
"title","link","source","date or age", "description"
"Bush and Kerry on the razor's edge","http://www.dallasnews.com/sharedcontent/dws/dn/opinion/columnists/mdavis/stories/101304dnedidavis.97af9.html","Dallas Morning News (subscription), TX - ","12 minutes ago","... But just as the election is a referendum on Mr. Bush, debate analysis depends on ... Tonight history will write whether the Bush debate grade for 2004 is generally ...  "
"MINNESOTA: Notes and quotes from battleground Minnesota in ...","http://www.twincities.com/mld/twincities/news/state/minnesota/9902651.htm","Pioneer Press (subscription), MN - ","16 minutes ago","... the state. Both sides had assumed Minnesota would easily go Democratic, as it had for every presidential election since 1972. In ...  "
"MINNESOTA: Notes and quotes from battleground Minnesota in ...","http://www.duluthsuperior.com/mld/duluthsuperior/news/politics/9902651.htm","Duluth News Tribune, MN - ","19 minutes ago","... the state. Both sides had assumed Minnesota would easily go Democratic, as it had for every presidential election since 1972. In ...  "

...

Each listing actually occurs on its own line.

4.8.4. Hacking the Hack

You'll want to leave most of the news2csv.pl script alone, since it's been built to make sense out of the Google News formatting. If you don't like the way the program organizes the information that's taken out of the results page, you can change it. Just rearrange the variables on the following line, sorting them in any way that you choose. Be sure to keep a comma between each one.

my $output =
  qq{"$title","$url","$source","$date_age","$description"\n};

For example, perhaps you want only the URL and title. The line should read:

my $output =
  qq{"$url","$title"\n};

That \n specifies a new line, and the $ characters specify that $url and $title are variable names; keep them intact.

Of course, now your output won't match the header at the top of the CSV file, which by default is printed by this line:

print qq{"title","link","source","date or age", "description"\n};

As before, simply change this to match, as follows:

print qq{"url","title"\n};
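If you expect to rearrange the columns often, another approach is to keep the column order in a single list and derive both the header and each row from it, so the two cannot drift apart. This is a sketch under stated assumptions: @columns, %label, csv_line, header_line, and record_line are hypothetical names, the labels are taken from the script's default header, and the record is invented to stand in for fields captured by the regular expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Columns to emit, in order. Edit this one list to change both the
# header row and every data row.
my @columns = qw(url title);

# Header labels from the script's default header line.
my %label = (
    title       => 'title',
    url         => 'link',
    source      => 'source',
    date_age    => 'date or age',
    description => 'description',
);

# Quote each value (doubling embedded quotes) and join with commas.
sub csv_line {
    return join(',', map { my $v = $_; $v =~ s/"/""/g; qq{"$v"} } @_) . "\n";
}

sub header_line { return csv_line(map { $label{$_} } @columns) }

sub record_line {
    my ($record) = @_;
    return csv_line(map { defined $record->{$_} ? $record->{$_} : '' } @columns);
}

# Invented record standing in for one parsed news result.
my %story = (
    title       => "Bush and Kerry on the razor's edge",
    url         => 'http://www.example.com/story.html',
    source      => 'Dallas Morning News',
    date_age    => '12 minutes ago',
    description => '...',
);

print header_line(), record_line(\%story);
```

With @columns set to url and title, only those two fields appear, under the labels "link" and "title"; adding source to the list would add a matching column to both the header and the rows.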
