Hack 57. Scrape Google Groups

Pull results from Google Groups searches into a comma-delimited file.

It's easy to look at the Internet and say that it's a group of web pages or computers or networks. But look a little deeper and you'll see that the core of the Internet is discussions: mailing lists, online forums, and even web sites, where people hold forth in glorious HTML, waiting for people to drop by, consider their philosophies, make contact, or buy their products and services.

Nowhere is the Internet-as-conversation idea more prevalent than in Usenet newsgroups. Google Groups has an archive of over 800 million messages from years of Usenet traffic. If you're researching a particular time, searching and saving Google Groups message pointers comes in really handy.

Because Google Groups is not searchable by the current version of the Google API, you can't build an automated Google Groups query tool without violating Google's Terms of Service. However, you can scrape the HTML of a page you visit personally and save to your hard drive.

The first thing that you need to do is run a Google Groups Search. See the "Google Groups" section earlier in this chapter for some hints on the best practices for searching this massive message archive.

This hack works with Google Groups, not Google Groups 2. While any sort of scraping is brittle, we expect Version 2 to change form many times in the very near future and wanted to be sure you had the best chance of success with this hack.

It's best to put pages that you're going to scrape in order of date; that way if you're going to scrape more pages later, it's easy to look at them and check the last date that the search results changed. Let's say that you're trying to keep up with uses of Perl in programming the Google API; your query might look like this:

perl group:google.public.web-apis

On the right side of the results page is an option to sort either by relevance or date; click the "Sort by date" link. Your results page should look something like Figure 4-11.

Figure 4-11. The results of a Google Groups Search, sorted by date

Save this page to your hard drive, naming it something memorable, like groups.html.

Scraping is brittle at best. A single change in the HTML code underlying Google Groups pages and the script won't get very far.

At the time of this writing, a typical Google Groups Search result looks like this:

<a href=/groups?q=perl+group:google.public.web-apis&hl=en&lr=&c2coff=1&
safe=off&scoring=d&selm=bfd91813.0408311406.21d2bb89%40posting.google.com&rnum
=1>queries or results ?</a><font size=-1><br> <b>...</b> 
Yet when making a query, via the <b>perl</b> Net::Google module, setting 
max_results to 50 works fine and returns 50 results, which was not what I had expected. 
<b>...</b> <br><font color=green><a href=/groups?hl=en&
lr=&c2coff=1&safe=off&group=google.public.web-apis class=a>google.public.
web-apis</a> - Aug 31, 2004 by sean - <a href=/groups?hl=en&lr=&c2coff=1
&safe=off&threadm=bfd91813.0408311406.21d2bb89%40posting.google.com&rnum=1
&prev=/groups%3Fq%3Dperl%2Bgroup:google.public.web-apis%26hl%3Den%26lr%3D%26c2coff%3D1
%26safe%3Doff%26sa%3DG%26scoring%3Dd class=a>View Thread (1 article)</a>

As with the HTML example given for Google News in [Hack #54], this might be utter gobbledygook for some of you. Those of you with an understanding of HTML should see why the regular expression matching in the code below was written in the way it was.
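To see how the pieces of that markup map onto the fields the hack extracts, here is a minimal sketch that runs a pattern of the same shape against the sample result, joined onto one line with the long prev= parameter of the thread link left out. The variable names are illustrative only, and the /x layout with escaped spaces is just one way of writing the expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample search result, joined into one line (prev= parameter omitted).
my $html = join '',
    '<a href=/groups?q=perl+group:google.public.web-apis&hl=en&lr=&c2coff=1&',
    'safe=off&scoring=d&selm=bfd91813.0408311406.21d2bb89%40posting.google.com&rnum',
    '=1>queries or results ?</a><font size=-1><br> <b>...</b> ',
    'Yet when making a query, via the <b>perl</b> Net::Google module, setting ',
    'max_results to 50 works fine and returns 50 results, which was not what I had expected. ',
    '<b>...</b> <br><font color=green><a href=/groups?hl=en&',
    'lr=&c2coff=1&safe=off&group=google.public.web-apis class=a>google.public.',
    'web-apis</a> - Aug 31, 2004 by sean - <a href=/groups?hl=en&lr=&c2coff=1',
    '&safe=off&threadm=bfd91813.0408311406.21d2bb89%40posting.google.com&rnum=1',
    ' class=a>View Thread (1 article)</a>';

# Names for the seven captures, in the order they appear in the pattern.
my @names = qw(path title snippet group date author articles);

my %field;
@field{@names} = $html =~ m!
        <a\ href=(/groups[^>]+?rnum=[0-9]+)>(.+?)</a>  # link path and title
        .*?<br>(.+?)<br>                               # snippet text
        .*?<a\ href="?/groups.+?class=a>(.+?)</a>      # newsgroup name
        \ -\ (.+?)\ by\ (.+?)\s+                       # date and author
        .*?\(([0-9]+)\ article                         # thread size
    !isx or die "Pattern did not match; has the page layout changed?\n";

# Show what each capture picked up (the snippet is skipped for brevity).
printf "%-9s %s\n", $_, $field{$_} for qw(title group date author articles);
```

Each capture is anchored to a nearby piece of fixed markup (rnum=, class=a, the word "by", the opening parenthesis before the article count), which is also why a small change to the page layout is enough to stop the match.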

4.16.1. The Code

Save the following code as groups2csv.pl:

#!/usr/bin/perl
# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel.
# Usage: perl groups2csv.pl < groups.html > groups.csv

use strict;
use warnings;

# The CSV header.
print qq{"title","url","group","date","author","number of articles"\n};

# The base URL for Google Groups.
my $url = "http://groups.google.com";

# Rake in those results.
my $results = join '', <>;

# Perform a regular expression match to glean individual results.
# The /x flag lets the pattern span several lines; literal spaces are escaped.
while ( $results =~ m!
        <a\ href=(/groups[^>]+?rnum=[0-9]+)>(.+?)</a>  # result path and title
        .*?<br>(.+?)<br>                               # snippet
        .*?<a\ href="?/groups.+?class=a>(.+?)</a>      # newsgroup
        \ -\ (.+?)\ by\ (.+?)\s+                       # date and author
        .*?\(([0-9]+)\ article                         # number of articles
    !gisx ) {
    my($path, $title, $snippet, $group, $date, $author, $articles) =
        ($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
    $title =~ s!"!""!g;   # double escape " marks
    $title =~ s!<.+?>!!g; # drop all HTML tags
    print qq{"$title","$url$path","$group","$date","$author","$articles"\n};
}
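The script doubles quotation marks and strips tags in the title only. If other fields might ever carry quotes or markup, the same cleanup can be applied to every field. The following is a minimal sketch of that idea; csv_field and csv_row are hypothetical helper names, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: clean one value and wrap it in quotes for CSV output.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s!<.+?>!!g;   # drop all HTML tags
    $value =~ s!"!""!g;     # double any embedded " marks
    return qq{"$value"};
}

# Hypothetical helper: join a list of values into one CSV record.
sub csv_row {
    return join(',', map { csv_field($_) } @_) . "\n";
}

# Example record with a quoted word and a bold tag in the title.
print csv_row(
    'Re: "quoted" <b>perl</b> question',
    'google.public.web-apis',
    'Aug 31, 2004',
);
```

Tags are removed before quotes are doubled so that any quotation marks inside a tag's attributes disappear along with the tag.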

4.16.2. Running the Hack

Run the script from the command line ["How to Run the Hacks" in the Preface], specifying the Google Groups results filename that you saved earlier and the name of the CSV file that you wish to create or to which you wish to append additional results. For example, use groups.html as your input and groups.csv as your output:

$ perl groups2csv.pl < groups.html > groups.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal.

Using >> before the CSV filename appends the current set of results to the CSV file, creating it if it doesn't already exist. This is useful for combining more than one set of results, represented by more than one saved results page:

$ perl groups2csv.pl < results_1.html > results.csv
$ perl groups2csv.pl < results_2.html >> results.csv
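Because the script prints the CSV header on every run, a file built up with >> contains the header once per saved page, and overlapping pages can contribute the same row more than once. A minimal sketch of a cleanup pass follows; unique_lines is a hypothetical helper, and the sample lines are shortened stand-ins rather than real output.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first copy of every line (the header
# included), drop blank lines, and preserve the original order.
sub unique_lines {
    my %seen;
    return grep { /\S/ && !$seen{$_}++ } @_;
}

# Stand-in for the lines of a combined results.csv (shortened for the example).
my @combined = (
    '"title","url","group","date","author","number of articles"',
    '"first result","...","google.public.web-apis","Aug 31, 2004","sean","1"',
    '"title","url","group","date","author","number of articles"',
    '"first result","...","google.public.web-apis","Aug 31, 2004","sean","1"',
    '"second result","...","google.public.web-apis","May 6, 2004","tonio","4"',
);

print "$_\n" for unique_lines(@combined);
```

To apply this to a real file, read and chomp the lines from the file instead of using the hard-coded list. The approach works line by line, so it assumes that no field contains an embedded newline.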

4.16.3. The Results

Scraping the results of a search for perl group:google.public.web-apis, which finds anything mentioning the Perl programming language on the Google API's discussion forum, looks like this:

$ perl groups2csv.pl < groups.html
"title","url","group","date","author","number of articles"
"queries or results ?","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&c2coff=1&safe=off&scoring=d&selm=bfd91813.0408311406.21d2bb89%40posting.google.com&rnum=1","google.public.web-apis","Aug 31, 2004","sean","1"
...
"Re: Whats the Difference between using the API and ordinary ... ","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&c2coff=1&safe=off&scoring=d&selm=882fdb00.0405052309.44fe831b%40posting.google.com&rnum=7","google.public.web-apis","May 6, 2004","tonio","4"
...
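Since the point of sorting by date is to know where the last batch of results left off, it can help to read the CSV back and pick out the most recent date. The sketch below assumes every field is double-quoted, as in the output shown, and that dates look like "Aug 31, 2004"; date_key and split_record are hypothetical helper names.

```perl
use strict;
use warnings;

my %month = (
    Jan => 1, Feb => 2,  Mar => 3,  Apr => 4,  May => 5,  Jun => 6,
    Jul => 7, Aug => 8,  Sep => 9,  Oct => 10, Nov => 11, Dec => 12,
);

# Turn a date such as "Aug 31, 2004" into a sortable number (20040831).
# Anything that does not look like a date, such as the header, yields 0.
sub date_key {
    my ($date) = @_;
    my ($mon, $day, $year) = $date =~ /^(\w{3})\w*\s+(\d{1,2}),\s*(\d{4})$/
        or return 0;
    return 0 unless $month{$mon};
    return $year * 10000 + $month{$mon} * 100 + $day;
}

# Split one CSV record whose fields are all double-quoted,
# turning doubled "" marks back into single ones.
sub split_record {
    my ($line) = @_;
    my @fields = $line =~ /"((?:[^"]|"")*)"/g;
    s/""/"/g for @fields;
    return @fields;
}

# The header and the two records shown in the output.
my @rows = (
    q{"title","url","group","date","author","number of articles"},
    q{"queries or results ?","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&c2coff=1&safe=off&scoring=d&selm=bfd91813.0408311406.21d2bb89%40posting.google.com&rnum=1","google.public.web-apis","Aug 31, 2004","sean","1"},
    q{"Re: Whats the Difference between using the API and ordinary ... ","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&c2coff=1&safe=off&scoring=d&selm=882fdb00.0405052309.44fe831b%40posting.google.com&rnum=7","google.public.web-apis","May 6, 2004","tonio","4"},
);

# Pair each row's sortable key with its date text; keep the newest.
my ($latest) = sort { $b->[0] <=> $a->[0] }
               map  { my @f = split_record($_); [ date_key($f[3] // ''), $f[3] ] }
               @rows;

print "Most recent result: $latest->[1]\n";
```

For the two records above this reports Aug 31, 2004. Comparing that date with the newest date on a freshly saved results page gives a quick check on whether anything has been posted since the last scrape.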
