Hack 35. SafeSearch Certify URLs
Feed URLs into Google's SafeSearch to determine whether they point at questionable content.
Only three things in life are certain: death, taxes, and accidentally
visiting a once family-safe web site that now contains text and
images that would make a horse blush.
As you probably know if you've ever put up a web
site, domain names are registered for finite lengths of time.
Sometimes registrations accidentally expire; sometimes businesses
fold and allow the registrations to expire; sometimes other companies
take them over.
Some new owners simply want the domain name, some want the traffic
that the defunct site generated, and in a few cases they try to hold
the name hostage, offering to sell it back to the original owners
for a great deal of money. (This doesn't work as well
as it used to, given the dearth of Internet companies that actually
have a great deal of money.)
When a site isn't what it once was,
that's no big deal. When it's not
what it once was and is now X-rated, that's a bigger
deal. When it's not what it once was, is now
X-rated, and is on the link list of a site you run,
that's a really big deal.
But how to keep up with all the links? You can visit each link
periodically to determine if it's still okay, you
can wait for hysterical emails from site visitors, or you can just
not worry about it. Or you can put the Google API to work.
This program lets you check a list of URLs against
Google's SafeSearch mode. If they appear in
SafeSearch results, they're probably okay. If they
don't appear, they're either not in
Google's index or not
"safe" enough to pass through
Google's filter. The program then rechecks the URLs
missing from the SafeSearch results with a nonfiltered search. If they
don't appear in a nonfiltered search either,
they're labeled "unindexed." If they do
appear, they're labeled
"suspect."
2.17.1. Danger, Will Robinson!
While Google's SafeSearch filter is good,
it's not infallible. (I have yet to see an automated
filtering system that is infallible.) So if you run a list of URLs
through this hack and they all show up in a SafeSearch query,
don't take that as a guarantee that
they're all completely inoffensive. Take it merely
as a pretty good indication that they are. If you want absolute
assurance, you're going to have to visit every link
personally and frequently.
Here's a fun idea if you need an Internet-related
research project. Take 500 or so domain names at random and run this
program on the list once a week for several months, saving the
results to a file each time. It'd be interesting to
see how many domains/URLs end up being filtered out of SafeSearch
over time.
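One way to set up that experiment is a small wrapper that runs suspect.pl against a fixed list of URLs and saves each run to a dated file. The following is only a sketch; the script name weekly.pl, the file layout (suspect.pl and urls.txt in the current directory), and the output naming scheme are all assumptions:

```perl
#!/usr/local/bin/perl
# weekly.pl - sketch of a once-a-week logger for the research
# project above. Assumes suspect.pl and urls.txt live in the
# current directory.
use strict;
use POSIX qw(strftime);

# Build a dated output filename, e.g. "safesearch-2003-08-14.csv".
my $out = "safesearch-" . strftime("%Y-%m-%d", localtime) . ".csv";

# Only shell out if the pieces are actually in place.
if (-e "suspect.pl" and -e "urls.txt") {
    system(qq{perl suspect.pl < urls.txt > $out}) == 0
        or warn "suspect.pl exited abnormally: $?\n";
}
print "$out\n";
```

Schedule it with cron (or the Windows Task Scheduler) and you'll accumulate one CSV per week to compare.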
2.17.2. The Code
Save the following Perl source code as a text file named
suspect.pl:
#!/usr/local/bin/perl
# suspect.pl
# Feed URLs to a Google SafeSearch. If inurl: returns results, the
# URL probably isn't questionable content. If inurl: returns no
# results, either it points at questionable content or isn't in
# the Google index at all.

# Your Google API developer's key.
my $google_key = 'put your key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

use strict;
use SOAP::Lite;

$|++; # turn off output buffering

my $google_search = SOAP::Lite->service("file:$google_wdsl");

# CSV header
print qq{"url","safe/suspect/unindexed"\n};

while (my $url = <>) {
  chomp $url;
  $url =~ s!^\w+?://!!;   # strip the protocol (http://, ftp://, ...)
  $url =~ s!^www\.!!;     # strip any leading www.

  # SafeSearch (the seventh argument, "true", turns the filter on)
  my $results = $google_search->doGoogleSearch(
    $google_key, "inurl:$url", 0, 10, "false", "", "true",
    "", "latin1", "latin1"
  );

  print qq{"$url",};
  # \Q...\E quotes regex metacharacters (., ?, and so on) in the URL
  if (grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}}) {
    print qq{"safe"\n};
  }
  else {
    # unSafeSearch (the seventh argument, "false", turns the filter off)
    my $results = $google_search->doGoogleSearch(
      $google_key, "inurl:$url", 0, 10, "false", "", "false",
      "", "latin1", "latin1"
    );
    # Unsafe or unindexed?
    print(
      (scalar grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}})
        ? qq{"suspect"\n}
        : qq{"unindexed"\n}
    );
  }
}
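For reference, here's how the positional arguments to doGoogleSearch line up; the seventh, safeSearch, is the one this hack flips between "true" and "false". The key and URL below are placeholders:

```perl
use strict;

my $google_key = 'put your key here';   # placeholder
my $url        = 'example.com';         # placeholder

# Positional parameters to doGoogleSearch, in the order the Google
# Web APIs expect them:
my @params = (
    $google_key,    # key
    "inurl:$url",   # q - the query
    0,              # start - offset of the first result
    10,             # maxResults - up to 10 per request
    "false",        # filter - near-duplicate filtering
    "",             # restrict - country/topic restrict (none)
    "true",         # safeSearch - "true" is on, "false" is off
    "",             # lr - language restrict (none)
    "latin1",       # ie - input encoding
    "latin1",       # oe - output encoding
);
print scalar(@params), " parameters\n";
```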
2.17.3. Running the Hack
To run the hack, you'll need a text file that
contains the URLs that you want to check, one line per URL. For
example:
http://www.oreilly.com/catalog/essblogging/
http://www.xxxxxxxxxx.com/preview/home.htm
hipporhinostricow.com
The program runs from the command line ["How to Run
the Hacks" in the Preface]. Enter the name of the
script, a less-than sign, and the name of the text file that contains
the URLs that you want to check. The program will return results that
look like this:
% perl suspect.pl < urls.txt
"url","safe/suspect/unindexed"
"oreilly.com/catalog/essblogging/","safe"
"xxxxxxxxxx.com/preview/home.htm","suspect"
"hipporhinostricow.com","unindexed"
The first item is the URL being checked, and the second is
its probable safety rating:
- safe: The URL appeared in a Google SafeSearch query for the URL.
- suspect: The URL did not appear in a SafeSearch query but did in an unfiltered search.
- unindexed: The URL appeared in neither a SafeSearch query nor an unfiltered search.
You can redirect output from the script to a file for import into a
spreadsheet or database:
% perl suspect.pl < urls.txt > urls.csv
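Once you have a CSV, a few more lines of Perl will summarize it. The following sketch (the name tally.pl is hypothetical) counts each rating in sample data matching the run shown above; to use it for real, read the lines from urls.csv instead of the built-in sample:

```perl
#!/usr/local/bin/perl
# tally.pl - count how many URLs fell into each category.
# The sample data is the run shown earlier; for real use, read
# the lines from urls.csv instead.
use strict;

my $csv = <<'CSV';
"url","safe/suspect/unindexed"
"oreilly.com/catalog/essblogging/","safe"
"xxxxxxxxxx.com/preview/home.htm","suspect"
"hipporhinostricow.com","unindexed"
CSV

my %count;
for my $line (split /\n/, $csv) {
    next if $line =~ /^"url"/;    # skip the CSV header row
    my ($rating) = $line =~ /","(safe|suspect|unindexed)"$/;
    $count{$rating}++ if $rating;
}
print qq{$_: $count{$_}\n} for sort keys %count;
```

On the sample data, this prints one line per category (safe: 1, suspect: 1, unindexed: 1).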
2.17.4. Hacking the Hack
You can use this hack interactively, feeding it URLs one at a time.
Invoke the script with perl suspect.pl, but
don't feed it a text file of URLs to check. Enter a
URL and hit the return key on your keyboard. The script will reply in
the same manner that it does when fed multiple URLs. This is handy
when you just need to spot-check a couple of URLs on the command
line. When you're ready to quit, send an
end-of-file: Ctrl-D under Unix, or Ctrl-Z followed by Enter at a
Windows command prompt.
Here's a transcript of an interactive session with
suspect.pl:
% perl suspect.pl
"url","safe/suspect/unindexed","title"
http://www.oreilly.com/catalog/essblogging/
"oreilly.com/catalog/essblogging/","safe"
http://www.xxxxxxxxxx.com/preview/home.htm
"xxxxxxxxxx.com/preview/home.htm","suspect"
hipporhinostricow.com
"hipporhinostricow.com","unindexed"
^d
%