
#78 Reporting Broken External Links

This partner script to Script #77, Identifying Broken Internal Links, utilizes the -traversal option of lynx to generate and test a set of external links — links to other websites. When run as a traversal of a site, lynx produces a number of data files, one of which is called reject.dat. The reject.dat file contains a list of all external links, both website links and mailto: links. By iteratively trying to access each http link in reject.dat, you can quickly ascertain which sites work and which sites fail to resolve, which is exactly what this script does.
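
If you'd like to see these data files before running the script, you can invoke lynx by hand. Here's a quick sketch (the URL is just an example):

$ lynx -traversal http://www.ourecopass.org/ > /dev/null
$ grep -c '^http' traverse.dat        # pages traversed
$ sort -u reject.dat | wc -l          # external references found
$ grep '^http:' reject.dat | sort -u  # the unique external URLs the script will test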

The Code

#!/bin/sh

# checkexternal - Traverses all internal URLs on a website to build a
#   list of external references, then checks each one to ascertain
#   which might be dead or otherwise broken. The -a flag forces the
#   script to list all matches, whether they're accessible or not: by
#   default only unreachable links are shown.

lynx="/usr/local/bin/lynx"      # might need to be tweaked
listall=0; errors=0             # shortcut: two vars on one line!

if [ "$1" = "-a" ] ; then
  listall=1; shift
fi

if [ -z "$1" ] ; then
  echo "Usage: $(basename $0) [-a] URL" >&2
  exit 1
fi

outfile="$(echo "$1" | cut -d/ -f3).external-errors"

/bin/rm -f "$outfile"   # clean it for new output

trap "/bin/rm -f traverse*.errors reject*.dat traverse*.dat" 0

# Create the data files needed
$lynx -traversal "$1" > /dev/null
if [ -s "reject.dat" ] ; then
  # The trailing space inside the quotes below keeps this echo's output
  # from running into the output of the following echo.
  echo -n $(sort -u reject.dat | wc -l) "external links encountered "
  echo in $(grep '^http' traverse.dat | wc -l) pages

  for URL in $(grep '^http:' reject.dat | sort -u)
  do
    if ! $lynx -dump "$URL" > /dev/null 2>&1 ; then
      echo "Failed : $URL" >> "$outfile"
      errors="$(($errors + 1))"
    elif [ $listall -eq 1 ] ; then
      echo "Success: $URL" >> "$outfile"
    fi
  done

  if [ -s "$outfile" ] ; then
    cat "$outfile"
    echo "(A copy of this output has been saved in ${outfile})"
  elif [ $listall -eq 0 -a $errors -eq 0 ] ; then
    echo "No problems encountered."
  fi
else
  echo -n "No external links encountered ";
  echo in $(grep '^http' traverse.dat | wc -l) pages.
fi

exit 0

How It Works

This is not the most elegant script in this book. It's more of a brute-force method of checking external links, because for each external link found, the lynx command tests the validity of the link by trying to grab the contents of its URL and then discarding them as soon as they've arrived, as shown in the following block of code:

    if ! $lynx -dump "$URL" > /dev/null 2>&1 ; then
      echo "Failed : $URL" >> "$outfile"
      errors="$(($errors + 1))"
    elif [ $listall -eq 1 ] ; then
      echo "Success: $URL" >> "$outfile"
    fi

The notation 2>&1 is worth mentioning here: It causes output device #2 to be redirected to whatever output device #1 is set to. With a shell, output #2 is stderr (for error messages) and output #1 is stdout (regular output). Used alone, 2>&1 will cause stderr to go to stdout. In this instance, however, notice that prior to this redirection, stdout is already redirected to the so-called bit bucket of /dev/null (a virtual device that can be fed an infinite amount of data without ever getting any bigger. Think of a black hole, and you'll be on the right track). Therefore, this notation ensures that stderr is also redirected to /dev/null. We're throwing all of this information away because all we're really interested in is whether lynx returns a zero or nonzero return code from this command (zero indicates success; nonzero indicates an error).
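
The order of those two redirections matters, by the way. Here's a quick way to see it for yourself (any command that writes to stderr will do; ls on a nonexistent file is just a convenient example):

$ ls /no/such/file > /dev/null 2>&1   # both streams discarded; nothing is printed
$ ls /no/such/file 2>&1 > /dev/null   # stderr was duplicated to the terminal first, so the error still appears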

The number of internal pages traversed is calculated by the line count of the file traverse.dat, and the number of external links is found by looking at reject.dat. If the -a flag is specified, the output lists all external links, whether they're reachable or not; otherwise only failed URLs are displayed.
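
If lynx isn't available on your system, you can approximate the same reachability test with curl. This is just a sketch of an alternative, not part of the script above: -s silences progress output, -I sends a HEAD request rather than downloading the whole page, and -f makes curl return a nonzero exit code on HTTP errors. Be aware that a few servers refuse HEAD requests, so this variation can report the occasional false failure.

    if ! curl -sIf "$URL" > /dev/null 2>&1 ; then
      echo "Failed : $URL" >> "$outfile"
    fi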

Running the Script

To run this script, simply specify the URL of a site to check.

The Results

Let's check a simple site with a known bad link. The -a flag lists all external links, valid or not.

$ checkexternal -a http://www.ourecopass.org/
8 external links encountered in 4 pages
Failed : http://www.badlink/somewhere.html
Success: http://www.ci.boulder.co.us/goboulder/
Success: http://www.ecopass.org/
Success: http://www.intuitive.com/
Success: http://www.ridearrangers.org/
Success: http://www.rtd-denver.com/
Success: http://www.transitalliance.org/
Success: http://www.us36tmo.org/
(A copy of this output has been saved in www.ourecopass.org.external-errors)

To find the bad link, we can easily use the grep command on the set of HTML source files:

$ grep 'badlink/somewhere.html' ~ecopass/*
~ecopass/contact.html:<a href="http://www.badlink/somewhere.html">bad </a>

With a larger site, well, the program can run for a long, long time. The following took three hours to finish testing:

$ date ; checkexternal http://www.intuitive.com/ ; date
Tue Sep 16 23:16:37 GMT 2003
733 external links encountered in 728 pages
Failed : http://chemgod.slip.umd.edu/~kidwell/weather.html
Failed : http://epoch.oreilly.com/shop/cart.asp
Failed : http://ezone.org:1080/ez/
Failed : http://techweb.cmp.com/cw/webcommerce/
Failed : http://tenbrooks11.lanminds.com/
Failed : http://www.builder.cnet.com/
Failed : http://www.buzz.builder.com/
Failed : http://www.chem.emory.edu/html/html.html
Failed : http://www.truste.org/
Failed : http://www.wander-lust.com/
Failed : http://www.websitegarage.com/
(A copy of this output has been saved in www.intuitive.com.external-errors)
Wed Sep 17 02:11:18 GMT 2003

Looks as though it's time for some cleanup work!

