If you're running a web server or are responsible for a website, simple or complex, you'll find yourself performing some tasks with great frequency, ranging from identifying broken internal and external site links to checking for spelling errors on web pages. Using shell scripts, you can automate these tasks to great effect, as well as some common client/server tasks, such as ensuring that a remote directory of files is always completely in sync with a local copy.
The scripts in Chapter 7 highlighted the value and capabilities of the lynx text-only web browser, but there's even more power hidden within this tremendous software application. One capability that's particularly useful for a web administrator is the traverse function (which you enable by using -traversal), which causes lynx to try to step through all links on a site to see if any are broken. This feature can be harnessed in a short script.
#!/bin/sh

# checklinks - Traverses all internal URLs on a website, reporting
#   any errors in the "traverse.errors" file.

lynx="/usr/local/bin/lynx"    # this might need to be tweaked

# Remove all the lynx traversal output files upon completion:
trap "/bin/rm -f traverse*.errors reject*.dat traverse*.dat" 0

if [ -z "$1" ] ; then
  echo "Usage: checklinks URL" >&2 ; exit 1
fi

$lynx -traversal "$1" > /dev/null

if [ -s "traverse.errors" ] ; then
  echo -n $(wc -l < traverse.errors) errors encountered.
  echo Checked $(grep '^http' traverse.dat | wc -l) pages at ${1}:
  sed "s|$1||g" < traverse.errors
else
  echo -n "No errors encountered. "
  echo Checked $(grep '^http' traverse.dat | wc -l) pages at ${1}
  exit 0
fi

baseurl="$(echo $1 | cut -d/ -f3)"
mv traverse.errors ${baseurl}.errors

echo "(A copy of this output has been saved in ${baseurl}.errors)"

exit 0
The vast majority of the work in this script is done by lynx; the script just fiddles with the resultant lynx output files to summarize and display the data attractively. The lynx output file reject.dat contains a list of links pointing to external URLs (see Script #78, Reporting Broken External Links, for how to exploit this data); traverse.errors contains a list of failed, invalid links (the gist of this script); traverse.dat contains a list of all pages checked; and traverse2.dat is identical to traverse.dat except that it also includes the title of every page visited.
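If you want to poke at these files yourself, you can run the traversal by hand and inspect what lynx leaves behind. The following is just a sketch: the URL is a placeholder, and the sort -u on reject.dat simply collapses any duplicate external links lynx may record more than once.

$ mkdir -p /tmp/linkcheck && cd /tmp/linkcheck   # lynx writes its output files into the current directory
$ lynx -traversal http://www.example.com/ > /dev/null
$ grep -c '^http' traverse.dat                   # how many pages were actually visited
$ sort -u reject.dat                             # external links the traversal skipped
$ cat traverse.errors                            # failed internal links, if any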
To run this script, simply specify a URL on the command line. Because it goes out to the network, you can traverse and check any website, but beware: Checking something like Google or Yahoo! will take forever and eat up all of your disk space in the process.
First off, let's check a tiny website that has no errors:
$ checklinks http://www.ourecopass.org/
No errors encountered. Checked 4 pages at http://www.ourecopass.org/
Sure enough, all is well. How about a slightly larger site?
$ checklinks http://www.clickthrustats.com/
1 errors encountered. Checked 9 pages at http://www.clickthrustats.com/:
contactus.shtml    in privacy.shtml
(A copy of this output has been saved in www.clickthrustats.com.errors)
This means that the file privacy.shtml contains a link to contactus.shtml that cannot be resolved: The file contactus.shtml does not exist. Finally, let's check my main website to see what link errors might be lurking:
$ date ; checklinks http://www.intuitive.com/ ; date
Tue Sep 16 21:55:39 GMT 2003
6 errors encountered. Checked 728 pages at http://www.intuitive.com/:
library/f8            in library/ArtofWriting.shtml
library/f11           in library/ArtofWriting.shtml
library/f16           in library/ArtofWriting.shtml
library/f18           in library/ArtofWriting.shtml
articles/cookies/     in articles/csi-chat.html
~taylor               in articles/aol-transcript.html
(A copy of this output has been saved in www.intuitive.com.errors)
Tue Sep 16 22:02:50 GMT 2003
Notice that adding a call to date before and after a long command is a lazy way to see how long the command takes. Here you can see that checking the 728-page intuitive.com site took just over seven minutes.
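If you'd rather see the elapsed time as a single number instead of subtracting two timestamps by hand, a small wrapper along these lines works too. This is only a sketch (the timelinks name is arbitrary), and it assumes your date command supports the +%s seconds-since-epoch format:

#!/bin/sh
# timelinks - run checklinks against a URL and report how long it took, in seconds.
start=$(date +%s)
checklinks "$1"
finish=$(date +%s)
echo "Elapsed time: $((finish - start)) seconds"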
The grep statement in this script extracts the list of pages checked from traverse.dat; feeding that list to wc -l yields the number of pages examined. The actual errors are found in the traverse.errors file:
echo Checked $(grep '^http' traverse.dat | wc -l) pages at ${1}:
sed "s|$1||g" < traverse.errors
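The sed invocation just strips the base URL from each line of traverse.errors so the report shows short relative paths. Assuming traverse.errors records full URLs for both the broken link and the page that contains it (an assumption suggested by the script's use of sed's g flag), the effect looks roughly like this:

$ url="http://www.clickthrustats.com/"
$ echo "${url}contactus.shtml in ${url}privacy.shtml" | sed "s|$url||g"
contactus.shtml in privacy.shtml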
To have this script report on image (img) reference errors instead, grep the traverse.errors file for gif, jpeg, or png filename suffixes before feeding the result to the sed statement (which just cleans up the output format to make it attractive).
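Here's one way that change might look; it's a sketch, and the suffix pattern is an assumption you may want to extend for other image formats:

# Report only broken image references, instead of all broken links:
grep -i -E '\.(gif|jpe?g|png)' < traverse.errors | sed "s|$1||g"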