This script, webspell, is an amalgamation of ideas presented in earlier scripts, particularly Script #27, Adding a Local Dictionary to Spell, which demonstrates how to interact with the aspell spelling utility and how to filter its reported misspellings through your own list of additional acceptable words. It relies on the lynx program to pull all the text out of the HTML of a page, either local or remote, and then feeds the resultant text to aspell or an equivalent spelling program.
#!/bin/sh

# webspell - Uses the spell feature + lynx to spell-check either a
#   web page URL or a file.

# Inevitably you'll find that there are words it flags as wrong but
# you think are fine. Simply save them in a file, one per line, and
# ensure that 'okaywords' points to that file.

okaywords="$HOME/bin/.okaywords"
tempout="/tmp/webspell.$$"
trap "/bin/rm -f $tempout" 0

if [ $# -eq 0 ] ; then
  echo "Usage: webspell file|URL" >&2; exit 1
fi

for filename
do
  if [ ! -f "$filename" -a "$(echo $filename|cut -c1-7)" != "http://" ]
  then
    continue    # picked up directory in '*' listing
  fi

  lynx -dump $filename | tr ' ' '\n' | sort -u | \
    grep -vE "(^[^a-z]|')" | \
    # Adjust the following line to produce just a list of misspelled words
    ispell -a | awk '/^\&/ { print $2 }' | \
    sort -u > $tempout

  if [ -r $okaywords ] ; then
    # If you have an okaywords file, screen okay words out
    grep -vif $okaywords < $tempout > ${tempout}.2
    mv ${tempout}.2 $tempout
  fi

  if [ -s $tempout ] ; then
    echo "Probable spelling errors: ${filename}"
    cat $tempout | paste - - - - | sed 's/^/  /'
  fi
done

exit 0
Using the helpful lynx command, this script extracts just the text from each of the specified pages and then feeds the result to a spell-checking program (ispell in this example, though it works just as well with aspell or another spelling program). See Script #25, Checking the Spelling of Individual Words, for more information about different spell-checking options in Unix.
Notice the file existence test in this script too:
if [ ! -f "$filename" -a "$(echo $filename|cut -c1-7)" != "http://" ]
It can't just fail if the given name isn't readable, because $filename might actually be a URL, so the test becomes rather complex. However, when referencing filenames, the script can work properly with invocations like webspell *, though you'll get better results with a filename wildcard that matches only HTML files. Try webspell *html instead.
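The same decision can also be written with a case pattern match, which avoids the echo and cut subshell and makes it easy to accept https:// addresses as well. This is only a sketch: the helper name is_checkable is hypothetical and not part of webspell, and whether your lynx build can fetch https pages is an assumption.

```shell
# is_checkable (hypothetical helper): succeed if the argument looks like a
# URL or names an existing regular file; fail otherwise (e.g. a directory).
is_checkable() {
  case "$1" in
    http://*|https://*) return 0 ;;   # looks like a URL; let lynx fetch it
  esac
  [ -f "$1" ]                         # otherwise it must be a regular file
}

# Try it on a URL, a directory, and a file that does not exist.
for candidate in "http://www.example.com/" / /no/such/file.html
do
  if is_checkable "$candidate" ; then
    echo "check: $candidate"
  else
    echo "skip:  $candidate"
  fi
done
```

Inside webspell, the loop body would then start with something like `is_checkable "$filename" || continue`.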
Whichever spell-checking program you use, you'll need to ensure that the result of the following line is a list only of misspelled words, with none of the spell-checking utility's special formatting included:
ispell -a | awk '/^\&/ { print $2 }' | \
This spell line is one part of a fairly complex pipeline that extracts the text from the page, translates it to one word per line (the tr invocation), sorts the words, and ensures that each one appears only once (sort -u). After the sort operation, grep -vE screens out every line that doesn't begin with a lowercase letter (that is, punctuation, numbers, capitalized words, and any stray markup) as well as any word that contains an apostrophe. The next line of the pipe runs the data stream through the spell utility, using awk to extract the misspelled word from the oddly formatted ispell output. The results are run through another sort -u invocation, screened against the okaywords list with grep, and formatted for attractive output with paste (which produces four words per line in this instance).
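Both halves of the pipeline can be tried without lynx or a spell checker installed. In the sketch below, the page text and the checker output are made-up sample data, and the function names candidate_words and extract_misses are hypothetical. The awk pattern is a slight variation on the one in the script: in -a mode ispell marks a miss with & when it has suggestions, with ? when it has only guesses, and with # when it has no suggestions at all, so matching all three avoids silently dropping words the checker could not guess at.

```shell
# Stand-in for "lynx -dump" output (made-up sample text).
page_text="Contact us today: don't forget the werd list, the WebSpell
page and the werd count 42 times"

# Stage 1: one word per line, de-duplicated, then drop anything that does
# not start with a lowercase letter or that contains an apostrophe.
candidate_words() {
  tr ' ' '\n' | sort -u | grep -vE "(^[^a-z]|')"
}

# Canned lines in the shape "ispell -a" emits (illustrative sample data,
# not captured from a real run): a banner, "*" for each correct word,
# "& word count offset: guesses" and "# word offset" for misses.
spell_output='@(#) International Ispell Version 3.x (sample banner)
*
& werd 3 0: weird, wend, word
*
# urlwire 0
'

# Stage 2: keep the second field (the original word) of every miss line.
extract_misses() {
  awk '/^[&?#]/ { print $2 }'
}

echo "Words that would reach the spell checker:"
echo "$page_text" | candidate_words

echo "Misspellings extracted from the sample checker output:"
printf '%s' "$spell_output" | extract_misses
```

Trailing punctuation stays attached to some words after the first stage ("list," and "today:" in this sample); the spell checker does its own word splitting, so the filter only needs to be approximate.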
This script can be given one or more web page URLs or a list of HTML files. To check the spelling of all source files in the current directory, for example, use *.html as the argument.
$ webspell http://www.clickthrustats.com/index.shtml *.html
Probable spelling errors: http://www.clickthrustats.com/index.shtml
  cafepress     microurl        signup  urlwire
Probable spelling errors: 074-contactus.html
  webspell      werd
In this case, the script checked a web page on the network from the ClickThruStats.com site and five local HTML pages, finding the errors shown.
It would be a simple change to have webspell invoke the shpell utility presented in Script #26, but it can be dangerous correcting very short words that might overlap phrases or content of an HTML tag, JavaScript snippet, and so forth, so some caution is probably in order.
Also worth considering, if you're obsessed with avoiding any misspellings creeping into your website, is this: With a combination of correcting genuine misspellings and adding valid words to the okaywords file, you can reduce the output of webspell to nothing and then drop it into a weekly cron job to catch and report misspellings automatically.
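One way to set up the weekly job is a small wrapper that stays silent when there is nothing to report, since cron typically mails a job's output to its owner only when there is some. The sketch below is an assumption-laden example rather than part of the original script: the wrapper name weeklyspell, the helper report_if_any, the schedule, and the $HOME/public_html location are all hypothetical.

```shell
#!/bin/sh
# weeklyspell (hypothetical wrapper) - run a checker and print a dated
# report only if it produced output, so a clean site generates no mail.
#
# Example crontab entry (Mondays at 06:00), assuming this file is saved
# as $HOME/bin/weeklyspell:
#   0 6 * * 1 $HOME/bin/weeklyspell

report_if_any() {
  # Run the command given as arguments, capturing stdout and stderr so
  # that fetch errors show up in the report too.
  out=$("$@" 2>&1)
  if [ -n "$out" ] ; then
    echo "Spelling report for $(date +%Y-%m-%d):"
    echo "$out"
  fi
}

# Only attempt the real run when webspell is actually on the PATH.
if command -v webspell >/dev/null 2>&1 ; then
  report_if_any webspell "$HOME"/public_html/*.html
fi
```

Because webspell itself prints nothing for a clean page, the wrapper mainly adds the dated header; calling webspell directly from the crontab entry would work as well.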