Script #84, Exploring the Apache access_log, can offer a broad-level overview of some of the search engine queries that point to your site, but further analysis can reveal not just which search engines are delivering traffic, but what keywords were entered by users who arrived at your site via search engines. This information can be invaluable for understanding whether your site has been properly indexed by the search engines and can provide the starting point for improving the rank and relevancy of your search engine listings.
#!/bin/sh

# searchinfo - Extracts and analyzes search engine traffic indicated in the
#    referrer field of a Common Log Format access log.

host="intuitive.com"    # change to your domain, as desired
maxmatches=20
count=0
temp="/tmp/$(basename $0).$$"

trap "/bin/rm -f $temp" 0

if [ $# -eq 0 ] ; then
  echo "Usage: $(basename $0) logfile" >&2
  exit 1
fi
if [ ! -r "$1" ] ; then
  echo "Error: can't open file $1 for analysis." >&2
  exit 1
fi

for URL in $(awk '{ if (length($11) > 4) { print $11 } }' "$1" | \
  grep -vE "(/www.$host|/$host)" | grep '?')
do
  searchengine="$(echo $URL | cut -d/ -f3 | rev | cut -d. -f1-2 | rev)"
  args="$(echo $URL | cut -d\? -f2 | tr '&' '\n' | \
     grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
     sed -e 's/+/ /g' -e 's/%20/ /g' -e 's/"//g' | cut -d= -f2)"
  if [ ! -z "$args" ] ; then
    echo "${searchengine}: $args" >> $temp
  else
    # No well-known match, show entire GET string instead...
    echo "${searchengine} $(echo $URL | cut -d\? -f2)" >> $temp
  fi
  count="$(( $count + 1 ))"
done

echo "Search engine referrer info extracted from ${1}:"

sort $temp | uniq -c | sort -rn | head -$maxmatches | sed 's/^/ /g'

echo ""
echo Scanned $count entries in log file out of $(wc -l < "$1") total.

exit 0
The main for loop of this script extracts all entries in the log file that have a valid referrer with a string length greater than 4, a referrer domain that does not match the $host variable, and a ? in the referrer string (indicating that a user search was performed):
for URL in $(awk '{ if (length($11) > 4) { print $11 } }' "$1" | \
  grep -vE "(/www.$host|/$host)" | grep '?')
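To see what each of the three filters contributes, the pipeline can be tried on a handful of log lines. The following is only an illustrative sketch: the four sample entries and the helper names filter_referrers and sample_log are invented here, and the entries assume a log whose 11th whitespace-separated field is the quoted referrer.

```shell
#!/bin/sh
# Sketch: which referrers survive the three filters used by the for loop.
# The log lines and function names below are invented for illustration.

host="intuitive.com"

filter_referrers()
{
  # Same pipeline as the script, reading the log from standard input.
  awk '{ if (length($11) > 4) { print $11 } }' | \
    grep -vE "(/www.$host|/$host)" | grep '?'
}

sample_log()
{
  cat << 'EOF'
10.0.0.1 - - [11/Oct/2003:10:00:00 -0700] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?q=cool+web+pages" "Mozilla/4.0"
10.0.0.2 - - [11/Oct/2003:10:00:01 -0700] "GET /a.html HTTP/1.1" 200 512 "http://www.intuitive.com/index.html?x=1" "Mozilla/4.0"
10.0.0.3 - - [11/Oct/2003:10:00:02 -0700] "GET /b.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"
10.0.0.4 - - [11/Oct/2003:10:00:03 -0700] "GET /c.html HTTP/1.1" 200 512 "http://example.org/links.html" "Mozilla/4.0"
EOF
}

sample_log | filter_referrers
```

Only the Google referrer survives. The second entry is dropped because it comes from the site itself, the third because its referrer is just "-", and the fourth because it contains no ?.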
The script then goes through various steps in the ensuing lines to identify the domain name of the referrer and the search value entered by the user:
searchengine="$(echo $URL | cut -d/ -f3 | rev | cut -d. -f1-2 | rev)"
args="$(echo $URL | cut -d\? -f2 | tr '&' '\n' | \
   grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
   sed -e 's/+/ /g' -e 's/%20/ /g' -e 's/"//g' | cut -d= -f2)"
An examination of hundreds of search queries shows that common search sites use a small number of common variable names. For example, search on Yahoo.com and your search string is p=pattern. Google and MSN use q as the search variable name. The grep invocation contains p, q, and the other most common search variable names.
The last line of this pipeline cleans up the resulting search patterns: the sed invocation replaces + and %20 sequences with spaces and strips out quotes, and the cut command then returns the second =-delimited field, that is, the text that follows the first equal sign (=) up to any later one. In other words, just the search terms.
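Tracing one referrer through both pipelines can make the individual steps easier to follow. This is a minimal sketch rather than part of the script: the function name parse_referrer and the two sample URLs are made up for the example, and the URL is quoted when echoed so that the shell does not try to expand the ? as a wildcard.

```shell
#!/bin/sh
# Sketch: apply the two pipelines to a single, hand-made referrer value.
# parse_referrer is a hypothetical helper name, not part of searchinfo.

parse_referrer()
{
  URL="$1"

  # Field 3 when splitting on / is the hostname; the rev | cut | rev
  # sandwich keeps only its last two dot-separated labels.
  searchengine="$(echo "$URL" | cut -d/ -f3 | rev | cut -d. -f1-2 | rev)"

  # Split the query string into one name=value pair per line, keep the
  # pairs with a known search variable name, tidy them up, and drop the name.
  args="$(echo "$URL" | cut -d\? -f2 | tr '&' '\n' | \
     grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
     sed -e 's/+/ /g' -e 's/%20/ /g' -e 's/"//g' | cut -d= -f2)"

  echo "${searchengine}: $args"
}

parse_referrer '"http://www.google.com/search?hl=en&q=cool+web+pages&btnG=Search"'
parse_referrer '"http://search.yahoo.com/search?p=unix+hogs&ei=UTF-8"'
```

The two calls print google.com: cool web pages and yahoo.com: unix hogs, respectively.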
The conditional immediately following these lines tests to see if the args variable is empty or not. If it is (that is, if the query format isn't a known format), then it's a search engine we haven't seen, so we output the entire pattern rather than a cleaned-up pattern-only value.
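To run this script, simply specify the name of an Apache or other Common Log Format log file on the command line. Because the script reads the referrer from the 11th whitespace-separated field, the log must record referrers, as Apache's combined log format does.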
Speed warning!
This is one of the slowest scripts in this book, because it's spawning lots and lots of subshells to perform various tasks, so don't be surprised if it takes a while to run.
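If the running time becomes a problem, one possible alternative is to do the per-entry work inside a single awk process, so that nothing is spawned for each log line. The sketch below is not the script from this section, only a rough equivalent under stated assumptions: the function name extract_searches and the sample log lines are invented, it prints one line per matching name=value pair, and it keeps everything after the first = rather than a single cut field.

```shell
#!/bin/sh
# Sketch: roughly the same extraction done in one awk process.
# extract_searches is a hypothetical name; the log lines are invented.

host="intuitive.com"

extract_searches()
{
  awk -v host="$host" '
    length($11) > 4 {
      ref = $11
      gsub(/"/, "", ref)                      # drop the surrounding quotes
      if (index(ref, "?") == 0) next          # no query string at all
      if (index(ref, "/www." host) || index(ref, "/" host)) next  # own site

      split(ref, parts, "/")                  # parts[3] is the hostname
      n = split(parts[3], labels, "[.]")
      engine = (n >= 2) ? (labels[n-1] "." labels[n]) : parts[3]

      query = substr(ref, index(ref, "?") + 1)
      m = split(query, pairs, "&")
      found = 0
      for (i = 1; i <= m; i++) {
        if (pairs[i] ~ /^(q|sid|p)=|(query|item|ask|name|topic)=/) {
          terms = substr(pairs[i], index(pairs[i], "=") + 1)
          gsub(/[+]|%20/, " ", terms)         # plus signs and %20 become spaces
          print engine ": " terms
          found = 1
        }
      }
      # No well-known variable name: show the whole query string instead.
      if (!found) print engine " " query
    }' "$@"
}

# Demo on three invented log entries; pass a log file name instead of the
# here document to analyze a real log.
extract_searches << 'EOF' | sort | uniq -c | sort -rn
10.0.0.1 - - [11/Oct/2003:10:00:00 -0700] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?hl=en&q=cool+web+pages" "Mozilla/4.0"
10.0.0.2 - - [11/Oct/2003:10:00:01 -0700] "GET / HTTP/1.1" 200 512 "http://search.msn.com/results.asp?q=custer" "Mozilla/4.0"
10.0.0.3 - - [11/Oct/2003:10:00:02 -0700] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?q=cool+web+pages" "Mozilla/4.0"
EOF
```

The sort and uniq stages still run once each at the end, which is cheap compared with launching several processes for every referrer.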
$ searchinfo /web/logs/intuitive/access_log
Search engine referrer info extracted from /web/logs/intuitive/access_log:
    19 msn.com: little big horn
    14 msn.com: custer
    11 google.com: cool web pages
    10 msn.com: plains
     9 msn.com: Little Big Horn
     9 google.com: html 4 entities
     6 msn.com: Custer
     4 msn.com: the plains indians
     4 msn.com: little big horn battlefield
     4 msn.com: Indian Wars
     4 google.com: newsgroups
     3 yahoo.com: cool web pages
     3 ittoolbox.com i=1186"
     3 google.it: jungle book kipling plot
     3 google.com: cool web graphics
     3 google.com: colored bullets CSS
     2 yahoo.com: unix%2Bhogs
     2 yahoo.com: cool HTML tags
     2 msn.com: www.custer.com

Scanned 466 entries in log file out of 11406 total.
You can tweak this script in a variety of ways to make it more useful. One obvious one is to skip the referrer URLs that are (most likely) not from search engines. To do so, simply comment out the else clause in the following passage:
if [ ! -z "$args" ] ; then
  echo "${searchengine}: $args" >> $temp
else
  # No well-known match, show entire GET string instead...
  echo "${searchengine} $(echo $URL | cut -d\? -f2)" >> $temp
fi
To be fair, ex post facto analysis of search engine traffic is difficult. Another way to approach this task would be to search for all hits coming from a specific search engine, entered as the second command argument, and then to compare the search strings specified. The core for loop would change, but, other than a slight tweak to the usage message, the script would be identical to the searchinfo script:
for URL in $(awk '{ if (length($11) > 4) { print $11 } }' "$1" | \
  grep $2)
do
  args="$(echo $URL | cut -d\? -f2 | tr '&' '\n' | \
     grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
     cut -d= -f2)"
  echo $args | sed -e 's/+/ /g' -e 's/"//g' >> $temp
  count="$(($count + 1))"
done
The results of this new version, given google.com as an argument, are as follows:
$ enginehits /web/logs/intuitive/access_log google.com
Search engine referrer info extracted google searches from /web/logs/intuitive/access_log:
    13 cool web pages
    10
     9 html 4 entities
     4 newsgroups
     3 solaris 9
     3 jungle book kipling plot
     3 intuitive
     3 cool web graphics
     3 colored bullets CSS
     2 sun solaris operating system reading material
     2 solaris unix
     2 military weaponry
     2 how to add program to sun solaris menu
     2 dynamic html border
     2 Wallpaper Nikon
     2 HTML for heart symbol
     2 Cool web pages
     2 %22Military weaponry%22
     1 www%2fvoices.com
     1 worst garage door opener
     1 whatis artsd
     1 what%27s meta tag

Scanned 232 google entries in log file out of 11481 total.
If most of your traffic comes from a few search engines, you could analyze those engines separately and then list all traffic from other search engines at the end of the output.
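One way this might look is sketched below. It is only an outline under a few assumptions: it works on a file of lines in the same "engine: terms" form that searchinfo writes to $temp, and the function name report_by_engine and the list in the engines variable are hypothetical choices made for the example.

```shell
#!/bin/sh
# Sketch: report the main engines one at a time, then everything else.
# Input lines look like "google.com: cool web pages", as written to $temp.
# report_by_engine and the engines list are illustrative assumptions.

engines="google.com msn.com yahoo.com"

report_by_engine()
{
  results="$1"
  pattern=""

  for engine in $engines
  do
    echo "Searches from ${engine}:"
    # The engine name is followed by ": " or, for unrecognized queries, " ".
    grep "^${engine}[: ]" "$results" | sort | uniq -c | sort -rn
    echo ""
    # Build up an alternation of all listed engines for the final pass.
    pattern="${pattern}${pattern:+|}^${engine}[: ]"
  done

  echo "Searches from other referrers:"
  grep -vE "$pattern" "$results" | sort | uniq -c | sort -rn
}

# Possible use inside searchinfo, in place of the single sort pipeline:
#   report_by_engine "$temp"
```

As in the original script, the dots in the engine names are treated as regular expression wildcards, which is harmless for this purpose.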