Team LiB

#65 Digging Up Movie Info from IMDb

This hack demonstrates a more sophisticated use of Internet access through lynx and a shell script: it searches the Internet Movie Database website (http://www.imdb.com/) for films that match a specified pattern. What makes the script interesting is that it must handle two different formats of returned information. If the search pattern matches more than one movie, moviedata returns a list of possible titles, but if there's exactly one match, it returns the information about that specific film.

As a result, the script must cache the returned page and then check whether it contains a list of matches or the summary of a single film before extracting the appropriate information.

The Code

#!/bin/sh

# moviedata - Given a movie title, returns a list of matches, if
#   there's more than one, or a synopsis of the movie if there's
#   just one. Uses the Internet Movie Database (imdb.com).

imdburl="http://us.imdb.com/Tsearch?restrict=Movies+only&title="
titleurl="http://us.imdb.com/Title?"
tempout="/tmp/moviedata.$$"

summarize_film()
{
   # Produce an attractive synopsis of the film

   grep "^<title>" $tempout | sed 's/<[^>]*>//g;s/(more)//'
   grep '<b class="ch">Plot Outline:</b>' $tempout | \
     sed 's/<[^>]*>//g;s/(more)//;s/(view trailer)//' |fmt|sed 's/^/ /'
   exit 0
}

trap "rm -f $tempout" 0 1 15

if [ $# -eq 0 ] ; then
  echo "Usage: $0 {movie title | movie ID}" >&2
  exit 1
fi

fixedname="$(echo $@ | tr ' ' '+')"     # for the URL

if [ $# -eq 1 ] ; then
  nodigits="$(echo $1 | sed 's/[[:digit:]]*//g')"
  if [ -z "$nodigits" ] ; then
    lynx -source "$titleurl$fixedname" > $tempout
    summarize_film
  fi
fi

url="$imdburl$fixedname"

lynx -source "$url" > $tempout

if [ ! -z "$(grep "IMDb title search" $tempout)" ] ; then
  grep 'HREF="/Title?' $tempout | \
    sed 's/<OL><LI><A HREF="//;s/<\/A><\/LI>//;s/<LI><A HREF="//' | \
    sed 's/">/ -- /;s/<.*//;s/\/Title?//' | \
    sort -u | \
    more
else
  summarize_film
fi

exit 0

How It Works

This script builds a different URL depending on whether the command argument specified is a film name or an IMDb film ID number, and then it saves the lynx output from the web page to the $tempout file.
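The film-ID test can also be written without sed. The following is a minimal sketch, not the script's own method; is_film_id is a hypothetical helper name, and it assumes that any argument made up entirely of digits should be treated as an IMDb ID:

```shell
#!/bin/sh
# is_film_id: succeed if the argument consists only of digits.
# (Hypothetical helper; the script itself strips digits with sed
# and tests whether anything is left.)
is_film_id()
{
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    *)           return 0 ;;   # digits only
  esac
}

for arg in 0056172 "monsoon wedding" 2001a ; do
  if is_film_id "$arg" ; then
    echo "$arg: film ID"
  else
    echo "$arg: title search"
  fi
done
```

One side effect of either approach is that a film whose title is purely numeric will be looked up as an ID rather than searched for by name.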

If the command argument is a film name, the script then examines $tempout for the string "IMDb title search" to see whether the file contains a list of film names (when more than one movie matches the search criteria) or the description of a single film. Using a complex series of sed substitutions that rely on the source code organization of the IMDb site, it then displays the output appropriately for each of those two possible cases.
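The branch between the two page formats can be sketched in isolation. The example below is illustrative only: page_kind is a hypothetical helper, the two files are tiny stand-ins rather than real IMDb pages, and mktemp -d is assumed to be available on your system. It uses grep -q, which reports a match through its exit status instead of capturing output.

```shell
#!/bin/sh
# page_kind: print "list" if the saved page looks like a search-results
# page, "single" otherwise. The marker string is the one the script
# looks for; the function name is hypothetical.
page_kind()
{
  if grep -q "IMDb title search" "$1" ; then
    echo "list"
  else
    echo "single"
  fi
}

# Demonstration with two stand-in files (not real IMDb pages).
demo_dir="$(mktemp -d)"
printf '<title>IMDb title search</title>\n' > "$demo_dir/results.html"
printf '<title>Lawrence of Arabia (1962)</title>\n' > "$demo_dir/film.html"

page_kind "$demo_dir/results.html"    # prints: list
page_kind "$demo_dir/film.html"       # prints: single

rm -rf "$demo_dir"
```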

Running the Script

Though short, this script is quite flexible with input formats: you can specify a film title in quotes or as separate words. If more than one match is returned, you can then specify the seven-digit IMDb ID value to select a specific match.
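Quoted and unquoted titles behave the same because every argument is joined and the spaces are turned into plus signs before the URL is built. A small sketch of that step follows; encode_title is a hypothetical name, and it uses "$*" where the script uses an unquoted $@:

```shell
#!/bin/sh
# encode_title: join all arguments with spaces, then replace each
# space with "+" so the result can be appended to a search URL.
# (Hypothetical helper name; only spaces are handled, as in the script.)
encode_title()
{
  echo "$*" | tr ' ' '+'
}

encode_title lawrence of arabia       # prints: lawrence+of+arabia
encode_title "lawrence of arabia"     # prints: lawrence+of+arabia
```

Characters other than spaces, such as an ampersand in a title, are passed through unchanged and may need additional escaping.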

The Results

$ moviedata lawrence of arabia
0056172 -- Lawrence of Arabia (1962)
0099356 -- Dangerous Man: Lawrence After Arabia, A (1990) (TV)
0194547 -- With Allenby in Palestine and Lawrence in Arabia (1919)
0245226 -- Lawrence of Arabia (1935)
0363762 -- Lawrence of Arabia: A Conversation with Steven Spielberg (2000) (V)
0363791 -- Making of 'Lawrence of Arabia', The (2000) (V)
$ moviedata 0056172
Lawrence of Arabia (1962)
  Plot Outline: British lieutenant T.E. Lawrence rewrites the political
  history of Saudi Arabia.
$ moviedata monsoon wedding
Monsoon Wedding (2001)
  Plot Outline: A stressed father, a bride-to-be with a secret, a
  smitten event planner, and relatives from around the world create
  much ado about the preparations for an arranged marriage in India.

Hacking the Script

The most obvious hack to this script would be to get rid of the ugly IMDb movie ID numbers, which are unfriendly and prone to mistyping. It would be straightforward to hide the IDs and have the shell script output a simple menu with unique index values (e.g., 1, 2, 3) that can then be typed in to select a particular film.
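One way the menu idea might be approached is sketched below. It assumes the match list has been saved to a file in the "ID -- Title" format shown in the results; show_menu and id_for_choice are hypothetical helper names, and the sample data is taken from the output above.

```shell
#!/bin/sh
# Sketch: number the "ID -- Title" lines and map a chosen index
# back to its IMDb ID. Helper names are hypothetical.

show_menu()
{
  # $1 = matches file. Prints "1) Title", "2) Title", ... without IDs.
  sed 's/^[0-9]* -- //' "$1" | awk '{ printf "%d) %s\n", NR, $0 }'
}

id_for_choice()
{
  # $1 = matches file, $2 = menu index. Prints the ID on that line.
  sed -n "${2}p" "$1" | sed 's/ -- .*//'
}

# Sample data copied from the results shown earlier.
matches="$(mktemp)"
cat > "$matches" <<'EOF'
0056172 -- Lawrence of Arabia (1962)
0245226 -- Lawrence of Arabia (1935)
EOF

show_menu "$matches"
id_for_choice "$matches" 2     # prints: 0245226
rm -f "$matches"
```

In an interactive version, the index would come from a read statement and should be validated as a number within range before being handed to id_for_choice; the resulting ID could then be passed back to moviedata.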

A problem with this script, as with most scripts that scrape values from a third-party website, is that if IMDb changes its page layout, the script will break and you'll need to rework its grep and sed sequences. It's a lurking bug, but with a site like IMDb that hasn't changed in years, probably not a dramatic or dangerous one.
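A cheap safeguard is to have the script complain when the saved page contains neither of the markers it depends on, rather than silently printing nothing. The following is a rough sketch under that assumption; layout_ok is a hypothetical helper, and the two marker strings are the ones the script already searches for.

```shell
#!/bin/sh
# layout_ok: succeed if the saved page contains one of the strings the
# script relies on; otherwise warn on stderr and fail.
# (Hypothetical helper; the markers are the ones used by moviedata.)
layout_ok()
{
  if grep -q "IMDb title search" "$1" ; then return 0 ; fi
  if grep -q "Plot Outline:" "$1" ; then return 0 ; fi
  echo "moviedata: unrecognized page format; IMDb may have changed its layout" >&2
  return 1
}

# Demonstration with a stand-in file (not a real IMDb page).
page="$(mktemp)"
printf '<html><body>Something unexpected</body></html>\n' > "$page"
layout_ok "$page" || echo "check failed"
rm -f "$page"
```

A film page that simply has no plot outline would also trip this check, so treat the warning as a hint to inspect the page rather than as proof that the layout has changed.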