
12.2. Searching One by One, Take Two

The search engine we create in this section is much improved. It no longer depends on fgrep to carry out the search, which means we no longer have to invoke a shell, and consequently we will not run into an internal glob limit.

In addition, this application returns the matched content and highlights the query, which makes it considerably more useful.

How does it work? It creates a list of all the HTML files in the specified directory using Perl's own functions, and then iterates over each file searching for a line that contains a match for the query. All matches are stored in an array and are later converted to HTML.

Example 12-2 contains the new program.

Example 12-2. grep_search2.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;
use CGIBook::Error;

my $DOCUMENT_ROOT = $ENV{DOCUMENT_ROOT};
my $VIRTUAL_PATH  = "";

my $q           = new CGI;
my $query       = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a valid query!" );
}

$query = quotemeta( $query );
my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;


sub search {
    my( $q, $query ) = @_;
    my( %matches, @files, @sorted_paths, $results );
    
    local( *DIR, *FILE );
    
    opendir DIR, $DOCUMENT_ROOT or
        error( $q, "Cannot access search dir!" );
        
    @files = grep { -T "$DOCUMENT_ROOT/$_" } readdir DIR;
    closedir DIR;
    
    foreach my $file ( @files ) {
        my $full_path = "$DOCUMENT_ROOT/$file";
        
        open FILE, $full_path or
            error( $q, "Cannot process $file!" );
        
        while ( <FILE> ) {
            if ( /$query/io ) {
                $_ = html_escape( $_ );
                s|($query)|<B>$1</B>|gio;
                push @{ $matches{$full_path}{content} }, $_;
                $matches{$full_path}{file} = $file;
                $matches{$full_path}{num_matches}++;
            }
        }
        close FILE;
    }
    
    @sorted_paths = sort {
                        $matches{$b}{num_matches} <=>
                        $matches{$a}{num_matches} ||
                        $a cmp $b
                    } keys %matches;
    
    foreach my $full_path ( @sorted_paths ) {
        my $file        = $matches{$full_path}{file};
        my $num_matches = $matches{$full_path}{num_matches};
        
        my $link = $q->a( { -href => "$VIRTUAL_PATH/$file" }, $file );
        my $content = join $q->br, @{ $matches{$full_path}{content} };
        
        $results .= $q->p( $q->b( $link ) . " ($num_matches matches)" .
                           $q->br . $content
                    );
    }
    
    return $results;
}


sub html_escape {
    my( $text ) = @_;
    
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    
    return $text;
}

This program starts out like our previous example. Because we search for the query ourselves, without exposing it to the shell, we no longer have to strip any characters from it. Instead, we escape any characters that have special meaning in a regular expression by calling Perl's quotemeta function.
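To see what quotemeta does, here is a tiny stand-alone example; the query string is made up:

#!/usr/bin/perl -w
use strict;

# quotemeta backslash-escapes every character that is not a word
# character, so regex metacharacters in the query are matched literally
my $query = "C++ (beta)";              # example query only
my $safe  = quotemeta( $query );

print "$safe\n";                       # prints: C\+\+\ \(beta\)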

The opendir function opens the specified directory and returns a handle that we can use to get a list of all the files in that directory. It's a waste of time to search through binary files, such as sounds and images, so we use Perl's grep function (not to be confused with the Unix grep and fgrep applications) to filter them out.

In this context, the grep function iterates over a list of filenames returned by readdir -- setting $_ for each element -- and evaluates the expression specified within the braces, returning only the elements for which the expression is true.

We call readdir in list context so that we can pass the list of all files in the directory to grep for processing. There is one wrinkle with this approach: readdir returns only the name of each file, not its full path, so we have to construct a full path before we can pass it to the -T operator. We use the $DOCUMENT_ROOT variable to create the full path to the file.

The -T operator returns true if the file is a text file. After grep finishes processing all the files, @files will contain a list of all the text files.
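Here is the same filtering step as a minimal stand-alone script, assuming an example document directory:

#!/usr/bin/perl -w
use strict;

# Stand-alone version of the filtering step; the directory is only an example
my $dir = "/usr/local/apache/htdocs";

opendir DIR, $dir or die "Cannot open $dir: $!";

# readdir returns bare filenames, so we prepend the directory name
# before handing each entry to -T, which is true only for text files
my @files = grep { -T "$dir/$_" } readdir DIR;
closedir DIR;

print "$_\n" foreach @files;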

We iterate through the @files array, setting $file to the current value each time through the loop. We proceed to open the file, making sure to return an error if we cannot open it, and iterate through it one line at a time.

The %matches hash is keyed by the file's full path, and each entry contains three elements: file to store the name of the file, num_matches to store the number of matching lines, and a content array to hold the lines that matched. We need the filename for output purposes.
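In other words, %matches is a hash of hashes. For illustration, here is a hypothetical snapshot after scanning a single file; the path and the matched lines are invented:

#!/usr/bin/perl -w
use strict;

# Hypothetical contents of %matches after scanning one file; the path
# and the matched lines are made up purely for illustration
my %matches = (
    "/usr/local/apache/htdocs/recipes.html" => {
        file        => "recipes.html",        # name used to build the link
        num_matches => 2,                     # number of matching lines
        content     => [                      # the matching lines themselves
            "A classic <B>plum</B> cake ...",
            "Use only ripe <B>plum</B>s ...",
        ],
    },
);

print "$_: $matches{$_}{num_matches} matching line(s)\n" for keys %matches;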

We use a simple case-insensitive regex to search for the query. The o option compiles the regex only once, which greatly improves the speed of the search. Note that this will cause problems for scripts running under mod_perl or FastCGI, which we'll discuss later in Chapter 17, "Efficiency and Optimization".
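If you do need the script to survive repeated queries in a persistent environment, one alternative, not used in Example 12-2, is to precompile the pattern with qr// and drop the o option. A minimal sketch, using a made-up query and some inline test data:

#!/usr/bin/perl -w
use strict;

# Alternative to the /o flag: compile the pattern with qr//; unlike /o,
# the compiled pattern is rebuilt whenever $query changes, so it stays
# correct across requests under mod_perl or FastCGI
my $query   = quotemeta( "plum" );     # example query
my $pattern = qr/$query/i;

while ( my $line = <DATA> ) {
    print $line if $line =~ /$pattern/;
}

__DATA__
A classic plum cake recipe
Nothing to see here
Use ripe Plums only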

If the line contains a match, we escape characters that could be mistaken for HTML markup, wrap the matched text in bold tags, push the line onto that file's content array, and increment that file's match counter.
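Here is one line run through both steps in isolation, using an invented query and line of text:

#!/usr/bin/perl -w
use strict;

# Escape HTML special characters first, then wrap each match in <B> tags;
# the parentheses capture the match so its original case is preserved
my $query = quotemeta( "plum" );       # example query
my $line  = "Plums & <i>more</i> plums";

$line =~ s/&/&amp;/g;
$line =~ s/</&lt;/g;
$line =~ s/>/&gt;/g;
$line =~ s|($query)|<B>$1</B>|gi;

print "$line\n";
# prints: <B>Plum</B>s &amp; &lt;i&gt;more&lt;/i&gt; <B>plum</B>s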

After we have finished looking through the files, we sort the results by the number of matches found, in decreasing order, and then alphabetically by path for files with the same number of matches.
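The sort can be tried on its own with some made-up match counts:

#!/usr/bin/perl -w
use strict;

# Invented match counts to show the ordering: most matches first,
# ties broken by comparing the paths alphabetically
my %matches = (
    "/docs/apple.html" => { num_matches => 2 },
    "/docs/plum.html"  => { num_matches => 5 },
    "/docs/cake.html"  => { num_matches => 2 },
);

my @sorted_paths = sort {
                       $matches{$b}{num_matches} <=>
                       $matches{$a}{num_matches} ||
                       $a cmp $b
                   } keys %matches;

print "$_\n" foreach @sorted_paths;
# prints /docs/plum.html, then /docs/apple.html, then /docs/cake.html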

To generate our results, we walk through our sorted list. For each file, we create a link and display the number of matches and all the lines that matched the query. Since the content exists as individual elements in an array, we join all the elements together into one large string delimited by an HTML break tag.
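As a small stand-alone illustration of the join, with invented content lines (CGI.pm's br method supplies the HTML break tag):

#!/usr/bin/perl -w
use strict;
use CGI;

my $q = new CGI;

# Join the matched lines into one string, separated by break tags
my @content = ( "first matching line", "second matching line" );
my $content = join $q->br, @content;

print "$content\n";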

Now, let us improve on this application a bit by allowing users to specify regular expression searches. We will not present the entire application, since it is very similar to the one we have just covered.

12.2.1. Regex-Based Search Engine

By allowing users to specify regular expressions in their search, we make the search engine much more powerful. For example, a user who wants to search for the recipe for Zwetschgendatschi (a Bavarian plum cake) from your online collection, but is not sure of the exact spelling, could simply enter Zwet.+?chi to find it.

In order to implement this functionality, we have to add several pieces to the search engine.

First, we need to modify the HTML file to provide an option for the user to turn the functionality on or off:

Regex Searching: 
    <INPUT TYPE="radio" NAME="regex" VALUE="on">On
    <INPUT TYPE="radio" NAME="regex" VALUE="off" CHECKED>Off

Then, we need to check for this value in the application and act accordingly. Here is the beginning of the new search script:

#!/usr/bin/perl -wT

use strict;
use CGI;
use CGIBook::Error;

my $q     = new CGI;
my $regex = $q->param( "regex" );
my $query = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a query!" );
}

if ( $regex eq "on" ) {
    eval { "" =~ /$query/ };
    error( $q, "Invalid Regex" ) if $@;
}
else {
    $query = quotemeta $query;
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Regex Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;
.
.

The rest of the code remains the same. The difference is that we check whether the user chose the "regex" option and, if so, try the user-supplied pattern against an empty string inside an eval block. If the pattern is invalid, the match dies and Perl stores the error message in $@, which we examine in order to report the problem. If the pattern is valid, we can use it directly, without quoting its metacharacters. If the "regex" option was not selected, we perform the search as before.
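Here is the validation step in isolation, with two made-up patterns, one valid and one not:

#!/usr/bin/perl -w
use strict;

# An invalid pattern makes the match die inside eval, and the error
# message lands in $@ instead of killing the script
foreach my $pattern ( "Zwet.+?chi", "broken(" ) {
    eval { "" =~ /$pattern/ };
    print $@ ? "Invalid regex: $pattern\n"
             : "Valid regex:   $pattern\n";
}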

As you can see, both of these applications are much improved over the first one, but neither one of them is perfect. Since both of them are based on a linear search algorithm, the search process will be slow when dealing with directories that contain many files. They also search only one directory. They could be modified to recurse down through subdirectories, but that would decrease the performance even more. In the next section, we will look at an index-based approach that calls for creating a dictionary of relevant words in advance, and then searching it rather than the actual files.
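Before moving on, here is a rough sketch of how the file list could be gathered recursively using the standard File::Find module; the starting directory is just an example:

#!/usr/bin/perl -w
use strict;
use File::Find;

# Sketch only: gather the file list recursively instead of with a
# single readdir on one directory
my $root = "/usr/local/apache/htdocs";
my @files;

find( sub {
    # $File::Find::name holds the full path of the current entry;
    # keep only plain text files, as the non-recursive version does
    push @files, $File::Find::name
        if -f $File::Find::name && -T $File::Find::name;
}, $root );

print "$_\n" foreach @files;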


