13.3. Using Regular Expressions

If you're a servlet programmer with a background in Perl-based CGI scripting and you're still smitten with Perl's regular expression capabilities, this section is for you. Here we show how to use Perl 5 regular expressions from within Java. For those unfamiliar with them, regular expressions are a compact pattern language that allows sophisticated string matching and manipulation with minimal code. They are explained in depth in the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly).

With all the classes and capabilities Sun has added in JDK 1.1 and JDK 1.2, one feature still absent is a regular expression engine. Ah, well, not to worry. As with most Java features, if you can't get it from Sun, a third-party vendor is probably offering what you need at a reasonable price.

Several companies offer full-featured regular expression engines. One of the first was Thought, Inc., which developed VanillaSearch. It's available for trial download and purchase at http://www.thoughtinc.com. More recently, Original Reusable Objects, Inc. has come out with a product called OROMatcher (along with a utility package built using OROMatcher called PerlTools). These products are available for download at http://www.oroinc.com. A binary license to use OROMatcher and PerlTools is being offered absolutely free. Support, source, and "mere" redistribution (that is, as added value to an IDE) cost extra.

13.3.1. Improving Deblink with Regular Expressions

To demonstrate the use of regular expressions, let's use OROMatcher and PerlTools to rewrite the Deblink servlet originally shown in Chapter 2, "HTTP Servlet Basics". As you may recall, Deblink acted as a filter to remove the <BLINK> and </BLINK> tags from HTML pages. The original Deblink code is shown in Example 13-4 to help refresh your memory.

Example 13-4. The original Deblink

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class Deblink extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res) 
                               throws ServletException, IOException {

    String contentType = req.getContentType();  // get the incoming type
    if (contentType == null) return;  // nothing incoming, nothing to do
    res.setContentType(contentType);  // set outgoing type to be incoming type

    PrintWriter out = res.getWriter();

    BufferedReader in = req.getReader();

    String line = null;
    while ((line = in.readLine()) != null) {
      line = replace(line, "<BLINK>", "");
      line = replace(line, "</BLINK>", "");
      out.println(line);
    }
  }

  public void doPost(HttpServletRequest req, HttpServletResponse res)
                                throws ServletException, IOException {
    doGet(req, res);
  }

  private String replace(String line, String oldString, String newString) {
    int index = 0;
    while ((index = line.indexOf(oldString, index)) >= 0) {
      // Replace the old string with the new string (inefficiently)
      line = line.substring(0, index) +
             newString +
             line.substring(index + oldString.length());
      index += newString.length();
    }
    return line;
  }
}

As we pointed out in Chapter 2, "HTTP Servlet Basics", this version of Deblink has one serious limitation: it's case sensitive. It won't remove <blink>, </blink>, <Blink>, or </Blink>. Sure, we could enumerate inside Deblink all the case combinations that should be removed, but regular expressions provide a much simpler alternative.

With a single regular expression, we can rewrite Deblink to remove the opening and closing blink tags, no matter how they are capitalized. The regular expression we'll use is "</?blink>". This matches both <blink> and </blink>. (The ? character means the previous character is optional.) With a case-insensitive mask set, this expression also matches <BLINK>, </Blink>, and even <bLINK>. Any occurrence of this regular expression can be replaced with the empty string, to completely deblink an HTML page. The rewritten Deblink code appears in Example 13-5.
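Before turning to the servlet itself, the behavior of the expression can be tried in isolation. The following is a minimal sketch that swaps in the java.util.regex package rather than OROMatcher. That package is not part of the JDK 1.1 and JDK 1.2 releases discussed in this section (it arrived in a later JDK), so the sketch assumes a newer Java environment. The class name BlinkPatternDemo and the method name deblink are illustrative only.

```java
import java.util.regex.Pattern;

// Illustrative sketch: uses java.util.regex, not the OROMatcher/PerlTools API.
final class BlinkPatternDemo {

  // Same expression as in the text. CASE_INSENSITIVE plays the role of
  // the case-insensitive mask.
  static final Pattern BLINK =
      Pattern.compile("</?blink>", Pattern.CASE_INSENSITIVE);

  // Replace every opening or closing blink tag with the empty string.
  static String deblink(String line) {
    return BLINK.matcher(line).replaceAll("");
  }

  public static void main(String[] args) {
    // Mixed capitalization, opening and closing tags.
    System.out.println(deblink("<BLINK>Sale!</Blink> ends <bLINK>today</blink>"));
    // prints: Sale! ends today
  }
}
```

Because the ? applies only to the slash, a tag such as <blinker> is left untouched: the expression requires the closing > immediately after "blink".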

Example 13-5. Deblink rewritten using regular expressions

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

import com.oroinc.text.perl.*;  // PerlTools package

public class Deblink extends HttpServlet {

  Perl5Util perl = new Perl5Util();

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {

    String contentType = req.getContentType();  // get the incoming type
    if (contentType == null) return;  // nothing incoming, nothing to do
    res.setContentType(contentType);  // set outgoing type to be incoming type

    PrintWriter out = res.getWriter();

    BufferedReader in = req.getReader();

    try {
      String line = null;
      while ((line = in.readLine()) != null) {
        if (perl.match("#</?blink>#i", line))
          line = perl.substitute("s#</?blink>##ig", line);
        out.println(line);
      }
    }
    catch(MalformedPerl5PatternException e) { // only thrown during development
      log("Problem compiling a regular expression: " + e.getMessage());
    }
  }

  public void doPost(HttpServletRequest req, HttpServletResponse res)
                                throws ServletException, IOException {
    doGet(req, res);
  }
}

The most important lines of this servlet are the lines that replace our "</?blink>" expression with the empty string:

if (perl.match("#</?blink>#i", line))
  line = perl.substitute("s#</?blink>##ig", line);

The first line does a case-insensitive search for the regular expression </?blink>. The syntax is exactly like Perl's. It may look slightly unfamiliar, though, because we chose to use hash marks instead of slashes as delimiters. This avoids having to escape the slash that's part of the expression, which would result in "/<\\/?blink>/i" (the backslash is doubled because it appears inside a Java string literal). The trailing "i" indicates the regular expression is case insensitive.
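The escaping issue can be seen outside PerlTools as well. The sketch below again uses java.util.regex as a stand-in (an assumption, since the section itself uses OROMatcher). That package has no pattern delimiters, so the slash never needs escaping there, but an escaped slash is still accepted and shows how the backslash must be doubled inside a Java string literal. The class name EscapedSlashDemo is hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch: uses java.util.regex, not the OROMatcher/PerlTools API.
final class EscapedSlashDemo {

  // The regex text is <\/?blink> ; the Java source needs "\\" to produce
  // the single backslash that the regex engine sees.
  static final Pattern ESCAPED =
      Pattern.compile("<\\/?blink>", Pattern.CASE_INSENSITIVE);

  // The same expression without the (unnecessary) escape.
  static final Pattern PLAIN =
      Pattern.compile("</?blink>", Pattern.CASE_INSENSITIVE);

  // True when both forms agree on whether the line contains a blink tag.
  static boolean sameResult(String line) {
    return ESCAPED.matcher(line).find() == PLAIN.matcher(line).find();
  }

  public static void main(String[] args) {
    System.out.println(ESCAPED.pattern());             // prints: <\/?blink>
    System.out.println(sameResult("<p></BLINK></p>")); // prints: true
  }
}
```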

The second line substitutes all occurrences of the regular expression with the empty string. This line alone would accomplish the same as both lines together, but it's more efficient to do the check first. The syntax is also identical to Perl's. The text between the first pair of hashes is the regular expression to search for. The text between the second pair is the replacement text, which here is empty. The trailing "g" indicates that all occurrences should be replaced (without it, only the first occurrence in each line is replaced).
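The difference the "g" modifier makes, and the check-then-substitute idiom, can be sketched with java.util.regex. This is a stand-in for the PerlTools calls, assuming a JDK that includes that package, and not the OROMatcher API itself. The names SubstituteDemo, substituteFirst, and substituteAll are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: uses java.util.regex, not the OROMatcher/PerlTools API.
final class SubstituteDemo {

  static final Pattern BLINK =
      Pattern.compile("</?blink>", Pattern.CASE_INSENSITIVE);

  // Rough analogue of s#</?blink>##i : only the first match is replaced.
  static String substituteFirst(String line) {
    return BLINK.matcher(line).replaceFirst("");
  }

  // Rough analogue of the match-then-substitute pair with the "g" modifier:
  // lines without a match are returned as-is, otherwise every match is removed.
  static String substituteAll(String line) {
    Matcher m = BLINK.matcher(line);
    if (!m.find()) {
      return line;           // nothing to replace, skip the substitution
    }
    return m.replaceAll(""); // replaceAll resets the matcher before scanning
  }

  public static void main(String[] args) {
    String line = "<BLINK>hi</BLINK>";
    System.out.println(substituteFirst(line)); // prints: hi</BLINK>
    System.out.println(substituteAll(line));   // prints: hi
  }
}
```

Whether the explicit check actually saves time depends on the engine and the input, so treat it here as a mirror of the servlet's structure rather than a measured optimization.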

For more information on what can be done with regular expressions in Java, see the documentation that comes with each of the third-party products.



Copyright © 2001 O'Reilly & Associates. All rights reserved.
