8.6. Monitoring

The key to running a successful project is to be in control. System information must be regularly collected for historical and statistical purposes, and real-time notification must be available for when something goes wrong.

8.6.1. File Integrity

One of the system security best practices demands that every machine make use of an integrity checker, such as Tripwire, to monitor file integrity. The purpose of an integrity checker is to detect an intruder early, so you can act quickly and contain the intrusion.

As a special case, integrity checkers can be applied against the user files in the web server tree. I believe Tripwire was among the first to offer such a product, in the form of an Apache module. The product was discontinued, probably because of the frequent changes that take place on most web sites. Of what use is a security measure that triggers the alarm daily? Besides, many web sites construct pages dynamically, with the content stored in databases, so the files on disk are not that relevant any more. Still, in a few cases where reputation is extremely important (e.g., for governments), this approach has some merit.

8.6.2. Event Monitoring

The first thing to consider when it comes to event monitoring is whether to implement real-time monitoring. Real-time monitoring sounds fancy, but unless an effort is made to turn it into a useful tool, it can do more harm than good.
When real-time monitoring goes bad, real problems often go undetected because of too many false positives: a flood of irrelevant alerts quickly teaches everyone to ignore the monitoring system altogether.
Problems like these are not something I invented to prove a point; numerous administrative and development teams suffer in exactly this way. They can be resolved by making periodic reporting the default and reserving real-time notification for events that genuinely require immediate attention.
8.6.2.1 Periodic reporting

One way to implement periodic monitoring is to use the concept of Artificial Ignorance, invented by Marcus J. Ranum. (The original email message on the subject is at http://www.ranum.com/security/computer_security/papers/ai/.) The process starts with raw logs: discard the entries known to be harmless, strip the remaining entries of their unique features (timestamps, client addresses, process IDs), then sort them and count the duplicates. Whatever remains, you read.
The idea is to uncover a specific type of event, but without the specifics. The numerical value is used to assess the seriousness of the situation. Here is the same logic implemented as a Perl script (I call it error_log_ai) that you can use:

#!/usr/bin/perl -w

# loop through the lines that are fed to us
while (defined($line = <STDIN>)) {
    # ignore "noisy" lines
    if (!( ($line =~ /Processing config/)
        || ($line =~ /Server built/)
        || ($line =~ /suEXEC/) )) {
        # remove unique features of log entries
        $line =~ s/^\[[^]]*\] //;
        $line =~ s/\[client [^]]*\] //;
        $line =~ s/\[unique_id [^]]*\]//;
        $line =~ s/child pid [0-9]*/child pid X/;
        $line =~ s/child process [0-9]*/child process X/;
        # add to the list for later
        push(@lines, $line);
    }
}

@lines = sort @lines;

# replace multiple occurrences of the same line
$count = 0;
$prevline = "";
foreach $line (@lines) {
    next if ($line =~ /^$/);
    if (!($line eq $prevline)) {
        if ($count != 0) {
            $prefix = sprintf("%5i", $count);
            push(@outlines, "$prefix $prevline");
        }
        $count = 1;
        $prevline = $line;
    } else {
        $count++;
    }
}
# do not forget the final group of lines
if ($count != 0) {
    $prefix = sprintf("%5i", $count);
    push(@outlines, "$prefix $prevline");
}
undef @lines;

@outlines = sort @outlines;
print "--httpd begin------\n";
print reverse @outlines;
print "--httpd end--------\n";

The script is designed to take input from stdin and send output to stdout, so it is easy to combine it with other tools on the command line:

# cat error_log | error_log_ai.pl | mail ivanr@webkreator.com

From the following example of daily output, you can see how a long error log file was condensed into a few lines that tell you what happened:

--httpd begin------
   38 [notice] child pid X exit signal Segmentation fault (11)
   32 [info] read request line timed out
   24 [error] File does not exist: /var/www/html/403.php
   19 [warn] child process X did not exit, sending another SIGHUP
    6 [notice] Microsoft-IIS/5.0 configured -- resuming normal operations
    5 [notice] SIGHUP received. Attempting to restart
    4 [error] File does not exist: /var/www/html/test/imagetest.GIF
    1 [info] read request headers timed out
--httpd end--------

8.6.2.2 Swatch

Swatch (http://swatch.sourceforge.net) is a program built around Perl and regular expressions. It monitors log files for events and evaluates them against expressions in its configuration file. Incoming events are evaluated against positive (take action on event) and negative (ignore event) regular expressions. Positive matches result in one or more actions taking place. A Swatch configuration file designed to detect denial of service (DoS) attacks by examining the error log could look like this:

# Ignore requests with 404 responses
ignore /File not found/

# Notify me by email about mod_security events,
# but not more than once every hour
watchfor /mod_security/
    throttle 1:00:00
    mail ivanr@webkreator.com,subject=Application attack

# Notify me by email whenever the server
# runs out of processes - could be a DoS attack
watchfor /MaxClients reached/
    mail ivanr@webkreator.com,subject=DoS attack

Swatch is easy to learn and use. It does not offer event correlation, but it does offer the throttle keyword (used in the previous example), which prevents too many actions from taking place.
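To put the configuration to work, Swatch is typically pointed at a log file and left running. A possible invocation, assuming the rules above were saved as /etc/swatch/httpd.conf (both file paths here are my choice, not from the text):

$ swatch --config-file=/etc/swatch/httpd.conf \
    --tail-file=/var/log/httpd/error_log

The --tail-file option makes Swatch follow the log as new entries are appended, evaluating each one against the configured expressions.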
8.6.2.3 Simple Event Correlator

Simple Event Correlator (SEC, available from http://www.estpak.ee/~risto/sec/) is the tool to use when you outgrow Swatch and need real event correlation. Do not let the word "simple" in the name fool you; SEC is a very powerful tool. Consequently, it can be a bit difficult to configure. It works on the same principles as Swatch, but it keeps track of events and uses that information when evaluating future events. I will give a few examples of SEC to demonstrate its capabilities. SEC is based around several types of rules (such as Calendar, PairWithWindow, and SingleWithThreshold, all used below), which are applied to events.
Do not worry if the rule types look confusing at first; read the examples below a couple of times and they will start to make sense. I have prepared them to put the rules in the context of what we do here. The following two rules cause SEC to wait for a nightly backup and alert the administrator if it does not happen:

# At 01:59 start waiting for the backup operation
# that takes place at 02:00 every night. The time is
# in a standard cron schedule format.
type = Calendar
time = 59 1 * * *
desc = WAITING FOR BACKUP
action = event %s

# This rule will be triggered by the previous rule;
# it will wait 31 minutes for the backup to
# arrive, and notify the administrator if it does not.
type = PairWithWindow
ptype = SubStr
pattern = WAITING FOR BACKUP
desc = BACKUP FAILED
action = shellcmd notify.pl "%s"
ptype2 = SubStr
pattern2 = BACKUP COMPLETED
desc2 = BACKUP COMPLETED
action2 = none
window = 1860

The following rule counts failed login attempts and notifies the administrator when the number of attempts reaches six within one hour. The shell script could also be used to disable login completely from that IP address.

type = SingleWithThreshold
ptype = RegExp
pattern = LOGIN FAILED, IP=([0-9.]+)
window = 3600
thresh = 6
desc = Login failed from IP: $1
action = shellcmd notify.pl "Too many login attempts from: $1"

SEC uses the description of the event to distinguish between series of events. Because I have included the IP address in the preceding description, the rule, in practice, monitors each IP address separately. Therefore, it may be a good idea to add another rule to watch the total number of failed login attempts during the same time interval:

type = SingleWithThreshold
ptype = RegExp
pattern = LOGIN FAILED, IP=([0-9.]+)
window = 3600
thresh = 24
desc = Login failed (overall)
action = shellcmd notify.pl "Too many login attempts"

This rule would detect a distributed brute-force attack.
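The rules above delegate notification to a helper script, notify.pl, which the text does not show. Here is a minimal sketch of what such a helper might look like, assuming notification by email (the recipient address and the sendmail path are assumptions):

#!/usr/bin/perl -w
# notify.pl - minimal notification helper for the SEC rules above;
# it mails its first argument to the administrator
use strict;

my $message = shift @ARGV;
defined($message) or die "Usage: notify.pl <message>\n";

# assumed recipient and sendmail location
my $admin = 'ivanr@webkreator.com';
my $sendmail = '/usr/sbin/sendmail';

# with -t, sendmail takes the recipient from the headers we print
open(my $mail, '|-', $sendmail, '-t')
    or die "Cannot run $sendmail: $!\n";
print $mail "To: $admin\n";
print $mail "Subject: SEC alert\n";
print $mail "\n";
print $mail "$message\n";
close($mail);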
8.6.3. Web Server Status

In an ideal world, you would monitor your Apache installations via a Network Management System (NMS), as you would monitor other network devices and applications. However, Apache does not support the Simple Network Management Protocol (SNMP). (There is a commercial version of the server, Covalent Apache, that does.) There are two third-party modules that implement limited SNMP functionality.

My experiences with these modules are mixed. The last time I tried mod_snmp, the patch did not apply cleanly to recent Apache versions. In the absence of reliable SNMP support, we will have to use the built-in module mod_status for server monitoring. Though this module helps, it comes at the cost of having to build our own tools to automate monitoring. The good news is that I have built the tools, which you can download from the book's web site.

The configuration code for mod_status is probably already present in your httpd.conf file (unless you have created the configuration file from scratch). Find and uncomment the code, replacing the YOUR_IP_ADDRESS placeholder with the IP address (or range) from which you will be monitoring the server:

# increase information presented
ExtendedStatus On

<Location /server-status>
    SetHandler server-status
    Order Deny,Allow
    Deny from all
    # you don't want everyone to see what
    # the web server is doing
    Allow from YOUR_IP_ADDRESS
</Location>

When you open this location in a browser from a machine within the allowed range, you get the server status details. The Apache Foundation has made their server status public (via http://www.apache.org/server-status/), and since their activity is more interesting than anything I have, I used it for the screenshot shown in Figure 8-1.

Figure 8-1. mod_status gives server status information

There is plenty of information available; you can even see which requests are being executed at that moment. This type of output can be very useful for troubleshooting, but it does not help us with our primary requirement, which is monitoring. Fortunately, if the string ?auto is appended to the URL, a different type of output is produced, shown in Figure 8-2. This variant is easy to parse with a computer program.

Figure 8-2. Machine-parsable mod_status output variant
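For reference, the ?auto variant returns one name/value pair per line, separated by colons. It looks roughly like this (the values below are made up for illustration):

Total Accesses: 845068
Total kBytes: 5187042
CPULoad: .0461159
Uptime: 595601
ReqPerSec: 1.41884
BytesPerSec: 8918.65
BytesPerReq: 6286.22
BusyWorkers: 8
IdleWorkers: 16
Scoreboard: __WW_R__KK_C____D_L_____....................

The first nine lines correspond to the named fields used in the parsing code below; the Scoreboard line encodes the state of each process slot as a single character, using the legend from the human-readable view in Figure 8-1.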
In the following sections, we will build a Perl program that collects information from a web server and stores it in an RRD file, and discuss another Perl program that can produce fancy activity graphs. Both programs are available from the web site for this book.

8.6.3.1 Fetching and storing statistics

We need to understand what data we have available. Looking at the screenshot (Figure 8-2), the first nine fields are easy to spot, since each is presented on its own line. Then comes the scoreboard, which lists all processes (or threads) and tells us what each process is doing. The legend can be seen in the first screenshot, Figure 8-1. The scoreboard is not useful to us in the given format, but we can count how many times each activity occurs in it and create 10 more variables for storing this information. That gives us a total of 19 variables containing information obtained from the mod_status machine-parsable output.

First, we write the part of the Perl program that fetches and parses the mod_status output. By relying on existing Perl libraries for HTTP communication, our script can work with proxies, support authentication, and even access SSL-protected pages. The following code fetches the page specified by $url:

# use the LWP libraries for HTTP communication
use LWP::UserAgent;
use HTTP::Request;

# fetch the page
my $ua = new LWP::UserAgent;
$ua->timeout(30);
$ua->agent("apache-monitor/1.0");
my $request = HTTP::Request->new(GET => $url);
my $response = $ua->request($request);

Parsing the output is fairly simple. Watch out for the incompatibility between the mod_status output in Apache 1 and Apache 2:

# Fetch the named fields first, populating the results
# associative array. Each line in the output results in an
# element in the array. Each element has a key that is the text
# preceding the colon in a line of the output, and a value that
# is whatever appears after any whitespace after the colon.
my %results = split /:\s*|\n/, $response->content;

# There is a slight incompatibility between
# Apache 1 and Apache 2, so the following makes
# the results consistent between the versions. Apache 2 uses
# the term "BusyWorkers" where Apache 1 uses "BusyServers".
if ($results{"BusyServers"}) {
    $results{"BusyWorkers"} = $results{"BusyServers"};
    $results{"IdleWorkers"} = $results{"IdleServers"};
}

# Count the occurrences of certain characters in the scoreboard
# by using the translation operator to find and replace each
# particular character (with itself) and return the number of
# replacements.
$results{"s__"} = $results{"Scoreboard"} =~ tr/_/_/;
$results{"s_s"} = $results{"Scoreboard"} =~ tr/S/S/;
$results{"s_r"} = $results{"Scoreboard"} =~ tr/R/R/;
$results{"s_w"} = $results{"Scoreboard"} =~ tr/W/W/;
$results{"s_k"} = $results{"Scoreboard"} =~ tr/K/K/;
$results{"s_d"} = $results{"Scoreboard"} =~ tr/D/D/;
$results{"s_c"} = $results{"Scoreboard"} =~ tr/C/C/;
$results{"s_l"} = $results{"Scoreboard"} =~ tr/L/L/;
$results{"s_g"} = $results{"Scoreboard"} =~ tr/G/G/;
$results{"s_i"} = $results{"Scoreboard"} =~ tr/I/I/;

After writing this code, I realized some of the fields mod_status gave me were not very useful. ReqPerSec, BytesPerSec, and BytesPerReq are calculated over the lifetime of the server and remain practically constant after a certain time period elapses. To get around this problem, I decided to keep the output from the previous run and compute these statistics manually, by comparing the current values of the Total Accesses and Total kBytes fields with the previous ones in relation to the amount of time between runs. The code for doing this can be seen in the program (apache-monitor) on the book's web site.
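The gist of that calculation is easy to sketch. In the sketch below, read_state and write_state are hypothetical helpers for keeping the previous run's totals in a state file; none of these names come from the actual apache-monitor code:

# load the time and counters recorded by the previous run
my ($prev_time, $prev_accesses, $prev_kbytes) = read_state($state_file);

my $interval = $time - $prev_time;
if ($interval > 0) {
    # requests per second since the last run
    $results{"ReqPerSec"} =
        ($results{"Total Accesses"} - $prev_accesses) / $interval;
    # bytes per second since the last run (the counter is in kilobytes)
    $results{"BytesPerSec"} =
        ($results{"Total kBytes"} - $prev_kbytes) * 1024 / $interval;
    # average response size over the same period
    $results{"BytesPerReq"} = $results{"ReqPerSec"} > 0
        ? $results{"BytesPerSec"} / $results{"ReqPerSec"}
        : 0;
}

# remember the current totals for the next run
write_state($state_file, $time,
    $results{"Total Accesses"}, $results{"Total kBytes"});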
Next, we store the data into an RRD file so that it can be processed by the RRD tools. We need to test whether the desired RRD file (specified by $rrd_name in the following) exists and create it if it does not:

# the RRDs module is distributed with RRDtool
use RRDs;

if (!-e $rrd_name) {
    # create the RRD file since it does not exist
    RRDs::create($rrd_name,
        # store data at 60 second intervals
        "-s 60",
        # data fields. Each line defines one data source (DS)
        # that stores the measured value (GAUGE) at maximum 10 minute
        # intervals (600 seconds), and takes values from zero
        # to infinity (U).
        "DS:totalAccesses:GAUGE:600:0:U",
        "DS:totalKbytes:GAUGE:600:0:U",
        "DS:cpuLoad:GAUGE:600:0:U",
        "DS:uptime:GAUGE:600:0:U",
        "DS:reqPerSec:GAUGE:600:0:U",
        "DS:bytesPerSec:GAUGE:600:0:U",
        "DS:bytesPerReq:GAUGE:600:0:U",
        "DS:busyWorkers:GAUGE:600:0:U",
        "DS:idleWorkers:GAUGE:600:0:U",
        "DS:sc__:GAUGE:600:0:U",
        "DS:sc_s:GAUGE:600:0:U",
        "DS:sc_r:GAUGE:600:0:U",
        "DS:sc_w:GAUGE:600:0:U",
        "DS:sc_k:GAUGE:600:0:U",
        "DS:sc_d:GAUGE:600:0:U",
        "DS:sc_c:GAUGE:600:0:U",
        "DS:sc_l:GAUGE:600:0:U",
        "DS:sc_g:GAUGE:600:0:U",
        "DS:sc_i:GAUGE:600:0:U",
        # keep 10080 original samples (one week of data,
        # since one sample is made every minute)
        "RRA:AVERAGE:0.5:1:10080",
        # keep 8760 values calculated by averaging every
        # 60 original samples (each calculated value covers one
        # hour, so 8760 of them come to one year)
        "RRA:AVERAGE:0.5:60:8760"
    );
}

Finally, we add the data to the RRD file:

RRDs::update($rrd_name,
    $time . ":"
    . $results{"Total Accesses"} . ":"
    . $results{"Total kBytes"} . ":"
    . $results{"CPULoad"} . ":"
    . $results{"Uptime"} . ":"
    . $results{"ReqPerSec"} . ":"
    . $results{"BytesPerSec"} . ":"
    . $results{"BytesPerReq"} . ":"
    . $results{"BusyWorkers"} . ":"
    . $results{"IdleWorkers"} . ":"
    . $results{"s__"} . ":"
    . $results{"s_s"} . ":"
    . $results{"s_r"} . ":"
    . $results{"s_w"} . ":"
    . $results{"s_k"} . ":"
    . $results{"s_d"} . ":"
    . $results{"s_c"} . ":"
    . $results{"s_l"} . ":"
    . $results{"s_g"} . ":"
    . $results{"s_i"}
);

8.6.3.2 Graphing

Creating graphs from the information stored in the RRD file is the really fun part of the operation. Everyone loves RRDtool because it produces fabulous graphs with very little effort. For example, the Perl code below creates a graph of the number of active and idle servers throughout a designated time period, such as the third graph shown in Figure 8-3. The graph is stored in a file specified by $pic_name.

RRDs::graph($pic_name,
    "-v Servers",
    "-s $start_time",
    "-e $end_time",
    # extract the busyWorkers field from the RRD file
    "DEF:busy=$rrd_name:busyWorkers:AVERAGE",
    # extract the idleWorkers field from the RRD file
    "DEF:idle=$rrd_name:idleWorkers:AVERAGE",
    # draw a filled area in blue
    "AREA:busy#0000ff:Busy servers",
    # draw a line in green
    "LINE2:idle#00ff00:Idle servers"
);

Figure 8-3. Graphs representing web server activity

I decided to create four graphs out of the available data: the number of busy and idle servers, the number of hits, the amount of data transferred, and the scoreboard activity.
The graphs are shown in Figure 8-3. You may want to create other graphs, such as ones showing the uptime and the CPU load.

Note: A live view of the web server statistics for apache.org is available at http://www.apachesecurity.net/stats/, where it will remain for as long as the Apache Foundation keeps their mod_status output public.

8.6.3.3 Using the scripts

Two scripts, parts of which were shown above, are used to record the statistics and create the graphs. Both are available from the web site for this book. The first script, apache-monitor, fetches statistics from a server and stores them. It expects two parameters: the first specifies the (RRD) file in which the results should be stored, and the second specifies the web page from which server statistics are obtained. Here is a sample invocation:

$ apache-monitor /var/www/stats/apache.org http://www.apache.org/server-status/

For a web page that requires a username and password, you can embed these directly in the URL (e.g., http://username:password@www.example.com/server-status/). The script is smart enough to create a new RRD file if one does not exist. To get detailed statistics of the web server activity, configure cron to execute this script once a minute.

The second script, apache-monitor-graph, draws graphs for a given RRD file. It needs to know the path to the RRD file (the first parameter), the output folder (the second parameter), and the duration in seconds that the graphs need to cover (the third parameter). The script calculates the starting time by deducting the given duration from the present time. The following invocation will create graphs for the last six hours:

$ apache-monitor-graph /var/www/stats/apache.org /var/www/stats/ 21600

Four files will be created and stored in the output folder, each containing a single graph:

$ cd /var/www/stats
$ ls
apache.org_servers-21600.gif   apache.org_hits-21600.gif
apache.org_transfer-21600.gif  apache.org_scoreboard-21600.gif

You will probably want to create several sets of graphs covering different time periods. Use the values in seconds from Table 8-9 (for example, 3600 for an hour, 86400 for a day, or 604800 for a week).
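Both scripts are easily driven from cron. A possible crontab, assuming the scripts were installed in /usr/local/bin and a cron daemon that understands the */5 step syntax:

# collect statistics once a minute
* * * * * /usr/local/bin/apache-monitor /var/www/stats/apache.org http://www.apache.org/server-status/

# redraw the six-hour graphs every five minutes
*/5 * * * * /usr/local/bin/apache-monitor-graph /var/www/stats/apache.org /var/www/stats/ 21600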
Calling the graphing script every five minutes is sufficient. Having created the graphs, you only need to write some HTML code to glue them together if you want to show multiple graphs on a single page (see Figure 8-3).

8.6.3.4 mod_watch

mod_status was designed to allow for web server monitoring. If you need more granularity, you will have to turn to mod_watch, a third-party module available from http://www.snert.com/mod_watch/. This module can provide information for an unlimited number of contexts, such as individual virtual hosts.
For each context, mod_watch provides a set of measured values, such as the number of requests served and the amount of data transferred in and out.
Since this module comes with utility scripts to integrate it with MRTG (a monitoring and graphing tool described at http://people.ee.ethz.ch/~oetiker/webtools/mrtg/), it can be of great value if MRTG has been deployed.