
2.2. Linux Performance Tools: CPU

Here begins our discussion of the performance tools that enable you to extract the information described previously.

2.2.1. vmstat (Virtual Memory Statistics)

vmstat stands for virtual memory statistics, which indicates that it reports on the performance of your system's virtual memory. Fortunately, it actually does much more than that. It tells you

  • How many processes are running

  • How the CPU is being used

  • How many interrupts the CPU receives

  • How many context switches the scheduler performs

It is an excellent tool to use to get a rough idea of how the system performs.

2.2.1.1 CPU Performance-Related Options

vmstat can be invoked with the following command line:


vmstat [-n] [-s] [delay [count]]


vmstat can be run in two modes: sample mode and average mode. If no parameters are specified, vmstat runs in average mode, displaying the average value of each statistic since system boot. However, if a delay is specified, the first sample is still the average since system boot, but after that vmstat samples the system every delay seconds and prints out the statistics. Table 2-1 describes the options that vmstat accepts.

Table 2-1. vmstat Command-Line Options

Option

Explanation

-n

By default, vmstat periodically prints out the column headers for each performance statistic. This option disables that feature so that after the initial header, only performance data displays. This proves helpful if you want to import the output of vmstat into a spreadsheet.

-s

This displays a one-shot, detailed output of the system statistics that vmstat gathers. The statistics are the totals since the system booted.

delay

This is the amount of time between vmstat samples.

count

This is the number of samples to take. If count is not specified, vmstat samples indefinitely.


vmstat provides a variety of different output statistics that enable you to track different aspects of the system performance. Table 2-2 describes those related to CPU performance. The next chapter covers those related to memory performance.

Table 2-2. CPU-Specific vmstat Output

Column

Explanation

r

This is the number of currently runnable processes. These processes are not waiting on I/O and are ready to run. Ideally, the number of runnable processes would match the number of CPUs available.

b

This is the number of processes blocked and waiting for I/O to complete.

forks

This is the number of times a new process has been created.

in

This is the number of interrupts occurring on the system.

cs

This is the number of context switches happening on the system.

us

This is the total CPU time as a percentage spent on user processes (including "nice" time).

sy

This is the total CPU time as a percentage spent in system code. This includes time spent in the system, irq, and softirq states.

wa

This is the total CPU time as a percentage spent waiting for I/O.

id

This is the total CPU time as a percentage that the system is idle.


vmstat provides a good low-overhead view of system performance. Because all the performance statistics are in text form and are printed to standard output, it is easy to capture the data generated during a test and process or graph it later. Because vmstat is such a low-overhead tool, it is practical to keep it running on a console or in a window even on a very heavily loaded server when you need to monitor the health of the system at a glance.
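
Because of this, it is also easy to feed vmstat's samples to other tools. As a minimal sketch (the output path is illustrative), the following pipeline converts ten one-second samples into comma-separated values for a spreadsheet; the awk program drops the first header line (the group labels), keeps the column-name line as a header row, and rewrites each remaining line with commas as separators:

vmstat -n 1 10 | awk -v OFS=',' 'NR > 1 { $1 = $1; print }' > /tmp/vmstat.csv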

2.2.1.2 Example Usage

As shown in Listing 2.2, if vmstat is run with no command-line parameters, it displays the average values for the statistics that it has recorded since the system booted. This example shows that the system was nearly idle since boot, as indicated by the CPU usage columns us, sy, id, and wa. The CPU spent 5 percent of the time since boot on user application code, 1 percent on system code, and the remaining 94 percent sitting idle.

Listing 2.2.

[ezolt@scrffy tmp]$ vmstat

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----

r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa

1  0 181024  26284  35292 503048    0    0     3     2    6      1  5  1 94  0


Although vmstat's statistics since system boot can be useful for determining how heavily loaded the machine has been, vmstat is most useful when it runs in sampling mode, as shown in Listing 2.3. In sampling mode, vmstat prints the system statistics every delay seconds, repeating count times. The first line of statistics in Listing 2.3 contains the system averages since boot, as before, but the periodic samples follow after that. This example shows that there is very little activity on the system. We can see from the 0 in the b column that no processes were blocked during the run. We can also see, by looking at the r column, that no more than one process was runnable when vmstat sampled its data.

Listing 2.3.

[ezolt@scrffy tmp]$ vmstat 2 5

procs  -----------memory---------- ---swap-- -----io---- --system-- ----cpu----

r  b    swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa

1  0  181024  26276  35316 502960    0    0     3     2    6     1  5  1 94  0

1  0  181024  26084  35316 502960    0    0     0     0 1318   772  1  0 98  0

0  0  181024  26148  35316 502960    0    0     0    24 1314   734  1  0 98  0

0  0  181024  26020  35316 502960    0    0     0     0 1315   764  2  0 98  0

0  0  181024  25956  35316 502960    0    0     0     0 1310   764  2  0 98  0


vmstat is an excellent way to record how a system behaves under a load or during a test condition. You can use vmstat to display how the system is behaving and, at the same time, save the results to a file by using the Linux tee command. (Chapter 8, "Utility Tools: Performance Tool Helpers," describes the tee command in more detail.) If you only pass in the delay parameter, vmstat samples indefinitely; just start it before the test, and interrupt it after the test has completed. The output file can be imported into a spreadsheet and used to see how the system reacts to the load or to various system events. Listing 2.4 shows the output of this technique. In this example, we can look at the interrupts and context switches that the system is generating, shown in the in and cs columns, respectively.

The number of context switches looks good compared to the number of interrupts. The scheduler is switching processes less often than the timer interrupt fires. This is most likely because the system is nearly idle; most of the time when the timer interrupt fires, the scheduler does not have any work to do, so it does not switch away from the idle process.

(Note: There is a bug in the version of vmstat that generated the following output. It causes the system average line of output to display incorrect values. This bug has been reported to the maintainer of vmstat and will be fixed soon, hopefully.)

Listing 2.4.

[ezolt@scrffy ~/edid]$ vmstat 1 | tee /tmp/output

procs -----------memory---------- ---swap-- -----io----  --system-- ----cpu----

r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs  us sy id wa

0  1 201060  35832  26532 324112    0    0     3     2     6     2  5  1  94  0

0  0 201060  35888  26532 324112    0    0    16     0  1138   358  0  0  99  0

0  0 201060  35888  26540 324104    0    0     0    88  1163   371  0  0 100  0

0  0 201060  35888  26540 324104    0    0     0     0  1133   345  0  0 100  0

0  0 201060  35888  26540 324104    0    0     0    60  1174   351  0  0 100  0

0  0 201060  35920  26540 324104    0    0     0     0  1150   408  0  0 100  0

[Ctrl-C]


More recent versions of vmstat can extract even more detailed information about a grab bag of different system statistics, as shown in Listing 2.5.

The next chapter discusses the memory statistics, but we look at the CPU statistics now. The first group of statistics, the "CPU ticks," shows how the CPU has spent its time since system boot, where a "tick" is a unit of time. Although the condensed vmstat output showed only four CPU states (us, sy, id, and wa), this output shows how all the CPU ticks are distributed. In addition, we can see the total number of interrupts and context switches. One new addition is forks, which is the number of new processes that have been created since system boot. (A sketch after Listing 2.5 shows how these tick counters can be turned into percentages.)

Listing 2.5.

[ezolt@scrffy ~/edid]$ vmstat -s

      1034320  total memory

       998712  used memory

       698076  active memory

       176260  inactive memory

        35608  free memory

        26592  buffer memory

       324312  swap cache

      2040244  total swap

       201060  used swap

      1839184  free swap

      5279633 non-nice user cpu ticks

     28207739 nice user cpu ticks

      2355391 system cpu ticks

    628297350 idle cpu ticks

       862755 IO-wait cpu ticks

           34 IRQ cpu ticks

      1707439 softirq cpu ticks

     21194571 pages paged in

     12677400 pages paged out

        93406 pages swapped in

       181587 pages swapped out

   1931462143 interrupts

    785963213 CPU context switches

   1096643656 boot time

       578451 forks
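
Because these counters are plain text, percentages can be derived from them directly. As a minimal sketch, this one-liner sums every "cpu ticks" counter and reports what fraction of the total were idle ticks:

vmstat -s | awk '/cpu ticks/ { total += $1 } /idle cpu ticks/ { idle = $1 } END { printf "%.1f%% idle since boot\n", 100 * idle / total }'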


vmstat provides a broad range of information about the performance of a Linux system. It is one of the core tools to use when investigating a problem with a system.

2.2.2. top (v. 2.0.x)

top is the Swiss army knife of Linux system-monitoring tools. It does a good job of putting a very large amount of system-wide performance information on a single screen. What it displays can also be changed interactively, so if a particular problem creeps up as your system runs, you can modify what top is showing you.

By default, top presents a list, in decreasing order, of the top CPU-consuming processes. This enables you to quickly pinpoint which program is hogging the CPU. Periodically, top updates the list based on a delay that you can specify. (It is initially 3 seconds.)

2.2.2.1 CPU Performance-Related Options

top is invoked with the following command line:


top [d delay] [C] [H] [i] [n iter] [b]


top actually takes options in two modes: command-line options and runtime options. The command-line options determine how top displays its information. Table 2-3 shows the command-line options that influence the type and frequency of the performance statistics that top displays.

Table 2-3. top Command-Line Options

Option

Explanation

d delay

Delay between statistic updates.

n iterations

Number of iterations before exiting. top updates the statistics iterations times.

i

Don't display processes that aren't using any of the CPU.

H

Show all the individual threads of an application rather than just display a total for each application.

C

In a hyperthreaded or SMP system, display the summed CPU statistics rather than the statistics for each CPU.
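
Combining these options, a sketch such as the following samples every 5 seconds for 12 iterations while hiding idle processes (exact option parsing varies among top versions):

top d 5 n 12 i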


As you run top, you might want to fine-tune what you are observing to investigate a particular problem. The output of top is highly customizable. Table 2-4 describes options that change statistics shown during top's runtime.

Table 2-4. top Runtime Options

Option

Explanation

f or F

This displays a configuration screen that enables you to select which process statistics display on the screen.

o or O

This displays a configuration screen that enables you to change the order of the displayed statistics.


The options described in Table 2-5 turn on or off the display of various system-wide information. It can be helpful to turn off unneeded statistics to fit more processes on the screen.

Table 2-5. top Runtime Output Toggles

Option

Explanation

l

This toggles whether the load average and uptime information will be updated and displayed.

t

This toggles the display of how each CPU spends its time. It also toggles information about how many processes are currently running.

m

This toggles whether information about the system memory usage will be shown on the screen.


By default, the highest CPU consumers are displayed first; however, it might be more useful to sort by other characteristics. Table 2-6 describes the different sorting modes that top supports. Sorting by memory consumption is particularly useful for figuring out which process consumes the most memory.

Table 2-6. top Output Sorting/Display Options

Option

Explanation

P

Sorts the tasks by their CPU usage. The highest CPU user displays first.

T

Sorts the tasks by the amount of CPU time they have used so far. The highest amount displays first.

N

Sorts the tasks by their PID. The lowest PID displays first.

A

Sorts the tasks by their age. The newest PID is shown first. This is usually the opposite of "sort by PID."

i

Hides tasks that are idle and are not consuming CPU.


top provides system-wide information in addition to information about specific processes. Table 2-7 covers these statistics.

Table 2-7. top Performance Statistics

Statistic

Explanation

us

CPU time spent in user applications.

sy

CPU time spent in the kernel.

ni

CPU time spent in "nice"ed processes.

id

CPU time spent idle.

wa

CPU time spent waiting for I/O.

hi

CPU time spent in the irq handlers.

si

CPU time spent in the softirq handlers.

load average

The 1-minute, 5-minute, and 15-minute load average.

%CPU

The percentage of CPU that a particular process is consuming.

PRI

The priority value of the process, where a higher value indicates a higher priority. RT indicates that the task has real-time priority, a priority higher than the standard range.

NI

The nice value of the process. The higher the nice value, the less often the system runs the process. Processes with high nice values tend to have very low priorities.

WCHAN

If a process is waiting on an I/O, this shows which kernel function it is waiting in.

STAT

This is the current status of a process, where the process is either sleeping (S), running (R), zombied (killed but not yet dead) (Z), in an uninterruptible sleep (D), or being traced (T).

TIME

The total amount of CPU time (user and system) that this process has used since it started executing.

COMMAND

The command that this process is executing.

LC

The number of the last CPU that this process was executing on.

FLAGS

The task flags of the process (see linux/sched.h).


top provides a large amount of information about the different running processes and is a great way to figure out which process is a resource hog.

2.2.2.2 Example Usage

Listing 2.6 is an example run of top. Once it starts, it periodically updates the screen until you exit it. This demonstrates some of the system-wide statistics that top can generate. First, we see the load average of the system over the past 1, 5, and 15 minutes. As we can see, the system has started to get busy recently (because of doom.x86). One CPU is busy with user code 90 percent of the time, whereas the other is spending only ~13 percent of its time in user code. Finally, we can see that 73 of the processes are sleeping and only 3 are currently running.

Listing 2.6.

catan> top

 08:09:16  up 2 days, 18:44,  4 users,  load average: 0.95, 0.44, 0.17

76 processes: 73 sleeping, 3 running, 0 zombie, 0 stopped

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle

           total   51.5%    0.0%    3.9%   0.0%     0.0%    0.0%   44.6%

           cpu00   90.0%    0.0%    1.2%   0.0%     0.0%    0.0%    8.8%

           cpu01   13.0%    0.0%    6.6%   0.0%     0.0%    0.0%   80.4%

Mem:  2037140k av, 1132120k used,  905020k free,       0k shrd,   86220k buff

       689784k active,             151528k inactive

Swap: 2040244k av,       0k used, 2040244k free                 322648k cached





  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND

 7642 root      25   0  647M 379M  7664 R    49.9 19.0   2:58   0 doom.x86

 7661 ezolt     15   0  1372 1372  1052 R     0.1  0.0   0:00   1 top

    1 root      15   0   528  528   452 S     0.0  0.0   0:05   1 init

    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0

    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1

    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd

    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0

    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1

    9 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush

    7 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kswapd

    8 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kscand

   10 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kupdated

   11 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd


Now pressing F while top is running brings up the configuration screen shown in Listing 2.7. When you press the keys indicated (A for PID, B for PPID, and so on), top toggles whether these statistics display on the previous screen. When all the desired statistics are selected, press Enter to return to top's initial screen, which now shows the current values of the selected statistics. When you are configuring the statistics, all currently activated fields are capitalized in the Current Field Order line and have an asterisk (*) next to their name.

Listing 2.7.

[ezolt@wintermute doc]$ top

(press 'F' while running)

   Current Field Order: AbcDgHIjklMnoTP|qrsuzyV{EFW[X

Toggle fields with a-z, any other key to return:

* A: PID        = Process Id

  B: PPID       = Parent Process Id

  C: UID        = User Id

* D: USER       = User Name

* E: %CPU       = CPU Usage

* F: %MEM       = Memory Usage

  G: TTY        = Controlling tty

* H: PRI        = Priority

* I: NI         = Nice Value

  J: PAGEIN     = Page Fault Count

  K: TSIZE      = Code Size (kb)

  L: DSIZE      = Data+Stack Size (kb)

* M: SIZE       = Virtual Image Size (kb)

  N: TRS        = Resident Text Size (kb)

  O: SWAP       = Swapped kb

* P: SHARE      = Shared Pages (kb)

  Q: A          = Accessed Page count

  R: WP         = Write Protected Pages

  S: D          = Dirty Pages

* T: RSS        = Resident Set Size (kb)

  U: WCHAN      = Sleeping in Function

* V: STAT       = Process Status

* W: TIME       = CPU Time

* X: COMMAND    = Command

  Y: LC         = Last used CPU (expect this to change regularly)

  Z: FLAGS      = Task Flags (see linux/sched.h)


To show you how customizable top is, Listing 2.8 shows a highly configured output screen, which shows only the top options relevant to CPU usage.

Listing 2.8.

 08:16:23  up 2 days, 18:52,  4 users,  load average: 1.07, 0.92, 0.49

76 processes: 73 sleeping, 3 running, 0 zombie, 0 stopped

CPU states:  cpu    user    nice  system    irq  softirq  iowait     idle

           total   48.2%    0.0%    1.5%   0.0%     0.0%    0.0%    50.1%

           cpu00    0.3%    0.0%    0.1%   0.0%     0.0%    0.0%    99.5%

           cpu01   96.2%    0.0%    2.9%   0.0%     0.0%    0.0%     0.7%

Mem:  2037140k av, 1133548k used,  903592k free,       0k shrd,    86232k buff

       690812k active,             151536k inactive

Swap: 2040244k av,       0k used, 2040244k free                 322656k cached





  PID USER     PRI  NI WCHAN        FLAGS  LC STAT %CPU  TIME CPU COMMAND

 7642 root      25   0             100100   1 R    49.6 10:30   1 doom.x86

    1 root      15   0             400100   0 S     0.0  0:05   0 init

    2 root      RT   0                140   0 SW    0.0  0:00   0 migration/0

    3 root      RT   0                140   1 SW    0.0  0:00   1 migration/1

    4 root      15   0                 40   0 SW    0.0  0:00   0 keventd

    5 root      34  19                 40   0 SWN   0.0  0:00   0 ksoftirqd/0

    6 root      34  19                 40   1 SWN   0.0  0:00   1 ksoftirqd/1

    9 root      25   0                 40   0 SW    0.0  0:00   0 bdflush

    7 root      15   0                840   0 SW    0.0  0:00   0 kswapd

    8 root      15   0                 40   0 SW    0.0  0:00   0 kscand

   10 root      15   0                 40   0 SW    0.0  0:00   0 kupdated

   11 root      25   0                 40   0 SW    0.0  0:00   0 mdrecoveryd

   20 root      15   0             400040   0 SW    0.0  0:00   0 katad-1


top provides an overview of system resource usage with a focus on providing information about how various processes are consuming those resources. It is best used when interacting with the system directly because of the user-friendly and tool-unfriendly format of its output.

2.2.3. top (v. 3.x.x)

The version of top provided by the most recent distributions has been completely overhauled, and as a result, many of the command-line and interaction options have changed. Although the basic ideas are similar, it has been streamlined, and a few different display modes have been added.

Again, top presents a list, in decreasing order, of the top CPU-consuming processes.

2.2.3.1 CPU Performance-Related Options

top is invoked with the following command line:


top [-d delay] [-n iter] [-i] [-b]


top actually takes options in two modes: command-line options and runtime options. The command-line options determine how top displays its information. Table 2-8 shows the command-line options that influence the type and frequency of the performance statistics that top will display.

Table 2-8. top Command-Line Options

Option

Explanation

-d delay

Delay between statistic updates.

-n iterations

Number of iterations before exiting. top updates the statistics iterations times.

-i

This option toggles whether idle processes are displayed.

-b

Run in batch mode. Typically, top shows only a single screenful of information, and processes that don't fit on the screen never display. This option shows all the processes and can be very useful if you are saving top's output to a file or piping the output to another command for processing.
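
For example, the following sketch (the output filename is illustrative) captures a single batch-mode snapshot of every process to a file:

top -b -n 1 > /tmp/top-snapshot.txt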


As you run top, you may want to fine-tune what you are observing to investigate a particular problem. Like the 2.x version of top, the output of top is highly customizable. Table 2-9 describes options that change statistics shown during top's runtime.

Table 2-9. top Runtime Options

Option

Explanation

A

This displays an "alternate" display of process information that shows top consumers of various system resources.

I

This toggles whether top will divide the CPU usage by the number of CPUs on the system. For example, if a process was consuming all of both CPUs on a two-CPU system, this toggles whether top displays a CPU usage of 100% or 200%.

f

This displays a configuration screen that enables you to select which process statistics display on the screen.

o

This displays a configuration screen that enables you to change the order of the displayed statistics.


The options described in Table 2-10 turn on or off the display of various system-wide information. It can be helpful to turn off unneeded statistics to fit more processes on the screen.

Table 2-10. top Runtime Output Toggles

Option

Explanation

1 (numeral 1)

This toggles whether the CPU usage will be broken down to the individual usage or shown as a total.

l

This toggles whether the load average and uptime information will be updated and displayed.


top v3.x provides system-wide information in addition to information about specific processes, similar to top v2.x. Table 2-11 covers these statistics.

Table 2-11. top Performance Statistics

Statistic

Explanation

us

CPU time spent in user applications.

sy

CPU time spent in the kernel.

ni

CPU time spent in "nice"ed processes.

id

CPU time spent idle.

wa

CPU time spent waiting for I/O.

hi

CPU time spent in the irq handlers.

si

CPU time spent in the softirq handlers.

load average

The 1-minute, 5-minute, and 15-minute load average.

%CPU

The percentage of CPU that a particular process is consuming.

PR

The priority value of the process, where a higher value indicates a higher priority. RT indicates that the task has real-time priority, a priority higher than the standard range.

NI

The nice value of the process. The higher the nice value, the less often the system runs the process. Processes with high nice values tend to have very low priorities.

WCHAN

If a process is waiting on an I/O, this shows which kernel function it is waiting in.

TIME

The total amount of CPU time (user and system) that this process has used since it started executing.

COMMAND

The command that this process is executing.

S

This is the current status of a process, where the process is either sleeping (S), running (R), zombied (killed but not yet dead) (Z), in an uninterruptible sleep (D), or being traced (T).


top provides a large amount of information about the different running processes and is a great way to figure out which process is a resource hog. Version 3 of top has been trimmed down and adds some alternative views of similar data.

2.2.3.2 Example Usage

Listing 2.9 is an example run of top v3.0. Again, it will periodically update the screen until you exit it. The statistics are similar to those of top v2.x, but are named slightly differently.

Listing 2.9.

catan> top

top - 08:52:21 up 19 days, 21:38, 17 users,   load average: 1.06, 1.13, 1.15

Tasks: 149 total,   1 running, 146 sleeping,    1 stopped,   1 zombie

Cpu(s):  0.8% us,  0.4% sy,  4.2% ni, 94.2% id,   0.1% wa,  0.0% hi,  0.3% si

Mem:   1034320k total,  1023188k used,    11132k  free,    39920k buffers

Swap:  2040244k total,   214496k used,  1825748k  free,   335488k cached



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND

26364 root      16   0  400m  68m 321m S  3.8  6.8  379:32.04 X

26737 ezolt     15   0 71288  45m  21m S  1.9  4.5    6:32.04 gnome-terminal

29114 ezolt     15   0 34000  22m  18m S  1.9  2.2   27:57.62 gnome-system-mo

 9581 ezolt     15   0  2808 1028 1784 R  1.9  0.1    0:00.03 top

    1 root      16   0  2396  448 1316 S  0.0  0.0    0:01.68 init

    2 root      RT   0     0    0    0 S  0.0  0.0    0:00.68 migration/0

    3 root      34  19     0    0    0 S  0.0  0.0    0:00.01 ksoftirqd/0

    4 root      RT   0     0    0    0 S  0.0  0.0    0:00.27 migration/1

    5 root      34  19     0    0    0 S  0.0  0.0    0:00.01 ksoftirqd/1

    6 root      RT   0     0    0    0 S  0.0  0.0    0:22.49 migration/2

    7 root      34  19     0    0    0 S  0.0  0.0    0:00.01 ksoftirqd/2

    8 root      RT   0     0    0    0 S  0.0  0.0    0:37.53 migration/3

    9 root      34  19     0    0    0 S  0.0  0.0    0:00.01 ksoftirqd/3

   10 root       5 -10     0    0    0 S  0.0  0.0    0:01.74 events/0

   11 root       5 -10     0    0    0 S  0.0  0.0    0:02.77 events/1

   12 root       5 -10     0    0    0 S  0.0  0.0    0:01.79 events/2


Now pressing f while top is running brings up the configuration screen shown in Listing 2.10. When you press the keys indicated (A for PID, b for PPID, and so on), top toggles whether these statistics display on the previous screen. When all the desired statistics are selected, press Enter to return to top's initial screen, which now shows the current values of the selected statistics. When you are configuring the statistics, all currently activated fields are capitalized in the Current Fields line and have an asterisk (*) next to their name. Notice that most of the statistics are similar, but their names have changed slightly.

Listing 2.10.

(press 'f' while running)

Current Fields:  AEHIOQTWKNMbcdfgjplrsuvyzX  for window 1:Def

Toggle fields via field letter, type any other key to return





* A: PID        = Process Id                 u: nFLT       = Page Fault count

* E: USER       = User Name                  v: nDRT       = Dirty Pages count

* H: PR         = Priority                   y: WCHAN      = Sleeping in Function

* I: NI         = Nice value                 z: Flags      = Task Flags <sched.h>

* O: VIRT       = Virtual Image (kb)       * X: COMMAND    = Command name/line

* Q: RES        = Resident size (kb)

* T: SHR        = Shared Mem size (kb)     Flags field:

* W: S          = Process Status             0x00000001  PF_ALIGNWARN

* K: %CPU       = CPU usage                  0x00000002  PF_STARTING

* N: %MEM       = Memory usage (RES)         0x00000004  PF_EXITING

* M: TIME+      = CPU Time, hundredths       0x00000040  PF_FORKNOEXEC

  b: PPID       = Parent Process Pid         0x00000100  PF_SUPERPRIV

  c: RUSER      = Real user name             0x00000200  PF_DUMPCORE

  d: UID        = User Id                    0x00000400  PF_SIGNALED

  f: GROUP      = Group Name                 0x00000800  PF_MEMALLOC

  g: TTY        = Controlling Tty            0x00002000  PF_FREE_PAGES (2.5)

  j: #C         = Last used cpu (SMP)        0x00008000  debug flag (2.5)

  p: SWAP       = Swapped size (kb)          0x00024000  special threads (2.5)

  l: TIME       = CPU Time                   0x001D0000  special states (2.5)

  r: CODE       = Code size (kb)             0x00100000  PF_USEDFPU (thru 2.4)

  s: DATA       = Data+Stack size (kb)


Listing 2.11 shows the new output mode of top, where many different statistics are sorted and displayed on the same screen.

Listing 2.11.

(press 'A' while running)

1:Def - 09:00:48 up 19 days, 21:46, 17 users,  load average: 1.01, 1.06, 1.10

Tasks: 144 total,   1 running, 141 sleeping,   1 stopped,   1 zombie

Cpu(s):  1.2% us,  0.9% sy,  0.0% ni, 97.9% id,  0.0% wa,  0.0% hi,  0.0% si

Mem:   1034320k total,  1024020k used,    10300k free,    39408k buffers

Swap:  2040244k total,   214496k used,  1825748k free,   335764k cached





1  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

 29114 ezolt     16   0 34112  22m  18m S  3.6  2.2  28:15.06 gnome-system-mo

 26364 root      15   0  400m  68m 321m S  2.6  6.8  380:01.09 X

  9689 ezolt     16   0  3104 1092 1784 R  1.0  0.1    0:00.09 top

2  PID  PPID    TIME+  %CPU %MEM  PR  NI S  VIRT SWAP   RES  UID COMMAND

 30403 24989   0:00.03  0.0  0.1  15   0 S  5808 4356  1452 9336 bash

 29510 29505   7:19.59  0.0  5.9  16   0 S  125m  65m   59m 9336 firefox-bin

 29505 29488   0:00.00  0.0  0.1  16   0 S  5652 4576  1076 9336 run-mozilla.sh

3  PID %MEM  VIRT SWAP  RES CODE DATA  SHR nFLT nDRT S  PR  NI %CPU COMMAND

  8414 25.0  374m 121m 252m  496 373m  98m 1547    0 S  16   0  0.0 soffice.bin

 26364  6.8  400m 331m  68m 1696 398m 321m 2399    0 S  15   0  2.6 X

 29510  5.9  125m  65m  59m   64 125m  31m  253    0 S  16   0  0.0 firefox-bin

 26429  4.7 59760  10m  47m  404  57m  12m 1247    0 S  15   0  0.0 metacity

4  PID  PPID  UID USER     RUSER    TTY         TIME+  %CPU %MEM S COMMAND

    1371   1   43 xfs      xfs      ?          0:00.10  0.0  0.1 S xfs

    1313   1   51 smmsp    smmsp    ?          0:00.08  0.0  0.2 S sendmail

     982   1   29 rpcuser  rpcuser  ?          0:00.07  0.0  0.1 S rpc.statd

     963   1   32 rpc      rpc      ?          0:06.23  0.0  0.1 S portmap


top v3.x provides a slightly cleaner interface to top. It simplifies some aspects of it and provides a nice "summary" information screen that displays many of the resource consumers in the system.

2.2.4. procinfo (Display Info from the /proc File System)

Much like vmstat, procinfo provides a view of the system-wide performance characteristics. Although some of the information that it provides is similar to that of vmstat, it also provides information about the number of interrupts that the CPU received for each device. Its output format is a little more readable than vmstat's, but it takes up much more screen space.

2.2.4.1 CPU Performance-Related Options

procinfo is invoked with the following command:


procinfo [-f] [-d] [-D] [-n sec] [-F file]


Table 2-12 describes the different options that change the output and the frequency of the samples that procinfo displays.

Table 2-12. procinfo Command-Line Options

Option

Explanation

-f

Runs procinfo in full-screen mode

-d

Displays statistics change between samples rather than totals

-D

Displays statistic totals rather than rate of change

-n sec

Number of seconds to pause between each sample

-F file

Sends the output of procinfo to a file


Table 2-13 shows the CPU statistics that procinfo gathers.

Table 2-13. procinfo CPU Statistics

Statistic

Explanation

user

This is the amount of user time that the CPU has spent in days, hours, and minutes.

nice

This is the amount of nice time that the CPU has spent in days, hours, and minutes.

system

This is the amount of system time that the CPU has spent in days, hours, and minutes.

idle

This is the amount of idle time that the CPU has spent in days, hours, and minutes.

irq 0-N

This displays the number of the irq, the number of times it has fired, and which kernel driver is responsible for it.


Much like vmstat or top, procinfo is a low-overhead command that is good to leave running in a console or window on the screen. It gives a good indication of a system's health and performance.

2.2.4.2 Example Usage

Calling procinfo without any command options yields output similar to Listing 2.12. Without any options, procinfo displays only one screenful of status and then exits. procinfo is more useful when it is periodically updated using the -n sec option. This enables you to see how the system's performance is changing in real time.

Listing 2.12.

[ezolt@scrffy ~/mail]$ procinfo

Linux 2.4.18-3bigmem (bhcompile@daffy) (gcc 2.96 20000731 ) #1 4CPU [scrffy]



Memory:      Total        Used        Free      Shared     Buffers       Cached

Mem:       1030784      987776       43008           0       35996       517504

Swap:      2040244       17480     2022764

Bootup: Thu Jun  3 09:20:22 2004     Load average: 0.47 0.32 0.26 1/118 10378





user  :       3:18:53.99   2.7%   page in :  1994292  disk 1:       20r       0w

nice  :       0:00:22.91   0.0%   page out:  2437543  disk 2:   247231r  131696w

system:       3:45:41.20   3.1%   swap in :      996

idle  :   4d 15:56:17.10  94.0%   swap out:     4374

uptime:   1d  5:45:18.80          context : 64608366





irq  0:  10711880 timer                  irq 12:   1319185 PS/2 Mouse

irq  1:     94931 keyboard               irq 14:   7144432 ide0

irq  2:         0 cascade [4]            irq 16:        16 aic7xxx

irq  3:         1                        irq 18:   4152504 nvidia

irq  4:         1                        irq 19:         0 usb-uhci

irq  6:         2                        irq 20:   4772275 es1371

irq  7:         1                        irq 22:    384919 aic7xxx

irq  8:         1 rtc                    irq 23:   3797246 usb-uhci, eth0


As you can see from Listing 2.12, procinfo provides a reasonable overview of the system. We can see from the user, nice, system, and idle times that, once again, the system is not very busy. One interesting thing to notice is that procinfo claims that the system has spent more idle time than the system has been running (as indicated by the uptime). This is because the system actually has four CPUs, so for every day of wall time, four days of CPU time pass. The load average confirms that the system has been relatively work-free for the recent past. For the past minute, on average, the system had less than one process ready to run; a load average of .47 indicates that a single process was ready to run only 47 percent of the time. On a four-CPU system, a large amount of CPU power is going to waste.

procinfo also gives us a good view of what devices on the system are causing interrupts. We can see that the Nvidia card (nvidia), IDE controller (ide0), Ethernet device (eth0), and sound card (es1371) have a relatively high number of interrupts. This is as one would expect for a desktop workstation.

procinfo has the advantage of putting many of the system-wide performance statistics within a single screen, enabling you to see how the system is performing as a whole. It lacks details about network and disk performance, but it provides a good system-wide view of CPU and memory performance. One limitation that can be significant is that procinfo does not report the time that the CPU spends in the iowait, irq, or softirq states.
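
To watch these statistics update in place, the options from Table 2-12 can be combined, as in this sketch, which refreshes the full-screen display every two seconds and shows the change in each statistic between samples rather than the totals:

procinfo -f -n 2 -d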

2.2.5. gnome-system-monitor

gnome-system-monitor is, in many ways, a graphical counterpart of top. It enables you to graphically monitor individual processes and observe the load on the system based on the graphs that it displays.

2.2.5.1 CPU Performance-Related Options

gnome-system-monitor can be invoked from the Gnome menu. (Under Red Hat 9 and greater, this is under System Tools > System Monitor.) However, it can also be invoked using the following command:


gnome-system-monitor


gnome-system-monitor has no relevant command-line options that affect the CPU performance measurements. However, some of the statistics shown can be modified by selecting gnome-system-monitor's Edit > Preferences menu entry.

2.2.5.2 Example Usage

When you launch gnome-system-monitor, it creates a window similar to Figure 2-1. This window shows information about the amount of CPU and memory that a particular process is using. It also shows information about the parent/child relationships between each process.

Figure 2-1.


Figure 2-2 shows a graphical view of system load and memory usage. This is really what distinguishes gnome-system-monitor from top. You can easily see the current state of the system and how it compares to the previous state.

Figure 2-2.


The graphical view of data provided by gnome-system-monitor can make it easier and faster to determine the state of the system, and how its behavior changes over time. It also makes it easier to navigate the system-wide process information.

2.2.6. mpstat (Multiprocessor Stat)

mpstat is a fairly simple command that shows you how your processors are behaving over time. The biggest benefit of mpstat is that it shows the time next to the statistics, so you can look for a correlation between CPU usage and time of day.

If you have multiple CPUs or hyperthreading-enabled CPUs, mpstat can also break down CPU usage based on the processor, so you can see whether a particular processor is doing more work than the others. You can select which individual processor you want to monitor or you can ask mpstat to monitor all of them.

2.2.6.1 CPU Performance-Related Options

mpstat can be invoked using the following command line:


mpstat [ -P { cpu | ALL } ] [ delay [ count ] ]


Once again, delay specifies how often the samples will be taken, and count determines how many times it will be run. Table 2-14 describes the command-line options of mpstat.

Table 2-14. mpstat Command-Line Options

Option

Explanation

-P { cpu | ALL }

This option tells mpstat which CPUs to monitor. cpu is a number between 0 and the total number of CPUs minus 1.

delay

This specifies how long mpstat waits between samples.


mpstat provides information similar to that of the other CPU performance tools, but it allows the information to be attributed to each of the processors in a particular system. Table 2-15 describes the statistics that it reports.

Table 2-15. mpstat CPU Statistics

Statistic

Explanation

user

This is the percentage of user time that the CPU has spent during the previous sample.

nice

This is the percentage of time that the CPU has spent during the previous sample running low-priority (or nice) processes.

system

This is the percentage of system time that the CPU has spent during the previous sample.

iowait

This is the percentage of time that the CPU has spent during the previous sample waiting on I/O.

irq

This is the percentage of time that the CPU has spent during the previous sample handling interrupts.

softirq

This is the percentage of time that the CPU has spent during the previous sample handling work that needed to be done by the kernel after an interrupt has been handled.

idle

This is the percentage of time that the CPU has spent idle during the previous sample.


mpstat is a good tool for providing a breakdown of how each of the processors is performing. Because mpstat provides a per-CPU breakdown, you can identify whether one of the processors is becoming overloaded.

2.2.6.2 Example Usage

First, we ask mpstat to show us the CPU statistics for processor number 0. This is shown in Listing 2.13.

Listing 2.13.

[ezolt@scrffy sysstat-5.1.1]$ ./mpstat -P 0 1 10

Linux 2.6.8-1.521smp (scrffy)   10/20/2004





07:12:02 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle     intr/s

07:12:03 PM    0    9.80    0.00    1.96    0.98    0.00    0.00   87.25    1217.65

07:12:04 PM    0    1.01    0.00    0.00    0.00    0.00    0.00   98.99    1112.12

07:12:05 PM    0    0.99    0.00    0.00    0.00    0.00    0.00   99.01    1055.45

07:12:06 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00    1072.00

07:12:07 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00    1075.76

07:12:08 PM    0    1.00    0.00    0.00    0.00    0.00    0.00   99.00    1067.00

07:12:09 PM    0    4.90    0.00    3.92    0.00    0.00    0.98   90.20    1045.10

07:12:10 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00    1069.70

07:12:11 PM    0    0.99    0.00    0.99    0.00    0.00    0.00   98.02    1070.30

07:12:12 PM    0    3.00    0.00    4.00    0.00    0.00    0.00   93.00    1067.00

Average:       0    2.19    0.00    1.10    0.10    0.00    0.10   96.51    1085.34


Listing 2.14 shows a similar command on a very unloaded system with two hyperthreaded CPUs (four logical processors). You can see how the stats for all the CPUs are shown. One interesting observation in this output is the fact that one CPU seems to handle all the interrupts. If the system were heavily loaded with I/O, and all the interrupts were being handled by a single processor, this could be the cause of a bottleneck: one CPU is overwhelmed while the rest wait for work to do. You would be able to see this with mpstat if the processor handling all the interrupts had no idle time whereas the other processors did.

Listing 2.14.

[ezolt@scrffy sysstat-5.1.1]$ ./mpstat -P ALL 1 2

Linux 2.6.8-1.521smp (scrffy)   10/20/2004





07:13:21 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle     intr/s

07:13:22 PM  all    3.98    0.00    1.00    0.25    0.00    0.00   94.78    1322.00

07:13:22 PM    0    2.00    0.00    0.00    1.00    0.00    0.00   97.00    1137.00

07:13:22 PM    1    6.00    0.00    2.00    0.00    0.00    0.00   93.00     185.00

07:13:22 PM    2    1.00    0.00    0.00    0.00    0.00    0.00   99.00       0.00

07:13:22 PM    3    8.00    0.00    1.00    0.00    0.00    0.00   91.00       0.00





07:13:22 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle     intr/s

07:13:23 PM  all    2.00    0.00    0.50    0.00    0.00    0.00   97.50    1352.53

07:13:23 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00    1135.35

07:13:23 PM    1    6.06    0.00    2.02    0.00    0.00    0.00   92.93     193.94

07:13:23 PM    2    0.00    0.00    0.00    0.00    0.00    0.00  101.01      16.16

07:13:23 PM    3    1.01    0.00    1.01    0.00    0.00    0.00  100.00       7.07





Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle     intr/s

Average:     all    2.99    0.00    0.75    0.12    0.00    0.00   96.13    1337.19

Average:       0    1.01    0.00    0.00    0.50    0.00    0.00   98.49    1136.18

Average:       1    6.03    0.00    2.01    0.00    0.00    0.00   92.96     189.45

Average:       2    0.50    0.00    0.00    0.00    0.00    0.00  100.00       8.04

Average:       3    4.52    0.00    1.01    0.00    0.00    0.00   95.48       3.52


mpstat can be used to determine whether the CPUs are fully utilized and relatively balanced. By observing the number of interrupts each CPU is handling, it is possible to find an imbalance. Details on how to control where interrupts are routed are provided in the kernel source under Documentation/IRQ-affinity.txt.
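
As a brief illustration of that interface (a sketch; the IRQ number and mask are illustrative, and the command must be run as root), writing a CPU bitmask to an interrupt's /proc entry restricts which processors service it:

# Route IRQ 19 to CPU 1 only (bitmask 0x2); the IRQ number is illustrative
echo 2 > /proc/irq/19/smp_affinity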

2.2.7. sar (System Activity Reporter)

sar takes yet another approach to collecting system data. sar can efficiently record system performance data in binary files that can be replayed at a later date. sar is a low-overhead way to record information about how the system is performing.

The sar command can be used to record performance information, replay previously recorded information, and display real-time information about the current system. The output of the sar command can be formatted to make it easy to pipe to relational databases or to other Linux commands for processing.

2.2.7.1 CPU Performance-Related Options

sar can be invoked with the following command line:


sar [options] [ delay [ count ] ]


Although sar reports about many different areas of Linux, the statistics are of two different forms. One set of statistics is the instantaneous value at the time of the sample. The other is a rate since the last sample. Table 2-16 describes the command-line options of sar.

Table 2-16. sar Command-Line Options

Option

Explanation

-c

This reports information about how many processes are being created per second.

-I {irq | SUM | ALL | XALL}

This reports the rates that interrupts have been occurring in the system.

-P {cpu | ALL}

This option specifies which CPU the statistics should be gathered from. If this isn't specified, the system totals are reported.

-q

This reports information about the run queues and load averages of the machine.

-u

This reports information about CPU utilization of the system. (This is the default output.)

-w

This reports the number of context switches that occurred in the system.

-o filename

This specifies the name of the binary output file that will store the performance statistics.

-f filename

This specifies the name of the binary file from which sar reads previously recorded performance statistics.

delay

The amount of time to wait between samples.

count

The total number of samples to record.


sar offers a similar set (with different names) of the system-wide CPU performance statistics that we have seen in the preceding tools. The list is shown in Table 2-17.

Table 2-17. sar CPU Statistics

Statistic

Explanation

user

This is the percentage of user time that the CPU has spent during the previous sample.

nice

This is the percentage of time that the CPU has spent during the previous sample running low-priority (or nice) processes.

system

This is the percentage of system time that the CPU has spent during the previous sample.

iowait

This is the percentage of time that the CPU has spent during the previous sample waiting on I/O.

idle

This is the percentage of time that the CPU was idle during the previous sample.

runq-sz

This is the size of the run queue when the sample was taken.

plist-sz

This is the number of processes present (running, sleeping, or waiting for I/O) when the sample was taken.

ldavg-1

This is the load average for the last minute.

ldavg-5

This is the load average for the past 5 minutes.

ldavg-15

This is the load average for the past 15 minutes.

proc/s

This is the number of new processes created per second. (This is the same as the forks statistic from vmstat.)

cswch

This is the number of context switches per second.

intr/s

The number of interrupts fired per second.


One of the significant benefits of sar is that it enables you to save many different types of time-stamped system data to log files for later retrieval and review. This can prove very handy when trying to figure out why a particular machine is failing at a particular time.

2.2.7.2 Example Usage

The first command, shown in Listing 2.15, takes three samples of the CPU statistics one second apart and stores the results in the binary file /tmp/apache_test. The command produces no visual output; it just returns when it has completed.

Listing 2.15.

[ezolt@wintermute sysstat-5.0.2]$ sar -o /tmp/apache_test 1 3


After the information has been stored in the /tmp/apache_test file, we can display it in various formats. The default is human readable. This is shown in Listing 2.16. This shows similar information to the other system monitoring commands, where we can see how the processor was spending time at a particular time.

Listing 2.16.

[ezolt@wintermute sysstat-5.0.2]$ sar -f /tmp/apache_test

Linux 2.4.22-1.2149.nptl (wintermute.phil.org)  03/20/04



17:18:34          CPU     %user     %nice   %system   %iowait      %idle

17:18:35          all     90.00      0.00     10.00      0.00       0.00

17:18:36          all     95.00      0.00      5.00      0.00       0.00

17:18:37          all     92.00      0.00      6.00      0.00       2.00

Average:          all     92.33      0.00      7.00      0.00       0.67


However, sar can also output the statistics in a format that can be easily imported into a relational database, as shown in Listing 2.17. This can be useful for storing a large amount of performance data. Once the performance data has been imported into a relational database, it can be analyzed with all the tools of that database.

Listing 2.17.

[ezolt@wintermute sysstat-5.0.2]$ sar -f /tmp/apache_test  -H

wintermute.phil.org;1;2004-03-20 22:18:35 UTC;-1;90.00;0.00;10.00;0.00;0.00

wintermute.phil.org;1;2004-03-20 22:18:36 UTC;-1;95.00;0.00;5.00;0.00;0.00

wintermute.phil.org;1;2004-03-20 22:18:37 UTC;-1;92.00;0.00;6.00;0.00;2.00


Finally, sar can also output the statistics in a format that can be easily parsed by standard Linux tools such as awk, perl, python, or grep. This output, which is shown in Listing 2.18, can be fed into a script that will pull out interesting events, and possibly even analyze different trends in the data.

Listing 2.18.

[ezolt@wintermute sysstat-5.0.2]$ sar -f /tmp/apache_test  -h

wintermute.phil.org     1       1079821115      all     %user   90.00

wintermute.phil.org     1       1079821115      all     %nice   0.00

wintermute.phil.org     1       1079821115      all     %system 10.00

wintermute.phil.org     1       1079821115      all     %iowait 0.00

wintermute.phil.org     1       1079821115      all     %idle   0.00

wintermute.phil.org     1       1079821116      all     %user   95.00

wintermute.phil.org     1       1079821116      all     %nice   0.00

wintermute.phil.org     1       1079821116      all     %system 5.00

wintermute.phil.org     1       1079821116      all     %iowait 0.00

wintermute.phil.org     1       1079821116      all     %idle   0.00

wintermute.phil.org     1       1079821117      all     %user   92.00

wintermute.phil.org     1       1079821117      all     %nice   0.00

wintermute.phil.org     1       1079821117      all     %system 6.00

wintermute.phil.org     1       1079821117      all     %iowait 0.00

wintermute.phil.org     1       1079821117      all     %idle   2.00
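
For instance, a one-liner like the following sketch (the 90 percent threshold is arbitrary) prints the timestamp and value of every sample in which user time exceeded 90 percent; in the -h output, the fifth field is the statistic name and the sixth is its value:

sar -f /tmp/apache_test -h | awk '$5 == "%user" && $6 + 0 > 90 { print $3, $6 }'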


In addition to recording information in a file, sar can also be used to observe a system in real time. In the example shown in Listing 2.19, the CPU state is sampled three times with one second between them.

Listing 2.19.

[ezolt@wintermute sysstat-5.0.2]$ sar 1 3

Linux 2.4.22-1.2149.nptl (wintermute.phil.org)  03/20/04



17:27:10          CPU     %user     %nice   %system   %iowait      %idle

17:27:11          all     96.00      0.00      4.00      0.00       0.00

17:27:12          all     98.00      0.00      2.00      0.00       0.00

17:27:13          all     92.00      0.00      8.00      0.00       0.00

Average:          all     95.33      0.00      4.67      0.00       0.00


The default display's purpose is to show information about the CPU, but other information can also be displayed. For example, sar can show the number of context switches per second and the number of memory pages that have been swapped in or out. In Listing 2.20, sar samples the information two times, with one second between the samples. In this case, we ask sar to show us the total number of context switches and process creations that occur every second. We also ask sar for information about the load average. We can see in this example that this machine has 163 processes that are in memory but not running. For the past minute, on average, 1.12 processes have been ready to run.

Listing 2.20.

[ezolt@scrffy manuscript]$ sar -w -c -q 1 2

Linux 2.6.8-1.521smp (scrffy)   10/20/2004



08:23:29 PM    proc/s

08:23:30 PM      0.00



08:23:29 PM   cswch/s

08:23:30 PM    594.00



08:23:29 PM   runq-sz  plist-sz   ldavg-1    ldavg-5  ldavg-15

08:23:30 PM         0       163      1.12       1.17      1.17



08:23:30 PM    proc/s

08:23:31 PM      0.00



08:23:30 PM   cswch/s

08:23:31 PM    812.87



08:23:30 PM   runq-sz  plist-sz   ldavg-1    ldavg-5  ldavg-15

08:23:31 PM         0       163      1.12       1.17      1.17



Average:       proc/s

Average:         0.00



Average:      cswch/s

Average:       703.98



Average:      runq-sz  plist-sz   ldavg-1    ldavg-5  ldavg-15

Average:            0       163      1.12       1.17      1.17


As you can see, sar is a powerful tool that can record many different performance statistics. It provides a Linux-friendly interface that enables you to easily extract and analyze the performance data.

2.2.8. oprofile

oprofile is a performance suite that uses the performance counter hardware available in nearly all modern processors to track where CPU time is being spent on an entire system and in individual processes. In addition to measuring where CPU cycles are spent, oprofile can measure very low-level information about how the CPU is performing. Depending on the events supported by the underlying processor, it can measure such things as cache misses, branch mispredictions, memory references, and floating-point operations.

oprofile does not record every event that occurs; instead, it works with the processor's performance hardware to sample every count events, where count is a value that users specify when they start oprofile. The lower the value of count, the more accurate the results are, but the higher the overhead of oprofile. By keeping count to a reasonable value, oprofile can run with a very low overhead but still give an amazingly accurate account of the performance of the system.

Sampling is very powerful, but be careful of some nonobvious gotchas when using it. First, sampling may say that you are spending 90 percent of your time in a particular routine, but it does not say why. There are two possible causes for a high number of cycles attributed to a particular routine. First, it is possible that the routine is the bottleneck and takes a long time to execute. However, it may also be that the function takes a reasonable amount of time to execute but is called a large number of times. You can usually figure out which is the case by looking at the samples around the particularly hot line, or by instrumenting the code to count the number of calls that are made to it.

The second problem with sampling is that you are never quite sure where a function is being called from. Even if you figure out that the function is being called many times and you track down all of the functions that call it, it is not necessarily clear which function is doing the majority of the calling.

2.2.8.1 CPU Performance-Related Options

oprofile is actually a suite of pieces that work together to collect CPU performance statistics. There are three main pieces of oprofile:

  • The oprofile kernel module manipulates the processor and turns on and off sampling.

  • The oprofile daemon collects the samples and saves them to disk.

  • The oprofile reporting tools take the collected samples and show the user how they relate to the applications running on the system.

The oprofile suite hides the driver and daemon manipulation in the opcontrol command. The opcontrol command is used to select which events the processor will sample and start the sampling.

When controlling the daemon, you can invoke opcontrol using the following command line:


opcontrol [--start] [--stop] [--dump]


These options control the profiling daemon, enabling you to start and stop sampling and to dump the samples from the daemon's memory to disk. When sampling, the oprofile daemon stores a large number of samples in internal buffers. However, it is only possible to analyze the samples that have been written (or dumped) to disk. Writing to disk can be an expensive operation, so oprofile only does it periodically. As a result, after running a test and profiling with oprofile, the results may not be available immediately; you will have to wait until the daemon flushes its buffers to disk. This can be very annoying when you want to begin analysis immediately, so the opcontrol command enables you to force the dump of samples from the oprofile daemon's internal buffers to disk. This enables you to begin a performance investigation immediately after a test has completed.

Table 2-18 describes the command-line options for the opcontrol program that enable you to control the operation of the daemon.

Table 2-18. opcontrol Daemon Control

Option        Explanation

-s/--start    Starts profiling, using a default event for the current
              processor unless one has been specified.

-d/--dump     Dumps the sampling information currently in the kernel
              sample buffers to disk.

--stop        Stops the profiling.


By default, oprofile picks an event with a given frequency that is reasonable for the processor and kernel that you are running it on. However, many more events can be monitored than the default. When you are listing and selecting an event, opcontrol is invoked using the following command line:


opcontrol [--list-events] [--event=:name:count:unitmask:kernel:user:]


The event specifier enables you to select which event is going to be sampled; how frequently it will be sampled; and whether that sampling will take place in kernel space, user space, or both. Table 2-19 describes the command-line options of opcontrol that enable you to select different events to sample.

Table 2-19. opcontrol Event Handling

Option                      Explanation

-l/--list-events            Lists the different events that the processor can
                            sample.

--event=:name:count:        Specifies what events will be sampled. The event
unitmask:kernel:user:       name must be one of the events that the processor
                            supports; a valid event name can be retrieved from
                            the --list-events option. The count parameter
                            specifies that the processor will be sampled every
                            count occurrences of the event. The unitmask
                            modifies what the event is going to sample; for
                            example, if you are sampling "reads from memory,"
                            the unit mask may allow you to select only those
                            reads that did not hit in the cache. The kernel
                            parameter specifies whether oprofile should sample
                            when the processor is running in kernel space, and
                            the user parameter specifies whether it should
                            sample when the processor is running in user space.

--vmlinux=kernel            Specifies which uncompressed kernel image oprofile
                            will use to attribute samples to kernel functions.
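Putting the event syntax together, the following sketch (the kernel image path is illustrative; the DATA_MEM_REFS event and its count come from Listings 2.26 and 2.27) samples every 30,000 memory references in both kernel and user space, with no unit mask:

opcontrol --vmlinux=/boot/vmlinux            # illustrative path
opcontrol --event=DATA_MEM_REFS:30000:0:1:1  # name:count:unitmask:kernel:user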


After the samples have been collected and saved to disk, oprofile provides a different tool, opreport, which enables you to view the samples that have been collected. opreport is invoked using the following command line:


opreport [-r] [-t [percentage]]


Typically, opreport displays all the samples collected by the system and which executables (including the kernel) are responsible for them. The executables with the highest number of samples are shown first, followed by every executable that had any samples. In a typical system, most of the samples are in a handful of executables at the top of the list, with a very large number of executables each contributing a very small number of samples. To deal with this, opreport enables you to set a threshold, so that only executables with that percentage of the total samples or greater are shown. Alternatively, opreport can reverse the order of the executables that are shown, so those with the highest contribution are shown last. This way, the most important data is printed last and does not scroll off the screen.

Table 2-20 describes these command-line options of opreport that enable you to format the output of the sampling.

Table 2-20. opreport Report Format

Option                        Explanation

--reverse-sort / -r           Reverses the order of the sort. Typically, the
                              images that caused the most events display
                              first.

--threshold / -t [percentage] Causes opreport to show only images that have
                              contributed percentage or more of the total
                              samples. This can be useful when many images
                              have a very small number of samples and you are
                              interested only in the most significant ones.
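For instance, the following sketch shows only executables that contributed at least 2 percent of the samples and prints the heaviest contributors last:

opreport -t 2 -r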


Again, oprofile is a complicated tool, and these options show only the basics of what oprofile can do. You learn more about the capabilities of oprofile in later chapters.

2.2.8.2 Example Usage

oprofile is a very powerful tool, but it can also be difficult to install. Appendix B, "Installing oprofile," contains instructions on how to get oprofile installed and running on a few of the major Linux distributions.

We begin the use of oprofile by setting it up for profiling. This first command, shown in Listing 2.21, uses the opcontrol command to tell the oprofile suite where an uncompressed image of the kernel is located. oprofile needs to know the location of this file so that it can attribute samples to exact functions within the kernel.

Listing 2.21.

[root@wintermute root]# opcontrol --vmlinux=/boot/vmlinux-\

2.4.22-1.2174.nptlsmp


After we set up the path to the current kernel, we can begin profiling. The command in Listing 2.22 tells oprofile to start sampling using the default event. This event varies depending on the processor, but the default event for this processor is CPU_CLK_UNHALTED, which counts all the CPU cycles during which the processor is not halted. The 233869 in the output means that the profiler records the instruction the processor is executing once every 233,869 of these events.

Listing 2.22.

[root@wintermute root]# opcontrol -s

Using default event: CPU_CLK_UNHALTED:233869:0:1:1

Using log file /var/lib/oprofile/oprofiled.log

Daemon started.

Profiler running.


Now that sampling has started, we want to analyze the results. In Listing 2.23, we use the reporting tools to figure out what is happening on the system. opreport reports what has been profiled so far.

Listing 2.23.

[root@wintermute root]# opreport

opreport op_fatal_error:

No sample file found: try running opcontrol --dump

or specify a session containing sample files


Uh oh! Even though the profiling has been running for a little while, we are stopped short when opreport tells us that it cannot find any samples. This is because the opreport command looks for the samples on disk, but the oprofile daemon stores the samples in memory and only periodically dumps them to disk. When we ask opreport for a list of the samples, it finds none on disk. To alleviate this problem, we can force the daemon to flush the samples immediately by issuing a dump option to opcontrol, as shown in Listing 2.24. This enables us to view the samples that have been collected.

Listing 2.24.

[root@wintermute root]# opcontrol --dump


After we dump the samples to disk, we try again and ask oprofile for the report, as shown in Listing 2.25. This time, we have results. The report contains information about the processor on which it was collected and the types of events that were monitored. The report then lists, in descending order, the number of events that occurred and the executables in which they occurred. We can see that the Linux kernel is taking up 50 percent of the total cycles, emacs is taking 14 percent, and libc is taking 12 percent. It is possible to dig deeper into each executable and determine which function is taking up all the time, but that is covered in Chapter 4, "Performance Tools: Process-Specific CPU."

Listing 2.25.

[root@wintermute root]# opreport

CPU: PIII, speed 467.739 MHz (estimated)

Counted CPU_CLK_UNHALTED events (clocks processor is not halted)

with a unit mask of 0x00 (No unit mask) count 233869

     3190 50.4507 vmlinux-2.4.22-1.2174.nptlsmp

      905 14.3128 emacs

      749 11.8456 libc-2.3.2.so

      261  4.1278 ld-2.3.2.so

      244  3.8589 mpg321

      233  3.6850 insmod

      171  2.7044 libperl.so

      128  2.0244 bash

      113  1.7871 ext3.o

  ....


When we started oprofile, we just used the default event that opcontrol chose for us. However, each processor has a very rich set of events that can be monitored. In Listing 2.26, we ask opcontrol to list all the events that are available for this particular CPU. The list is quite long, but in this case, we can see that in addition to CPU_CLK_UNHALTED, we can also monitor DATA_MEM_REFS and DCU_LINES_IN. These are events related to the memory subsystem, and we investigate them in later chapters.

Listing 2.26.

[root@wintermute root]# opcontrol -l

oprofile: available events for CPU type "PIII"

See Intel Architecture Developer's Manual Volume 3, Appendix A and

Intel Architecture Optimization Reference Manual (730795-001)

CPU_CLK_UNHALTED: (counter: 0, 1)

        clocks processor is not halted (min count: 6000)

DATA_MEM_REFS: (counter: 0, 1)

        all memory references, cachable and non (min count: 500)

DCU_LINES_IN: (counter: 0, 1)

        total lines allocated in the DCU (min count: 500)

....


The command line needed to specify which events we will monitor can be cumbersome. Fortunately, we can also use oprofile's graphical oprof_start command to start and stop sampling. This enables us to select the events that we want graphically, without needing to figure out exactly how to specify them on the command line.

In the example of oprof_start shown in Figure 2-3, we tell oprofile that we want to monitor the DATA_MEM_REFS and L2_LD events at the same time. The DATA_MEM_REFS event can tell us which applications use the memory subsystem heavily, and the L2_LD event can tell us which applications make heavy use of the level 2 cache. On this particular processor, the hardware has only two counters that can be used for sampling, so only two events can be monitored simultaneously.

Figure 2-3.
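The same two-event setup can also be expressed on the command line. The following sketch uses the counts and unit masks that appear in Listing 2.27 (the 0x0f unit mask selects all cache states for L2_LD):

opcontrol --event=DATA_MEM_REFS:30000:0:1:1 --event=L2_LD:233869:0x0f:1:1
opcontrol --start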


Now that we have gathered the samples using the graphical interface to oprofile, we can analyze the data that it has collected. In Listing 2.27, we ask opreport to display the profile of samples that it has collected, much as we did when we were monitoring cycles. In this case, we can see that the libmad library accounts for 31 percent of the data memory references on the whole system and appears to be the heaviest user of the memory subsystem.

Listing 2.27.

[root@wintermute root]# opreport

CPU: PIII, speed 467.739 MHz (estimated)

Counted DATA_MEM_REFS events (all memory references, cachable and non)

with a unit mask of 0x00 (No unit mask) count 30000

Counted L2_LD events (number of L2 data loads) with a unit mask of 0x0f

(All cache states) count 233869

    87462 31.7907        17  3.8636 libmad.so.0.1.0

    24259  8.8177        10  2.2727 mpg321

    23735  8.6272        40  9.0909 libz.so.1.2.0.7

    17513  6.3656        56 12.7273 libgklayout.so

    17128  6.2257        90 20.4545 vmlinux-2.4.22-1.2174.nptlsmp

    13471  4.8964         4  0.9091 libpng12.so.0.1.2.2

    12868  4.6773        20  4.5455 libc-2.3.2.so

....


The output provided by opreport displays all the system libraries and executables that contain any of the events that we were sampling. Note that not every event has been recorded; because we are sampling, only a subset of the events that actually occurred is recorded. This is usually not a problem, because if a particular library or executable is a performance problem, it will likely cause high-cost events to happen many times. If the sampling is random, these high-cost events will eventually be caught by the sampling code.
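As a quick sanity check on the overhead, consider the sampling rate implied by Listing 2.25: on this 467.739MHz processor, a CPU_CLK_UNHALTED count of 233,869 produces at most

467,739,000 cycles/sec / 233,869 cycles/sample ≈ 2,000 samples/sec

so even a fully busy CPU generates only a few thousand samples per second, which is why the profiler's overhead stays low.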
