5.2. Memory Performance Tools

This section examines the various memory performance tools that enable you to investigate how a given application is using the memory subsystem, including the amount and different types of memory that a process is using, where it is being allocated, and how effectively the process is using the processor's cache.

5.2.1. ps

ps is an excellent command to track a process's dynamic memory usage. In addition to the CPU statistics already mentioned, ps gives detailed information about the amount of memory that the application is using and how that memory usage affects the system.

5.2.1.1 Memory Performance-Related Options

ps has many different options and can retrieve many different statistics about the state of a running application. As you saw in the previous chapter, ps can retrieve information about the CPU time that a process is consuming, but it can also retrieve information about the amount and type of memory that a process is using. It can be invoked with the following command line:

ps [-o vsz,rss,tsiz,dsiz,majflt,minflt,pmem,command] <PID>

Table 5-1 describes the different types of memory statistics that ps can display for a given PID.

As mentioned in the preceding chapter, ps is flexible in regard to how you select the group of PIDs for which statistics are displayed. ps --help provides information on how to specify different groups of PIDs.

5.2.1.2 Example Usage

Listing 5.1 shows the burn test application running on the system. We ask ps to report the memory statistics of the process.

Listing 5.1.
[ezolt@wintermute tmp]$ ps -o vsz,rss,tsiz,dsiz,majflt,minflt,cmd 10882
  VSZ   RSS TSIZ  DSIZ MAJFLT MINFLT CMD
11124 10004    1 11122     66   2465 ./burn

As Listing 5.1 shows, the burn application has a very small text size (1KB), but a very large data size (11,122KB). Of the total virtual size (11,124KB), the process has a slightly smaller resident set size (10,004KB), which represents the total amount of physical memory that the process is actually using. In addition, most of the faults generated by burn were minor faults, so most of the memory faults were due to memory allocation rather than loading in a large amount of text or data from the program image on the disk.

5.2.2. /proc/<PID>

The Linux kernel provides a virtual file system that enables you to extract information about the processes running on the system. The information provided by the /proc file system is usually only used by performance tools such as ps to extract performance data from the kernel. Although it is not normally necessary to dig through the files in /proc, it does provide some information that you cannot retrieve with other performance tools. In addition to many other statistics, /proc provides information about a process's use of memory and mapping of libraries.

5.2.2.1 Memory Performance-Related Options

The interface to /proc is straightforward. /proc provides many virtual files that you can cat to extract their information. Each running PID in the system has a subdirectory in /proc. Within this subdirectory is a series of files containing information about that PID. One of these files, status, provides information about the status of a given process PID. You can retrieve this information by using the following command:

cat /proc/<PID>/status

Table 5-2 describes the memory statistics displayed in the status file.
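To see what these statistics mean in practice, consider a small test program along the following lines (an illustrative sketch, not one of this chapter's listings). It allocates a large buffer without touching it and then writes one byte to every page. Watching it with ps -o vsz,rss,minflt or with cat /proc/<PID>/status should show the virtual size (VSZ/VmSize) jump as soon as the allocation is made, while the resident set size (RSS/VmRSS) and the minor-fault count climb only as the pages are actually written:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE (64 * 1024 * 1024)     /* 64MB buffer */
#define PAGE 4096                   /* assume 4KB pages */

int main(void)
{
    long i;
    /* The allocation alone grows the virtual size (VmSize/VSZ)... */
    char *buf = malloc(SIZE);
    if (buf == NULL)
        return 1;
    printf("allocated: pid %d -- check VmSize\n", (int) getpid());
    sleep(30);

    /* ...but the pages become resident (VmRSS/RSS) only when they are
       touched, and each first touch causes a minor fault. */
    for (i = 0; i < SIZE; i += PAGE)
        buf[i] = 1;
    printf("touched: check VmRSS and the minor fault count\n");
    sleep(30);
    return 0;
}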
Another one of the files present in the <PID> directory is the maps file. This provides information about how the process's virtual address space is used. You can retrieve it by using the following command:

cat /proc/<PID>/maps

Table 5-3 describes the fields shown in the maps file.

The information that /proc provides can help you understand how an application is allocating memory and which libraries it is using.

5.2.2.2 Example Usage

Listing 5.2 shows the burn test application running on the system. First, we use ps to find the PID (4540) of burn. Then we extract the process's memory statistics using the /proc status file.

Listing 5.2.
[ezolt@wintermute tmp]$ ps aux | grep burn
ezolt     4540  0.4  2.6 11124 10004 pts/0   T    08:26   0:00 ./burn
ezolt     4563  0.0  0.1  1624   464 pts/0   S    08:29   0:00 grep burn
[ezolt@wintermute tmp]$ cat /proc/4540/status
Name:   burn
State:  T (stopped)
Tgid:   4540
Pid:    4540
PPid:   1514
TracerPid:      0
Uid:    501     501     501     501
Gid:    501     501     501     501
FDSize: 256
Groups: 501 9 502
VmSize:    11124 kB
VmLck:         0 kB
VmRSS:     10004 kB
VmData:     9776 kB
VmStk:         8 kB
VmExe:         4 kB
VmLib:      1312 kB
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000

As Listing 5.2 shows, once again we see that the burn application has a very small text size (4KB) and stack size (8KB), but a very large data size (9,776KB) and a reasonably sized library size (1,312KB). The small text size means that the process does not have much executable code, whereas the moderate library size means that it is using a library to support its execution. The small stack size means that the process is not calling deeply nested functions or functions that use large or many temporary variables. The VmLck size of 0KB means that the process has not locked any pages into memory, making them unswappable. The VmRSS size of 10,004KB means that the application is currently using 10,004KB of physical memory, although it has allocated or mapped the full VmSize of 11,124KB. If the application begins to use memory that it has allocated but is not currently using, the VmRSS size increases while the VmSize stays unchanged.

As noted previously, the application's VmLib size is nonzero, so it is using a library. In Listing 5.3, we look at the process's maps to see the exact libraries it is using.

Listing 5.3.
[ezolt@wintermute test_app]$ cat /proc/4540/maps
08048000-08049000 r-xp 00000000 21:03 393730     /tmp/burn
08049000-0804a000 rw-p 00000000 21:03 393730     /tmp/burn
0804a000-089d3000 rwxp 00000000 00:00 0
40000000-40015000 r-xp 00000000 21:03 1147263    /lib/ld-2.3.2.so
40015000-40016000 rw-p 00015000 21:03 1147263    /lib/ld-2.3.2.so
4002e000-4002f000 rw-p 00000000 00:00 0
4002f000-40162000 r-xp 00000000 21:03 2031811    /lib/tls/libc-2.3.2.so
40162000-40166000 rw-p 00132000 21:03 2031811    /lib/tls/libc-2.3.2.so
40166000-40168000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0

As you see in Listing 5.3, the burn application is using two libraries: ld and libc. The text section (denoted by the permission r-xp) of libc has a range of 0x4002f000 through 0x40162000, a size of 0x133000 or 1,257,472 bytes. The data section (denoted by permission rw-p) of libc has a range of 0x40162000 through 0x40166000, a size of 0x4000 or 16,384 bytes. The text size of libc is bigger than ld's text size of 0x15000 or 86,016 bytes. The data size of libc is also bigger than ld's data size of 0x1000 or 4,096 bytes. libc is the big library that burn is linking in.

/proc proves to be a useful way to extract performance statistics directly from the kernel. Because the statistics are text based, you can use the standard Linux tools to access them.
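Because the format of the status file is plain text, it is also easy to pull a statistic out of it from inside a program rather than with cat or grep. The following sketch (a hypothetical example, not one of the chapter's listings) prints just the Vm* lines for the current process by reading /proc/self/status:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    /* /proc/self/status always refers to the calling process. */
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return 1;
    while (fgets(line, sizeof(line), f) != NULL) {
        /* Print only the memory-related lines (VmSize, VmRSS, VmData, ...). */
        if (strncmp(line, "Vm", 2) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}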
5.2.3. memprof

memprof is a graphical memory-usage profiling tool. It shows how a program is allocating memory as it runs. memprof shows the amount of memory your application is consuming and which functions are responsible for the memory consumption. In addition, memprof can show which code paths are responsible for memory usage. For example, if a function foo() allocates no memory, but calls a function bar(), which allocates a large amount of memory, memprof shows you how much foo() itself used and how much all the functions that foo() called used. memprof updates this information dynamically as the application is running.

5.2.3.1 Memory Performance-Related Options

memprof is a graphical application, but it has a few command-line options that modify its execution. It is invoked with the following command:

memprof [--follow-fork] [--follow-exec] application

memprof profiles the given "application" and creates a graphical display of its memory usage. Although memprof can be run on any application, it can provide more information if the application and the libraries that it relies on are compiled with debugging symbols.

Table 5-4 describes the options that manipulate the behavior of memprof if it is monitoring an application that calls fork or exec. This normally happens when an application launches a new process or executes a new command.

Once invoked, memprof creates a window with a series of menus and options that enable you to select an application that you are going to profile.

5.2.3.2 Example Usage

Suppose that I have the example code in Listing 5.4 and I want to profile it. In this application, which I call memory_eater, the function foo() does not allocate any memory, but it calls the function bar(), which does.

Listing 5.4.
#include <stdlib.h>
void bar(void)
{
  malloc(10000);
}
void foo(void)
{
  int i;
  for (i=0; i<100;i++)
    bar();
}
int main()
{
  foo();
  while(1);
}

After compiling this application with the -g3 flag (so that the application has symbols included), we use memprof to profile this application:

[ezolt@localhost example]$ memprof ./memory_eater
memintercept (3965): _MEMPROF_SOCKET = /tmp/memprof.Bm1AKu
memintercept (3965): New process, operation = NEW, old_pid = 0

memprof creates the application window shown in Figure 5-1. As you can see, it shows memory usage information about the memory_eater application, as well as a series of buttons and menus that enable you to manipulate the profile.

Figure 5-1.

If you click the Profile button, memprof shows the memory profile of the application. The first information box in Figure 5-2 shows how much memory each function is consuming (denoted by "self"), as well as the sum of the memory that the function and its children are consuming (denoted by "total"). As expected, the foo() function does not allocate any memory, so its self value is 0, whereas its total value is 1,000,000 (100 calls to bar(), each allocating 10,000 bytes), because it is calling a function that does allocate memory.

Figure 5-2.

The children and callers information boxes change as you click different functions in the top box. This way, you can see which functions of an application are using memory.

memprof provides a way to graphically traverse through a large amount of data about memory allocation. It provides an easy way to determine the memory allocation of a given function and of each function that it calls.
5.2.4. valgrind (cachegrind)

valgrind is a powerful tool that enables you to debug tricky memory-management errors. Although valgrind is mainly a developer's tool, it also has a "skin" that can show processor cache usage. valgrind simulates the current processor and runs the application in this virtual processor while tracking memory usage. It can also simulate the processor cache and pinpoint where a program is hitting and missing in the instruction and data caches. Although very useful, its cache statistics are inexact (because valgrind is only a simulation of the processor rather than an actual piece of hardware). valgrind will not account for cache misses normally caused by system calls to the Linux kernel or cache misses that happen because of context switching. In addition, valgrind runs applications at a much slower speed than a natively executing program. However, valgrind provides a great first approximation of the cache usage of an application. valgrind can be run on any executable; if the program has been compiled with symbols (passed as -g3 to gcc when compiling), however, it will be able to pinpoint the exact line of code responsible for the cache usage.

5.2.4.1 Memory Performance-Related Options

When using valgrind to analyze cache usage for a particular application, there are two phases: collection and annotation. The collection phase starts with the following command line:

valgrind --skin=cachegrind application

valgrind is a flexible tool that has a few different "skins" that allow it to perform different types of analysis. During the collection phase, valgrind uses the cachegrind skin to collect information about cache usage. The application in the preceding command line represents the application to profile. The collection phase prints summary information to the screen, but it also saves more detailed statistics in a file named cachegrind.out.pid, where pid is the PID of the profiled application as it was running.

When the collection phase is complete, the command cg_annotate is used to map the cache usage back to the application source code. cg_annotate is invoked in the following way:

cg_annotate --pid [--auto=yes|no]

cg_annotate takes the information generated by valgrind and uses it to annotate the application that was profiled. The --pid option is required, where pid is the PID of the profile that you are interested in. By default, cg_annotate just shows cache usage at the function level. If you set --auto=yes, cache usage displays at the source-line level.

5.2.4.2 Example Usage

This example shows valgrind (v2.0) running on a simple application. The application clears a large area of memory and then calls two functions, a() and b(), each of which touches this memory. Function a() touches the memory ten times as often as function b().
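The source of this test application is not shown here, but it can be reconstructed from the auto-annotated output that cg_annotate produces later in this section (Listing 5.7). Roughly, it looks like the following sketch; treat it as an approximation rather than the exact file used for these runs:

#include <string.h>

#define ITER 100
#define SZ   10000000
#define STRI 10000

char test[SZ];

void a(void)
{
  int i=0,j=0;
  for(j=0;j<10*ITER; j++)
    for(i=0;i<SZ;i=i+STRI)
    {
      test[i]++;
    }
}

void b(void)
{
  int i=0,j=0;
  for (j=0;j<ITER; j++)
    for (i=0;i<SZ;i=i+STRI)
    {
      test[i]++;
    }
}

int main()
{
  /* Arbitrary value */
  memset(test, 42, SZ);
  a();
  b();
  return 0;
}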
First, as shown in Listing 5.5, we run valgrind on the application using the cachegrind skin.

Listing 5.5.
[ezolt@wintermute test_app]$ valgrind --skin=cachegrind ./burn
==25571== Cachegrind, an I1/D1/L2 cache profiler for x86-linux.
==25571== Copyright (C) 2002-2003, and GNU GPL'd, by Nicholas Nethercote.
==25571== Using valgrind-2.0.0, a program supervision framework for x86-linux.
==25571== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
==25571== Estimated CPU clock rate is 468 MHz
==25571== For more details, rerun with: -v
==25571==
==25571==
==25571== I   refs:      11,317,111
==25571== I1  misses:           215
==25571== L2i misses:           214
==25571== I1  miss rate:        0.0%
==25571== L2i miss rate:        0.0%
==25571==
==25571== D   refs:       6,908,012  (4,405,958 rd + 2,502,054 wr)
==25571== D1  misses:     1,412,821  (1,100,287 rd +   312,534 wr)
==25571== L2d misses:       313,810  (    1,276 rd +   312,534 wr)
==25571== D1  miss rate:       20.4% (     24.9%   +      12.4%  )
==25571== L2d miss rate:        4.5% (      0.0%   +      12.4%  )
==25571==
==25571== L2 refs:        1,413,036  (1,100,502 rd +   312,534 wr)
==25571== L2 misses:        314,024  (    1,490 rd +   312,534 wr)
==25571== L2 miss rate:         1.7% (      0.0%   +      12.4%  )

In the run of Listing 5.5, the application executed 11,317,111 instructions; this is shown by the I refs statistic. The process had an astonishingly low number of misses in both the L1 (215) and L2 (214) instruction caches, as denoted by an I1 and L2i miss rate of 0.0 percent. The process had a total of 6,908,012 data references, of which 4,405,958 were reads and 2,502,054 were writes. 24.9 percent of the reads and 12.4 percent of the writes could not be satisfied by the L1 cache. Luckily, we can almost always satisfy the reads in the L2 data cache; they show a miss rate of 0.0 percent. The writes are still a problem, with a miss rate of 12.4 percent. In this application, memory access of the data is the problem to investigate.

The ideal application would have a very low number of instruction cache and data cache misses. To eliminate instruction cache misses, it may be possible to recompile the application with different compiler options or to trim code, so that the hot code does not have to share icache space with code that is not used often. To eliminate data cache misses, use arrays for data structures rather than linked lists if possible, reduce the size of elements in data structures, and access memory in a cache-friendly way. In any event, valgrind helps to point out which accesses/data structures should be optimized. This application run summary shows that data accesses are the main problem.
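The advice about access patterns is easy to see in action. The following sketch (my own illustration, not one of the chapter's listings) walks the same large array twice, once sequentially and once with a stride of one page; run under the cachegrind skin, the strided walk should show a far higher D1 miss rate, because by the time it revisits a cache line that line has long been evicted, whereas the sequential walk misses only once per 32-byte line:

#include <string.h>

#define SZ     (10 * 1024 * 1024)   /* 10MB test array */
#define STRIDE 4096                 /* one page per step */

static char data[SZ];

/* Sequential walk: consecutive bytes share cache lines, so most
   accesses hit in the L1 data cache. */
static void walk_sequential(void)
{
    int i;
    for (i = 0; i < SZ; i++)
        data[i]++;
}

/* Strided walk: each access touches a different cache line, and a
   line is not revisited until thousands of other lines have been
   touched, so nearly every access misses. */
static void walk_strided(void)
{
    int i, j;
    for (j = 0; j < STRIDE; j++)
        for (i = j; i < SZ; i += STRIDE)
            data[i]++;
}

int main(void)
{
    memset(data, 0, SZ);
    walk_sequential();
    walk_strided();
    return 0;
}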
As shown in Listing 5.5, this command displays cache usage statistics for the overall run. However, when developing an application, or investigating a performance problem, it is often more interesting to see where cache misses happen rather than just the totals during an application's runtime. To determine which functions are responsible for the cache misses, we run cg_annotate, as shown in Listing 5.6. This shows us which functions are responsible for which cache misses. As we expect, the function a() has ten times the data read misses (1,000,000) of the function b() (100,000).

Listing 5.6.
[ezolt@wintermute test_app]$ cg_annotate --25571
--------------------------------------------------------------------------------
I1 cache:   16384 B, 32 B, 4-way associative
D1 cache:   16384 B, 32 B, 4-way associative
L2 cache:   131072 B, 32 B, 4-way associative
Command:    ./burn
Events recorded:  Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:     Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:
Auto-annotation:  off
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw
--------------------------------------------------------------------------------
11,317,111  215  214  4,405,958  1,100,287 1,276 2,502,054 312,534 312,534  PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw     file:function
--------------------------------------------------------------------------------
8,009,011   2    2    4,003,003  1,000,000 989   1,004     0       0        burn.c:a
2,500,019   3    3    6          1         1     2,500,001 312,500 312,500  ???:__GI_memset
800,911     2    2    400,303    100,000   0     104       0       0        burn.c:b

Although a per-function breakdown of cache misses is useful, it would be interesting to see which lines within the application are actually causing the cache misses. If we use the --auto option, as demonstrated in Listing 5.7, cg_annotate tells us exactly which line is responsible for each miss.

Listing 5.7.
[ezolt@wintermute test_app]$ cg_annotate --25571 --auto=yes
--------------------------------------------------------------------------------
I1 cache:   16384 B, 32 B, 4-way associative
D1 cache:   16384 B, 32 B, 4-way associative
L2 cache:   131072 B, 32 B, 4-way associative
Command:    ./burn
Events recorded:  Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:     Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:
Auto-annotation:  on
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw
--------------------------------------------------------------------------------
11,317,111  215  214  4,405,958  1,100,287 1,276 2,502,054 312,534 312,534  PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw     file:function
--------------------------------------------------------------------------------
8,009,011   2    2    4,003,003  1,000,000 989   1,004     0       0        burn.c:a
2,500,019   3    3    6          1         1     2,500,001 312,500 312,500  ???:__GI_memset
800,911     2    2    400,303    100,000   0     104       0       0        burn.c:b
--------------------------------------------------------------------------------
-- Auto-annotated source: burn.c
--------------------------------------------------------------------------------
Ir        I1mr I2mr Dr        D1mr      D2mr Dw    D1mw D2mw
-- line 2 ----------------------------------------
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    #define ITER 100
.         .    .    .         .         .    .     .    .    #define SZ 10000000
.         .    .    .         .         .    .     .    .    #define STRI 10000
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    char test[SZ];
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    void a(void)
3         0    0    .         .         .    1     0    0    {
2         0    0    .         .         .    2     0    0      int i=0,j=0;
5,004     1    1    2,001     0         0    1     0    0      for(j=0;j<10*ITER ; j++)
5,004,000 0    0    2,001,000 0         0    1,000 0    0        for(i=0;i<SZ;i=i+STRI)
.         .    .    .         .         .    .     .    .        {
3,000,000 1    1    2,000,000 1,000,000 989  .     .    .          test[i]++;
.         .    .    .         .         .    .     .    .        }
2         0    0    2         0         0    .     .    .    }
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    void b(void)
3         1    1    .         .         .    1     0    0    {
2         0    0    .         .         .    2     0    0      int i=0,j=0;
504       0    0    201       0         0    1     0    0      for (j=0;j<ITER; j++)
500,400   1    1    200,100   0         0    100   0    0        for (i=0;i<SZ;i=i+STRI)
.         .    .    .         .         .    .     .    .        {
300,000   0    0    200,000   100,000   0    .     .    .          test[i]++;
.         .    .    .         .         .    .     .    .        }
2         0    0    2         0         0    .     .    .    }
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    main()
6         2    2    .         .         .    1     0    0    {
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .      /* Arbitrary value*/
6         0    0    .         .         .    4     0    0      memset(test, 42, SZ);
1         0    0    .         .         .    1     0    0      a();
1         0    0    .         .         .    1     0    0      b();
2         0    0    2         1         1    .     .    .    }
--------------------------------------------------------------------------------
Ir  I1mr I2mr Dr   D1mr D2mr Dw  D1mw D2mw
--------------------------------------------------------------------------------
78  3    3    100  100  78   0   0    0    percentage of events annotated

As Listing 5.7 shows, we have a line-by-line breakdown of where the different cache misses and hits are occurring. We can see that the inner for loops have almost all the data references. As we expect, the for loop in the a() function has ten times those of the b() function.

The different levels of detail (program level, function level, and line level) that valgrind/cachegrind provides can give you a good idea of which parts of an application are accessing memory and effectively using the processor caches.

5.2.5. kcachegrind

kcachegrind works intimately with valgrind to provide detailed information about the cache usage of a profiled application. It adds two new pieces of functionality on top of standard valgrind. First, it provides a skin for valgrind, called calltree, that captures both cache and call-tree statistics for a particular application. Second, it provides a graphical exploration of the cache performance information and innovative visualizations of the data.

5.2.5.1 Memory Performance-Related Options

Similar to valgrind, when using kcachegrind to analyze cache usage for a particular application, there are two phases: collection and annotation. The collection phase starts with the following command line:

calltree application

The calltree command accepts many different options to manipulate the information to be collected. Table 5-5 shows some of the more important options. calltree can record many different statistics; refer to calltree's help option for more details.

During the collection phase, valgrind uses the calltree skin to collect information about cache usage. The application in the preceding command line represents the application to profile. During the collection phase, calltree prints summary information to the screen and, identically to cachegrind, it also saves more detailed statistics in a file named cachegrind.out.pid, where pid is the PID of the profiled application as it was running.

When the collection phase is complete, the command kcachegrind is used to map the cache usage back to the application source code. kcachegrind is invoked in the following way:

kcachegrind cachegrind.out.pid

kcachegrind displays the cache profile statistics that have been collected and enables you to navigate through the results.

5.2.5.2 Example Usage

The first step in using kcachegrind is to compile the application with symbols to allow sample-to-source line mappings. This is done with the following command:

[ezolt@wintermute test_app]$ gcc -o burn burn.c -g3
Next, run calltree against that application, as shown in Listing 5.8. This provides output similar to cachegrind but, most importantly, it generates a cachegrind.out file, which will be used by kcachegrind.

Listing 5.8.
[ezolt@wintermute test_app]$ calltree --dump-instr=yes --trace-jump=yes ./burn
==12242== Calltree-0.9.7, a call-graph generating cache profiler for x86-linux.
==12242== Copyright (C) 2002-2004, and GNU GPL'd, by N.Nethercote and J.Weidendorfer.
==12242== Using valgrind-2.0.0, a program supervision framework for x86-linux.
==12242== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
==12242== Estimated CPU clock rate is 469 MHz
==12242== For more details, rerun with: -v
==12242==
==12242==
==12242== I   refs:      33,808,151
==12242== I1  misses:           216
==12242== L2i misses:           215
==12242== I1  miss rate:        0.0%
==12242== L2i miss rate:        0.0%
==12242==
==12242== D   refs:      29,404,027  (4,402,969 rd + 25,001,058 wr)
==12242== D1  misses:     4,225,324  (1,100,290 rd +  3,125,034 wr)
==12242== L2d misses:     4,225,324  (1,100,290 rd +  3,125,034 wr)
==12242== D1  miss rate:       14.3% (     24.9%   +       12.4%  )
==12242== L2d miss rate:       14.3% (     24.9%   +       12.4%  )
==12242==
==12242== L2 refs:        4,225,540  (1,100,506 rd +  3,125,034 wr)
==12242== L2 misses:      4,225,539  (1,100,505 rd +  3,125,034 wr)
==12242== L2 miss rate:         6.6% (      2.8%   +       12.4%  )

When we have the cachegrind.out file, we can start kcachegrind (v0.54) to analyze the data by using the following command:

[ezolt@wintermute test_app]$ kcachegrind cachegrind.out.12242

This brings up the window shown in Figure 5-3. The window shows a flat profile of all the cache misses in the left pane. By default, data read misses from the L1 cache are shown.

Figure 5-3.

Next, in Figure 5-4, in the upper-right pane, we can see a visualization of the callee map, or all the functions (a() and b()) that the function selected in the left pane (main) calls. In the lower-right pane, we can see the application's call graph.

Figure 5-4.

Finally, in Figure 5-5, we select a different function to examine in the left pane. We also select a different event to examine (instruction fetches) using the upper-right pane. Finally, we can visualize the loops in the assembly code using the lower-right pane.

Figure 5-5.

These examples barely scratch the surface of what kcachegrind can do, and the best way to learn about it is to try it. kcachegrind is an extraordinarily useful tool for those who like to investigate performance issues visually.

5.2.6. oprofile (III)

As you have seen in previous chapters, oprofile is a powerful tool that can help to determine where application time is spent. However, oprofile can also work with the processor's performance counters to provide an accurate view of how it performs. Whereas cachegrind simulates the CPU to find cache misses/hits, oprofile uses the actual CPU performance counters to pinpoint the location of cache misses. Unlike cachegrind, oprofile does not simulate the processor, so the cache effects that it records are real; those caused by the operating system will also be visible. In addition, an application analyzed under oprofile runs at near-native speed, whereas cachegrind takes much longer. However, cachegrind is easier to set up and use than oprofile, and it can always track the same events no matter what version of x86 processor it runs on. cachegrind is a good tool to use when you need a quick and pretty accurate answer, whereas oprofile proves useful when you need more accurate statistics or statistics that cachegrind does not provide.
5.2.6.1 Memory Performance-Related Options

This discussion of oprofile does not add any new command-line options because they have already been described in the section outlining CPU performance. However, one command becomes more important as you start to sample events other than oprofile's defaults. Different processors and architectures can sample different sets of events, and oprofile's op_help command displays the list of events that your current processor supports.

5.2.6.2 Example Usage

As mentioned previously, the events that oprofile can monitor are processor specific, so these examples are run on my current machine, which is a Pentium-III. On the Pentium-III, we can use the performance counters exposed by oprofile to gather information similar to what valgrind provided with the cachegrind skin. This uses the performance counter hardware rather than software simulation.

Using the performance hardware presents two pitfalls. First, we must deal with the underlying hardware limitations of the performance counters. On the Pentium-III, oprofile can only measure two events simultaneously, whereas cachegrind could measure many types of memory events simultaneously. This means that for oprofile to measure the same events as cachegrind, we must run the application multiple times and change the events that oprofile monitors during each run. The second pitfall is that oprofile does not provide an exact count of the events the way cachegrind does; it only samples the counters, so we will be able to see where the events most likely occur, but we will not be able to see the exact number. In fact, we receive only about (1/sample rate) of the events as samples; for example, with an event count setting of 23,386, as used for DCU_LINES_IN in Listing 5.9 later in this section, oprofile records roughly one sample for every 23,386 occurrences of that event. If an application only causes an event to happen a few times, it might not be recorded at all. Although it can be frustrating not to know the exact number of events when debugging a performance problem, it is usually most important to figure out the relative number of samples between code lines. Even though you will not be able to directly determine the total number of events that occurred at a particular source line, you will be able to figure out the line with the most events, and that is usually enough to start debugging a performance problem. This inability to retrieve exact event counts is present in every form of CPU sampling and will be present on any processor performance hardware that implements it. However, it is really this limitation that allows the performance hardware to exist at all; without it, the performance monitoring would cause too much overhead.

oprofile can monitor many different types of events, which vary on different processors. This section outlines some of the important events on the Pentium-III as an example. The Pentium-III can monitor the L1 data cache (called the DCU on the Pentium-III) by using the following events (events and descriptions provided by op_help):

[ezolt@localhost book]$ op_help
oprofile: available events for CPU type "PIII"

See Intel Architecture Developer's Manual Volume 3, Appendix A and
Intel Architecture Optimization Reference Manual (730795-001)
....
DCU_LINES_IN: total lines allocated in the DCU
DCU_M_LINES_IN: number of M state lines allocated in DCU
DCU_M_LINES_OUT: number of M lines evicted from the DCU
DCU_MISS_OUTSTANDING: number of cycles while DCU miss outstanding
...

Notice that we do not have an exact correspondence to what cachegrind provided, which was the number of "reads and writes" to the L1 data cache.
However, we can figure out how many L1 data cache lines each function pulled in by using the DCU_LINES_IN event. Although this event does not tell us the exact number of misses that each function had, it should tell us how much each function missed in the cache relative to the others.

The events monitoring the L2 data cache are a little closer, but they still do not correspond exactly to what cachegrind was providing. Here are the relevant L2 data cache events for the Pentium-III:

[ezolt@localhost book]$ op_help
.....
L2_LD: number of L2 data loads
        Unit masks
        ----------
        0x08: (M)odified cache state
        0x04: (E)xclusive cache state
        0x02: (S)hared cache state
        0x01: (I)nvalid cache state
        0x0f: All cache states
L2_ST: number of L2 data stores
        Unit masks
        ----------
        0x08: (M)odified cache state
        0x04: (E)xclusive cache state
        0x02: (S)hared cache state
        0x01: (I)nvalid cache state
        0x0f: All cache states
L2_LINES_IN: number of allocated lines in L2
....

The Pentium-III actually supports many more than these, but these are the basic load and store events. (cachegrind calls a load a "read" and a store a "write.") We can use these events to count the number of loads and stores that happened at each source line. We can also use L2_LINES_IN to show which pieces of code had more L2 cache misses. As mentioned earlier, we will not get an exact value. In addition, on the Pentium-III, instructions and data share the L2 cache, so any L2 misses could be the result of either instruction or data misses. oprofile shows us the combined cache misses that occurred in the L2 as a result of instruction and data misses, whereas cachegrind helpfully splits them up for us.

On the Pentium-III, there is a similar series of events to monitor the instruction cache. For the L1 instruction cache (what the Pentium-III calls the "IFU"), we can measure the number of reads (or fetches) and misses directly by using the following events:

[ezolt@localhost book]$ op_help
...
IFU_IFETCH: number of non/cachable instruction fetches
IFU_IFETCH_MISS: number of instruction fetch misses
...

We can also measure the number of instructions that were fetched from the L2 cache by using the following event:

[ezolt@localhost book]$ op_help
...
L2_IFETCH: (counter: 0, 1)
        number of L2 instruction fetches (min count: 500)
        Unit masks
        ----------
        0x08: (M)odified cache state
        0x04: (E)xclusive cache state
        0x02: (S)hared cache state
        0x01: (I)nvalid cache state
        0x0f: All cache states
...

Unfortunately, as mentioned previously, the processor shares the L2 cache between instructions and data, so there is no way to distinguish cache misses caused by data usage from those caused by instruction usage. We can use these events to approximate the same information that cachegrind provided.

If it is so difficult to extract the same information, why use oprofile at all? First, oprofile is much lower overhead (< 10 percent) and can be run on production applications. Second, oprofile measures the events that actually happen on the real processor. This is much better than a possibly inaccurate simulator that does not take into account cache usage by the operating system or other applications.

Although these events are available on the Pentium-III, they are not necessarily available on other processors. Each family of Intel and AMD processors has a different set of events that can be used to reveal different amounts of information about the performance of the memory subsystem.
op_help shows the events that can be monitored, but it might require reading Intel's or AMD's detailed processor documentation to understand what they mean.

To understand how you can use oprofile to extract cache information, compare the cache usage statistics of an application using both the virtual CPU of cachegrind and the actual CPU with oprofile. Let's run the burn program that we have been using as an example. Once again, it is an application that executes ten times the number of instructions in function a() as in function b(). It also accesses ten times the amount of data in function a() as in function b(). Here is the output of cachegrind for this demo application:

--------------------------------------------------------------------------------
Ir           I1mr I2mr Dr          D1mr        D2mr  Dw        D1mw    D2mw
--------------------------------------------------------------------------------
883,497,211  215  214  440,332,658 110,000,288 1,277 2,610,954 312,534 312,533  PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir           I1mr I2mr Dr          D1mr        D2mr  Dw        D1mw    D2mw     file:function
--------------------------------------------------------------------------------
800,900,011  2    2    400,300,003 100,000,000 989   100,004   0       0        ???:a
80,090,011   2    2    40,030,003  10,000,000  0     10,004    0       0        ???:b

In the next few examples, I run the data collection phase of oprofile multiple times. I use oprof_start to set up the events for the particular run, and then run the demo application. Because my CPU has only two counters, I have to do this multiple times, which means that different sets of events are monitored during different executions of the program. Because my application does not change how it executes from run to run, each run should produce similar results. This is not necessarily true for a more complicated application, such as a Web server or database, each of which can dramatically change how it executes based on the requests made to it. However, for the simple test application, it works just fine.

After collecting this sampling information, we use opreport to extract it. As shown in Listing 5.9, we query oprofile about the number of data memory references that were made and how many times there was a miss in the L1 data cache (DCU). As cachegrind told us, ten times the number of both memory references and L1 data misses occur in function a().

Listing 5.9.
[ezolt@wintermute test_app]$ opreport -l ./burn event:DATA_MEM_REFS,DCU_LINES_IN
CPU: PIII, speed 467.74 MHz (estimated)
Counted DATA_MEM_REFS events (all memory references, cachable and non) with a unit mask of 0x00 (No unit mask) count 233865
Counted DCU_LINES_IN events (total lines allocated in the DCU) with a unit mask of 0x00 (No unit mask) count 23386
vma      samples  %        samples  %        symbol name
08048348 3640     90.8864  5598     90.9209  a
0804839e 365       9.1136  559       9.0791  b

Now look at Listing 5.10, which shows opreport examining similar information for the instruction cache. Notice that the number of instructions executed in function a() is ten times the number executed in function b(), as we expect. However, notice that the number of L1 instruction cache misses differs from what cachegrind predicts. It is most likely that other applications and the kernel are polluting the cache and causing burn to miss in the icache. (Remember that cachegrind does not take kernel or other application cache usage into account.)
This also probably happened with the data cache, but because the number of data cache misses caused by the application was so high, the extra events were lost in the noise.

Listing 5.10.
[ezolt@wintermute test_app]$ opreport -l ./burn event:IFU_IFETCH,IFU_IFETCH_MISS
CPU: PIII, speed 467.74 MHz (estimated)
Counted IFU_IFETCH events (number of non/cachable instruction fetches) with a unit mask of 0x00 (No unit mask) count 233870
Counted IFU_IFETCH_MISS events (number of instruction fetch misses) with a unit mask of 0x00 (No unit mask) count 500
vma      samples  %        samples  %        symbol name
08048348 8876     90.9240  14       93.3333  a
0804839e 886       9.0760  1         6.6667  b

As you can see when comparing the output of cachegrind and oprofile, using oprofile to gather memory performance information is powerful because oprofile is low overhead and uses the processor hardware directly, but it can be difficult to find events that match those that you are interested in.

5.2.7. ipcs

ipcs is a tool that shows information about the system-wide interprocess communication memory. Processes can allocate system-wide shared memory, semaphores, or message queues that can be shared by multiple processes running on the system. ipcs is best used to track down which applications are allocating and using large amounts of shared memory.

5.2.7.1 Memory Performance-Related Options

ipcs is invoked with the following command line:

ipcs [-t] [-c] [-l] [-u] [-p]

If ipcs is invoked without any parameters, it gives a summary of all the shared memory on the system. This includes information about the owner and size of each shared memory segment. Table 5-6 describes options that cause ipcs to display different types of information about the shared memory in the system.

Because shared memory is used by multiple processes, it cannot be attributed to any particular process. ipcs provides enough information about the state of the system-wide shared memory to determine which processes allocated the shared memory, which processes are using it, and how often they are using it. This information proves useful when trying to reduce shared memory usage.

5.2.7.2 Example Usage

First, in Listing 5.11, we ask ipcs how much of the system memory is being used for shared memory. This is a good overall indication of the state of shared memory in the system.

Listing 5.11.
[ezolt@wintermute tmp]$ ipcs -u

------ Shared Memory Status --------
segments allocated 21
pages allocated 1585
pages resident  720
pages swapped   412
Swap performance: 0 attempts     0 successes

------ Semaphore Status --------
used arrays = 0
allocated semaphores = 0

------ Messages: Status --------
allocated queues = 0
used headers = 0
used space = 0 bytes

In this case, we can see that 21 different segments, or pieces of shared memory, have been allocated. All these segments consume a total of 1,585 pages of memory; 720 of these currently reside in physical memory and 412 have been swapped to disk.
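The segments that ipcs reports are created through the System V shared memory interface. As a rough sketch (with a hypothetical size chosen to mirror the segments shown below, and not one of the chapter's listings), a process creates and attaches a segment like this, after which the segment appears in the ipcs output with that process as the creator PID (cpid):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create a 393216-byte segment readable and writable by the owner. */
    int shmid = shmget(IPC_PRIVATE, 393216, IPC_CREAT | 0600);
    if (shmid < 0) {
        perror("shmget");
        return 1;
    }

    /* Attach it to our address space; other processes can attach the
       same shmid to share the memory. */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (char *) -1) {
        perror("shmat");
        return 1;
    }
    mem[0] = 42;                      /* use the shared memory */
    printf("created shmid %d\n", shmid);

    shmdt(mem);                       /* detach */
    /* Mark the segment for removal once every process has detached. */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}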
Next, in Listing 5.12, we ask ipcs for a general overview of all the shared memory segments in the system. This indicates who is using each memory segment. In this case, we see a list of all the shared segments. For one in particular, the one with a shared memory ID of 65538, the user (ezolt) is the owner. It has a permission of 600 (a typical UNIX permission), which in this case means that only ezolt can read and write to it. It is 393,216 bytes in size, and 2 processes are attached to it.

Listing 5.12.
[ezolt@wintermute tmp]$ ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       777        49152      1
0x00000000 32769      root       777        16384      1
0x00000000 65538      ezolt      600        393216     2          dest
0x00000000 98307      ezolt      600        393216     2          dest
0x00000000 131076     ezolt      600        393216     2          dest
0x00000000 163845     ezolt      600        393216     2          dest
0x00000000 196614     ezolt      600        393216     2          dest
0x00000000 229383     ezolt      600        393216     2          dest
0x00000000 262152     ezolt      600        393216     2          dest
0x00000000 294921     ezolt      600        393216     2          dest
0x00000000 327690     ezolt      600        393216     2          dest
0x00000000 360459     ezolt      600        393216     2          dest
0x00000000 393228     ezolt      600        393216     2          dest
0x00000000 425997     ezolt      600        393216     2          dest
0x00000000 458766     ezolt      600        393216     2          dest
0x00000000 491535     ezolt      600        393216     2          dest
0x00000000 622608     ezolt      600        393216     2          dest
0x00000000 819217     root       644        110592     2          dest
0x00000000 589842     ezolt      600        393216     2          dest
0x00000000 720916     ezolt      600        12288      2          dest
0x00000000 786454     ezolt      600        12288      2          dest

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Finally, we can figure out exactly which processes created the shared memory segments and which other processes are using them, as shown in Listing 5.13. For the segment with shmid 65538, we can see that PID 1229 created it and 11954 was the last to use it.

Listing 5.13.
[ezolt@wintermute tmp]$ ipcs -p

------ Shared Memory Creator/Last-op --------
shmid      owner      cpid       lpid
0          root       1224       11954
32769      root       1224       11954
65538      ezolt      1229       11954
98307      ezolt      1229       11954
131076     ezolt      1276       11954
163845     ezolt      1276       11954
196614     ezolt      1285       11954
229383     ezolt      1301       11954
262152     ezolt      1307       11954
294921     ezolt      1309       11954
327690     ezolt      1313       11954
360459     ezolt      1305       11954
393228     ezolt      1321       11954
425997     ezolt      1321       11954
458766     ezolt      1250       11954
491535     ezolt      1250       11954
622608     ezolt      1313       11954
819217     root       1224       11914
589842     ezolt      1432       14221
720916     ezolt      1250       11954
786454     ezolt      1313       11954

------ Message Queues PIDs --------
msqid      owner      lspid      lrpid

After we have the PID responsible for the allocation and use, we can use a command such as ps -o command <PID> to track the PID back to the process name. If shared memory usage becomes a significant portion of the system total, ipcs is a good way to track down the exact programs that are creating and using the shared memory.

5.2.8. Dynamic Languages (Java, Mono)

As with the CPU performance tools, most of the tools discussed in this chapter support analysis of static languages such as C and C++. Of the tools that we investigated, only ps, /proc, and ipcs work with dynamic languages such as Java, Mono, Python, and Perl. The cache and memory-profiling tools, such as oprofile, cachegrind, and memprof, do not. As with CPU profiling, each of these languages provides custom tools to extract information about memory usage.

For Java applications, if the java command is run with the -Xrunhprof command-line option, it profiles the application's memory usage. You can find more details at http://antprof.sourceforge.net/hprof.html or by running the java command with the -Xrunhprof:help option. For Mono applications, if the mono executable is passed the --profile flag, it also profiles the memory usage of the application. You can find more details about this at http://www.go-mono.com/performance.html. Perl and Python do not appear to have similar functionality.