5.2. Memory Performance Tools

This section examines the various memory performance tools that enable you to investigate how a given application is using the memory subsystem, including the amount and different types of memory that a process is using, where it is being allocated, and how effectively the process is using the processor's cache.

5.2.1. ps

ps is an excellent command to track a process's dynamic memory usage. In addition to the CPU statistics already mentioned, ps gives detailed information about the amount of memory that the application is using and how that memory usage affects the system.

5.2.1.1 Memory Performance-Related Options

ps has many different options and can retrieve many different statistics about the state of a running application. As you saw in the previous chapter, ps can retrieve information about the CPU time that a process is consuming, but it can also retrieve information about the amount and type of memory that a process is using. It can be invoked with the following command line:

ps [-o vsz,rss,tsiz,dsiz,majflt,minflt,pmem,command] <PID>

Table 5-1 describes the different types of memory statistics that ps can display for a given PID.

As mentioned in the preceding chapter, ps is flexible in regard to how you select the group of PIDs for which statistics are displayed. ps --help provides information on how to specify different groups of PIDs.

5.2.1.2 Example Usage

Listing 5.1 shows the burn test application running on the system. We ask ps to report the memory statistics of the process.

Listing 5.1.
[ezolt@wintermute tmp]$ ps -o vsz,rss,tsiz,dsiz,majflt,minflt,cmd 10882
  VSZ   RSS TSIZ  DSIZ MAJFLT MINFLT CMD
11124 10004    1 11122     66   2465 ./burn

As Listing 5.1 shows, the burn application has a very small text size (1KB), but a very large data size (11,122KB). Of the total virtual size (11,124KB), the process has a slightly smaller resident set size (10,004KB), which represents the total amount of physical memory that the process is actually using. In addition, most of the faults generated by burn were minor faults, so most of the memory faults were due to memory allocation rather than loading in a large amount of text or data from the program image on the disk.

5.2.2. /proc/<PID>

The Linux kernel provides a virtual file system that enables you to extract information about the processes running on the system. The information provided by the /proc file system is usually only used by performance tools such as ps to extract performance data from the kernel. Although it is not normally necessary to dig through the files in /proc, it does provide some information that you cannot retrieve with other performance tools. In addition to many other statistics, /proc provides information about a process's use of memory and mapping of libraries.

5.2.2.1 Memory Performance-Related Options

The interface to /proc is straightforward. /proc provides many virtual files that you can cat to extract their information. Each running PID in the system has a subdirectory in /proc. Within this subdirectory is a series of files containing information about that PID. One of these files, status, provides information about the status of a given process PID. You can retrieve this information by using the following command:

cat /proc/<PID>/status

Table 5-2 describes the memory statistics displayed in the status file.
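To see what these statistics mean in practice, consider a small test program along the following lines (an illustrative sketch, not one of this chapter's listings). It allocates a large buffer without touching it and then writes one byte to every page. Watching it with ps -o vsz,rss,minflt or with cat /proc/<PID>/status should show the virtual size (VSZ/VmSize) jump as soon as the allocation is made, while the resident set size (RSS/VmRSS) and the minor-fault count climb only as the pages are actually written:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE (64 * 1024 * 1024)     /* 64MB buffer */
#define PAGE 4096                   /* assume 4KB pages */

int main(void)
{
    long i;
    /* The allocation alone grows the virtual size (VmSize/VSZ)... */
    char *buf = malloc(SIZE);
    if (buf == NULL)
        return 1;
    printf("allocated: pid %d -- check VmSize\n", (int) getpid());
    sleep(30);

    /* ...but the pages become resident (VmRSS/RSS) only when they are
       touched, and each first touch causes a minor fault. */
    for (i = 0; i < SIZE; i += PAGE)
        buf[i] = 1;
    printf("touched: check VmRSS and the minor fault count\n");
    sleep(30);
    return 0;
}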
Another one of the files present in the <PID> directory is the maps file. This provides information about how the process's virtual address space is used. You can retrieve it by using the following command:

cat /proc/<PID>/maps

Table 5-3 describes the fields shown in the maps file.

The information that /proc provides can help you understand how an application is allocating memory and which libraries it is using.

5.2.2.2 Example Usage

Listing 5.2 shows the burn test application running on the system. First, we use ps to find the PID (4540) of burn. Then we extract the process's memory statistics using the /proc status file.

Listing 5.2.
[ezolt@wintermute tmp]$ ps aux | grep burn
ezolt     4540  0.4  2.6 11124 10004 pts/0   T    08:26   0:00 ./burn
ezolt     4563  0.0  0.1  1624   464 pts/0   S    08:29   0:00 grep burn
[ezolt@wintermute tmp]$ cat /proc/4540/status
Name:   burn
State:  T (stopped)
Tgid:   4540
Pid:    4540
PPid:   1514
TracerPid:      0
Uid:    501     501     501     501
Gid:    501     501     501     501
FDSize: 256
Groups: 501 9 502
VmSize:    11124 kB
VmLck:         0 kB
VmRSS:     10004 kB
VmData:     9776 kB
VmStk:         8 kB
VmExe:         4 kB
VmLib:      1312 kB
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000

As Listing 5.2 shows, once again we see that the burn application has a very small text size (4KB) and stack size (8KB), but a very large data size (9,776KB) and a reasonably sized library size (1,312KB). The small text size means that the process does not have much executable code, whereas the moderate library size means that it is using a library to support its execution. The small stack size means that the process is not calling deeply nested functions or functions that use large or many temporary variables. The VmLck size of 0KB means that the process has not locked any pages into memory, making them unswappable. The VmRSS size of 10,004KB means that the application is currently using 10,004KB of physical memory, although it has allocated or mapped the full VmSize of 11,124KB. If the application begins to use memory that it has allocated but is not currently using, the VmRSS size increases while the VmSize stays unchanged.

As noted previously, the application's VmLib size is nonzero, so it is using a library. In Listing 5.3, we look at the process's maps to see the exact libraries it is using.

Listing 5.3.
[ezolt@wintermute test_app]$ cat /proc/4540/maps
08048000-08049000 r-xp 00000000 21:03 393730     /tmp/burn
08049000-0804a000 rw-p 00000000 21:03 393730     /tmp/burn
0804a000-089d3000 rwxp 00000000 00:00 0
40000000-40015000 r-xp 00000000 21:03 1147263    /lib/ld-2.3.2.so
40015000-40016000 rw-p 00015000 21:03 1147263    /lib/ld-2.3.2.so
4002e000-4002f000 rw-p 00000000 00:00 0
4002f000-40162000 r-xp 00000000 21:03 2031811    /lib/tls/libc-2.3.2.so
40162000-40166000 rw-p 00132000 21:03 2031811    /lib/tls/libc-2.3.2.so
40166000-40168000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0

As you see in Listing 5.3, the burn application is using two libraries: ld and libc. The text section (denoted by the permission r-xp) of libc has a range of 0x4002f000 through 0x40162000, a size of 0x133000 or 1,257,472 bytes. The data section (denoted by permission rw-p) of libc has a range of 0x40162000 through 0x40166000, a size of 0x4000 or 16,384 bytes. The text size of libc is bigger than ld's text size of 0x15000 or 86,016 bytes. The data size of libc is also bigger than ld's data size of 0x1000 or 4,096 bytes. libc is the big library that burn is linking in.

/proc proves to be a useful way to extract performance statistics directly from the kernel. Because the statistics are text based, you can use the standard Linux tools to access them.
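Because the format of the status file is plain text, it is also easy to pull a statistic out of it from inside a program rather than with cat or grep. The following sketch (a hypothetical example, not one of the chapter's listings) prints just the Vm* lines for the current process by reading /proc/self/status:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    /* /proc/self/status always refers to the calling process. */
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return 1;
    while (fgets(line, sizeof(line), f) != NULL) {
        /* Print only the memory-related lines (VmSize, VmRSS, VmData, ...). */
        if (strncmp(line, "Vm", 2) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}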
5.2.3. memprof

memprof is a graphical memory-usage profiling tool. It shows how a program is allocating memory as it runs. memprof shows the amount of memory your application is consuming and which functions are responsible for the memory consumption. In addition, memprof can show which code paths are responsible for memory usage. For example, if a function foo() allocates no memory, but calls a function bar(), which allocates a large amount of memory, memprof shows you how much foo() itself used and how much all the functions that foo() called used. memprof updates this information dynamically as the application is running.

5.2.3.1 Memory Performance-Related Options

memprof is a graphical application, but it has a few command-line options that modify its execution. It is invoked with the following command:

memprof [--follow-fork] [--follow-exec] application

memprof profiles the given "application" and creates a graphical display of its memory usage. Although memprof can be run on any application, it can provide more information if the application and the libraries that it relies on are compiled with debugging symbols.

Table 5-4 describes the options that manipulate the behavior of memprof if it is monitoring an application that calls fork or exec. This normally happens when an application launches a new process or executes a new command.

Once invoked, memprof creates a window with a series of menus and options that enable you to select an application that you are going to profile.

5.2.3.2 Example Usage

Suppose that I have the example code in Listing 5.4 and I want to profile it. In this application, which I call memory_eater, the function foo() does not allocate any memory, but it calls the function bar(), which does.

Listing 5.4.
#include <stdlib.h>
void bar(void)
{
  malloc(10000);
}
void foo(void)
{
  int i;
  for (i=0; i<100;i++)
    bar();
}
int main()
{
  foo();
  while(1);
}

After compiling this application with the -g3 flag (so that the application has symbols included), we use memprof to profile this application:

[ezolt@localhost example]$ memprof ./memory_eater
memintercept (3965): _MEMPROF_SOCKET = /tmp/memprof.Bm1AKu
memintercept (3965): New process, operation = NEW, old_pid = 0

memprof creates the application window shown in Figure 5-1. As you can see, it shows memory usage information about the memory_eater application, as well as a series of buttons and menus that enable you to manipulate the profile.

Figure 5-1.

If you click the Profile button, memprof shows the memory profile of the application. The first information box in Figure 5-2 shows how much memory each function is consuming (denoted by "self"), as well as the sum of the memory that the function and its children are consuming (denoted by "total"). As expected, the foo() function does not allocate any memory, so its self value is 0, whereas its total value is 1,000,000 (100 calls to bar(), each allocating 10,000 bytes), because it is calling a function that does allocate memory.

Figure 5-2.

The children and callers information boxes change as you click different functions in the top box. This way, you can see which functions of an application are using memory.

memprof provides a way to graphically traverse through a large amount of data about memory allocation. It provides an easy way to determine the memory allocation of a given function and of each function that it calls.
5.2.4. valgrind (cachegrind)

valgrind is a powerful tool that enables you to debug tricky memory-management errors. Although valgrind is mainly a developer's tool, it also has a "skin" that can show processor cache usage. valgrind simulates the current processor and runs the application in this virtual processor while tracking memory usage. It can also simulate the processor cache and pinpoint where a program is hitting and missing in the instruction and data caches. Although very useful, its cache statistics are inexact (because valgrind is only a simulation of the processor rather than an actual piece of hardware). valgrind will not account for cache misses normally caused by system calls to the Linux kernel or cache misses that happen because of context switching. In addition, valgrind runs applications at a much slower speed than a natively executing program. However, valgrind provides a great first approximation of the cache usage of an application. valgrind can be run on any executable; if the program has been compiled with symbols (passed as -g3 to gcc when compiling), however, it will be able to pinpoint the exact line of code responsible for the cache usage.

5.2.4.1 Memory Performance-Related Options

When using valgrind to analyze cache usage for a particular application, there are two phases: collection and annotation. The collection phase starts with the following command line:

valgrind --skin=cachegrind application

valgrind is a flexible tool that has a few different "skins" that allow it to perform different types of analysis. During the collection phase, valgrind uses the cachegrind skin to collect information about cache usage. The application in the preceding command line represents the application to profile. The collection phase prints summary information to the screen, but it also saves more detailed statistics in a file named cachegrind.out.pid, where pid is the PID of the profiled application as it was running.

When the collection phase is complete, the command cg_annotate is used to map the cache usage back to the application source code. cg_annotate is invoked in the following way:

cg_annotate --pid [--auto=yes|no]

cg_annotate takes the information generated by valgrind and uses it to annotate the application that was profiled. The --pid option is required, where pid is the PID of the profile that you are interested in. By default, cg_annotate just shows cache usage at the function level. If you set --auto=yes, cache usage displays at the source-line level.

5.2.4.2 Example Usage

This example shows valgrind (v2.0) running on a simple application. The application clears a large area of memory and then calls two functions, a() and b(), each of which touches this memory. Function a() touches the memory ten times as often as function b().
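The source of this test application is not shown here, but it can be reconstructed from the auto-annotated output that cg_annotate produces later in this section (Listing 5.7). Roughly, it looks like the following sketch; treat it as an approximation rather than the exact file used for these runs:

#include <string.h>

#define ITER 100
#define SZ   10000000
#define STRI 10000

char test[SZ];

void a(void)
{
  int i=0,j=0;
  for(j=0;j<10*ITER; j++)
    for(i=0;i<SZ;i=i+STRI)
    {
      test[i]++;
    }
}

void b(void)
{
  int i=0,j=0;
  for (j=0;j<ITER; j++)
    for (i=0;i<SZ;i=i+STRI)
    {
      test[i]++;
    }
}

int main()
{
  /* Arbitrary value */
  memset(test, 42, SZ);
  a();
  b();
  return 0;
}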
First, as shown in Listing 5.5, we run valgrind on the application using the cachegrind skin.

Listing 5.5.
[ezolt@wintermute test_app]$ valgrind --skin=cachegrind ./burn
==25571== Cachegrind, an I1/D1/L2 cache profiler for x86-linux.
==25571== Copyright (C) 2002-2003, and GNU GPL'd, by Nicholas Nethercote.
==25571== Using valgrind-2.0.0, a program supervision framework for x86-linux.
==25571== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
==25571== Estimated CPU clock rate is 468 MHz
==25571== For more details, rerun with: -v
==25571==
==25571==
==25571== I   refs:      11,317,111
==25571== I1  misses:           215
==25571== L2i misses:           214
==25571== I1  miss rate:        0.0%
==25571== L2i miss rate:        0.0%
==25571==
==25571== D   refs:       6,908,012  (4,405,958 rd + 2,502,054 wr)
==25571== D1  misses:     1,412,821  (1,100,287 rd +   312,534 wr)
==25571== L2d misses:       313,810  (    1,276 rd +   312,534 wr)
==25571== D1  miss rate:       20.4% (     24.9%   +      12.4%  )
==25571== L2d miss rate:        4.5% (      0.0%   +      12.4%  )
==25571==
==25571== L2 refs:        1,413,036  (1,100,502 rd +   312,534 wr)
==25571== L2 misses:        314,024  (    1,490 rd +   312,534 wr)
==25571== L2 miss rate:         1.7% (      0.0%   +      12.4%  )

In the run of Listing 5.5, the application executed 11,317,111 instructions; this is shown by the I refs statistic. The process had an astonishingly low number of misses in both the L1 (215) and L2 (214) instruction caches, as denoted by an I1 and L2i miss rate of 0.0 percent. The process had a total of 6,908,012 data references, of which 4,405,958 were reads and 2,502,054 were writes. 24.9 percent of the reads and 12.4 percent of the writes could not be satisfied by the L1 cache. Luckily, we can almost always satisfy the reads in the L2 data cache; they show a miss rate of 0.0 percent. The writes are still a problem, with a miss rate of 12.4 percent. In this application, memory access of the data is the problem to investigate.

The ideal application would have a very low number of instruction cache and data cache misses. To eliminate instruction cache misses, it may be possible to recompile the application with different compiler options or to trim code, so that the hot code does not have to share icache space with code that is not used often. To eliminate data cache misses, use arrays for data structures rather than linked lists if possible, reduce the size of elements in data structures, and access memory in a cache-friendly way. In any event, valgrind helps to point out which accesses/data structures should be optimized. This application run summary shows that data accesses are the main problem.
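The advice about access patterns is easy to see in action. The following sketch (my own illustration, not one of the chapter's listings) walks the same large array twice, once sequentially and once with a stride of one page; run under the cachegrind skin, the strided walk should show a far higher D1 miss rate, because by the time it revisits a cache line that line has long been evicted, whereas the sequential walk misses only once per 32-byte line:

#include <string.h>

#define SZ     (10 * 1024 * 1024)   /* 10MB test array */
#define STRIDE 4096                 /* one page per step */

static char data[SZ];

/* Sequential walk: consecutive bytes share cache lines, so most
   accesses hit in the L1 data cache. */
static void walk_sequential(void)
{
    int i;
    for (i = 0; i < SZ; i++)
        data[i]++;
}

/* Strided walk: each access touches a different cache line, and a
   line is not revisited until thousands of other lines have been
   touched, so nearly every access misses. */
static void walk_strided(void)
{
    int i, j;
    for (j = 0; j < STRIDE; j++)
        for (i = j; i < SZ; i += STRIDE)
            data[i]++;
}

int main(void)
{
    memset(data, 0, SZ);
    walk_sequential();
    walk_strided();
    return 0;
}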
As shown in Listing 5.5, this command displays cache usage statistics for the overall run. However, when developing an application, or investigating a performance problem, it is often more interesting to see where cache misses happen rather than just the totals during an application's runtime. To determine which functions are responsible for the cache misses, we run cg_annotate, as shown in Listing 5.6. This shows us which functions are responsible for which cache misses. As we expect, the function a() has ten times the data read misses (1,000,000) of the function b() (100,000).

Listing 5.6.
[ezolt@wintermute test_app]$ cg_annotate --25571
--------------------------------------------------------------------------------
I1 cache:   16384 B, 32 B, 4-way associative
D1 cache:   16384 B, 32 B, 4-way associative
L2 cache:   131072 B, 32 B, 4-way associative
Command:    ./burn
Events recorded:  Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:     Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:
Auto-annotation:  off
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw
--------------------------------------------------------------------------------
11,317,111  215  214  4,405,958  1,100,287 1,276 2,502,054 312,534 312,534  PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw     file:function
--------------------------------------------------------------------------------
8,009,011   2    2    4,003,003  1,000,000 989   1,004     0       0        burn.c:a
2,500,019   3    3    6          1         1     2,500,001 312,500 312,500  ???:__GI_memset
800,911     2    2    400,303    100,000   0     104       0       0        burn.c:b

Although a per-function breakdown of cache misses is useful, it would be interesting to see which lines within the application are actually causing the cache misses. If we use the --auto option, as demonstrated in Listing 5.7, cg_annotate tells us exactly which line is responsible for each miss.

Listing 5.7.
[ezolt@wintermute test_app]$ cg_annotate --25571 --auto=yes
--------------------------------------------------------------------------------
I1 cache:   16384 B, 32 B, 4-way associative
D1 cache:   16384 B, 32 B, 4-way associative
L2 cache:   131072 B, 32 B, 4-way associative
Command:    ./burn
Events recorded:  Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:     Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:
Auto-annotation:  on
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw
--------------------------------------------------------------------------------
11,317,111  215  214  4,405,958  1,100,287 1,276 2,502,054 312,534 312,534  PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir          I1mr I2mr Dr         D1mr      D2mr  Dw        D1mw    D2mw     file:function
--------------------------------------------------------------------------------
8,009,011   2    2    4,003,003  1,000,000 989   1,004     0       0        burn.c:a
2,500,019   3    3    6          1         1     2,500,001 312,500 312,500  ???:__GI_memset
800,911     2    2    400,303    100,000   0     104       0       0        burn.c:b
--------------------------------------------------------------------------------
-- Auto-annotated source: burn.c
--------------------------------------------------------------------------------
Ir        I1mr I2mr Dr        D1mr      D2mr Dw    D1mw D2mw
-- line 2 ----------------------------------------
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    #define ITER 100
.         .    .    .         .         .    .     .    .    #define SZ 10000000
.         .    .    .         .         .    .     .    .    #define STRI 10000
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    char test[SZ];
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    void a(void)
3         0    0    .         .         .    1     0    0    {
2         0    0    .         .         .    2     0    0      int i=0,j=0;
5,004     1    1    2,001     0         0    1     0    0      for(j=0;j<10*ITER ; j++)
5,004,000 0    0    2,001,000 0         0    1,000 0    0        for(i=0;i<SZ;i=i+STRI)
.         .    .    .         .         .    .     .    .        {
3,000,000 1    1    2,000,000 1,000,000 989  .     .    .          test[i]++;
.         .    .    .         .         .    .     .    .        }
2         0    0    2         0         0    .     .    .    }
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    void b(void)
3         1    1    .         .         .    1     0    0    {
2         0    0    .         .         .    2     0    0      int i=0,j=0;
504       0    0    201       0         0    1     0    0      for (j=0;j<ITER; j++)
500,400   1    1    200,100   0         0    100   0    0        for (i=0;i<SZ;i=i+STRI)
.         .    .    .         .         .    .     .    .        {
300,000   0    0    200,000   100,000   0    .     .    .          test[i]++;
.         .    .    .         .         .    .     .    .        }
2         0    0    2         0         0    .     .    .    }
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .    main()
6         2    2    .         .         .    1     0    0    {
.         .    .    .         .         .    .     .    .
.         .    .    .         .         .    .     .    .      /* Arbitrary value*/
6         0    0    .         .         .    4     0    0      memset(test, 42, SZ);
1         0    0    .         .         .    1     0    0      a();
1         0    0    .         .         .    1     0    0      b();
2         0    0    2         1         1    .     .    .    }
--------------------------------------------------------------------------------
Ir  I1mr I2mr Dr   D1mr D2mr Dw  D1mw D2mw
--------------------------------------------------------------------------------
78  3    3    100  100  78   0   0    0    percentage of events annotated

As Listing 5.7 shows, we have a line-by-line breakdown of where the different cache misses and hits are occurring. We can see that the inner for loops have almost all the data references. As we expect, the for loop in the a() function has ten times those of the b() function.

The different levels of detail (program level, function level, and line level) that valgrind/cachegrind provides can give you a good idea of which parts of an application are accessing memory and effectively using the processor caches.

5.2.5. kcachegrind

kcachegrind works intimately with valgrind to provide detailed information about the cache usage of a profiled application. It adds two new pieces of functionality on top of standard valgrind. First, it provides a skin for valgrind, called calltree, that captures both cache and call-tree statistics for a particular application. Second, it provides a graphical exploration of the cache performance information and innovative visualizations of the data.

5.2.5.1 Memory Performance-Related Options

Similar to valgrind, when using kcachegrind to analyze cache usage for a particular application, there are two phases: collection and annotation. The collection phase starts with the following command line:

calltree application

The calltree command accepts many different options to manipulate the information to be collected. Table 5-5 shows some of the more important options. calltree can record many different statistics; refer to calltree's help option for more details.

During the collection phase, valgrind uses the calltree skin to collect information about cache usage. The application in the preceding command line represents the application to profile. During the collection phase, calltree prints summary information to the screen and, identically to cachegrind, it also saves more detailed statistics in a file named cachegrind.out.pid, where pid is the PID of the profiled application as it was running.

When the collection phase is complete, the command kcachegrind is used to map the cache usage back to the application source code. kcachegrind is invoked in the following way:

kcachegrind cachegrind.out.pid

kcachegrind displays the cache profile statistics that have been collected and enables you to navigate through the results.

5.2.5.2 Example Usage

The first step in using kcachegrind is to compile the application with symbols to allow sample-to-source line mappings. This is done with the following command:

[ezolt@wintermute test_app]$ gcc -o burn burn.c -g3
Next, run calltree against that application, as shown in Listing 5.8. This provides output similar to cachegrind but, most importantly, it generates a cachegrind.out file, which will be used by kcachegrind.

Listing 5.8.
[ezolt@wintermute test_app]$ calltree --dump-instr=yes --trace-jump=yes ./burn
==12242== Calltree-0.9.7, a call-graph generating cache profiler for x86-linux.
==12242== Copyright (C) 2002-2004, and GNU GPL'd, by N.Nethercote and J.Weidendorfer.
==12242== Using valgrind-2.0.0, a program supervision framework for x86-linux.
==12242== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
==12242== Estimated CPU clock rate is 469 MHz
==12242== For more details, rerun with: -v
==12242==
==12242==
==12242== I   refs:      33,808,151
==12242== I1  misses:           216
==12242== L2i misses:           215
==12242== I1  miss rate:        0.0%
==12242== L2i miss rate:        0.0%
==12242==
==12242== D   refs:      29,404,027  (4,402,969 rd + 25,001,058 wr)
==12242== D1  misses:     4,225,324  (1,100,290 rd +  3,125,034 wr)
==12242== L2d misses:     4,225,324  (1,100,290 rd +  3,125,034 wr)
==12242== D1  miss rate:       14.3% (     24.9%   +       12.4%  )
==12242== L2d miss rate:       14.3% (     24.9%   +       12.4%  )
==12242==
==12242== L2 refs:        4,225,540  (1,100,506 rd +  3,125,034 wr)
==12242== L2 misses:      4,225,539  (1,100,505 rd +  3,125,034 wr)
==12242== L2 miss rate:         6.6% (      2.8%   +       12.4%  )

When we have the cachegrind.out file, we can start kcachegrind (v0.54) to analyze the data by using the following command:

[ezolt@wintermute test_app]$ kcachegrind cachegrind.out.12242

This brings up the window shown in Figure 5-3. The window shows a flat profile of all the cache misses in the left pane. By default, data read misses from the L1 cache are shown.

Figure 5-3.

Next, in Figure 5-4, in the upper-right pane, we can see a visualization of the callee map, or all the functions (a() and b()) that the function selected in the left pane (main) calls. In the lower-right pane, we can see the application's call graph.

Figure 5-4.

Finally, in Figure 5-5, we select a different function to examine in the left pane. We also select a different event to examine (instruction fetches) using the upper-right pane. Finally, we can visualize the loops in the assembly code using the lower-right pane.

Figure 5-5.

These examples barely scratch the surface of what kcachegrind can do, and the best way to learn about it is to try it. kcachegrind is an extraordinarily useful tool for those who like to investigate performance issues visually.

5.2.6. oprofile (III)

As you have seen in previous chapters, oprofile is a powerful tool that can help to determine where application time is spent. However, oprofile can also work with the processor's performance counters to provide an accurate view of how it performs. Whereas cachegrind simulates the CPU to find cache misses/hits, oprofile uses the actual CPU performance counters to pinpoint the location of cache misses. Unlike cachegrind, oprofile does not simulate the processor, so the cache effects that it records are real; those caused by the operating system will also be visible. In addition, an application analyzed under oprofile runs at near-native speed, whereas cachegrind takes much longer. However, cachegrind is easier to set up and use than oprofile, and it can always track the same events no matter what version of x86 processor it runs on. cachegrind is a good tool to use when you need a quick and pretty accurate answer, whereas oprofile proves useful when you need more accurate statistics or statistics that cachegrind does not provide.
5.2.6.1 Memory Performance-Related Options

This discussion of oprofile does not add any new command-line options because they have already been described in the section outlining CPU performance. However, one command becomes more important as you start to sample events other than oprofile's defaults. Different processors and architectures can sample different sets of events, and oprofile's op_help command displays the list of events that your current processor supports.

5.2.6.2 Example Usage

As mentioned previously, the events that oprofile can monitor are processor specific, so these examples are run on my current machine, which is a Pentium-III. On the Pentium-III, we can use the performance counters exposed by oprofile to gather information similar to what valgrind provided with the cachegrind skin. This uses the performance counter hardware rather than software simulation.

Using the performance hardware presents two pitfalls. First, we must deal with the underlying hardware limitations of the performance counters. On the Pentium-III, oprofile can only measure two events simultaneously, whereas cachegrind could measure many types of memory events simultaneously. This means that for oprofile to measure the same events as cachegrind, we must run the application multiple times and change the events that oprofile monitors during each run. The second pitfall is that oprofile does not provide an exact count of the events the way cachegrind does; it only samples the counters, so we will be able to see where the events most likely occur, but we will not be able to see the exact number. In fact, we receive only about (1/sample rate) of the events as samples; for example, with an event count setting of 23,386, as used for DCU_LINES_IN in Listing 5.9 later in this section, oprofile records roughly one sample for every 23,386 occurrences of that event. If an application only causes an event to happen a few times, it might not be recorded at all. Although it can be frustrating not to know the exact number of events when debugging a performance problem, it is usually most important to figure out the relative number of samples between code lines. Even though you will not be able to directly determine the total number of events that occurred at a particular source line, you will be able to figure out the line with the most events, and that is usually enough to start debugging a performance problem. This inability to retrieve exact event counts is present in every form of CPU sampling and will be present on any processor performance hardware that implements it. However, it is really this limitation that allows the performance hardware to exist at all; without it, the performance monitoring would cause too much overhead.

oprofile can monitor many different types of events, which vary on different processors. This section outlines some of the important events on the Pentium-III as an example. The Pentium-III can monitor the L1 data cache (called the DCU on the Pentium-III) by using the following events (events and descriptions provided by op_help):

[ezolt@localhost book]$ op_help
oprofile: available events for CPU type "PIII"

See Intel Architecture Developer's Manual Volume 3, Appendix A and
Intel Architecture Optimization Reference Manual (730795-001)
....
DCU_LINES_IN: total lines allocated in the DCU
DCU_M_LINES_IN: number of M state lines allocated in DCU
DCU_M_LINES_OUT: number of M lines evicted from the DCU
DCU_MISS_OUTSTANDING: number of cycles while DCU miss outstanding
...

Notice that we do not have an exact correspondence to what cachegrind provided, which was the number of "reads and writes" to the L1 data cache.
However, we can figure out how many L1 data cache lines each function pulled in by using the DCU_LINES_IN event. Although this event does not tell us the exact number of misses that each function had, it should tell us how much each function missed in the cache relative to the others.

The events monitoring the L2 data cache are a little closer, but they still do not correspond exactly to what cachegrind was providing. Here are the relevant L2 data cache events for the Pentium-III:

[ezolt@localhost book]$ op_help
.....
L2_LD: number of L2 data loads
        Unit masks
        ----------
        0x08: (M)odified cache state
        0x04: (E)xclusive cache state
        0x02: (S)hared cache state
        0x01: (I)nvalid cache state
        0x0f: All cache states
L2_ST: number of L2 data stores
        Unit masks
        ----------
        0x08: (M)odified cache state
        0x04: (E)xclusive cache state
        0x02: (S)hared cache state
        0x01: (I)nvalid cache state
        0x0f: All cache states
L2_LINES_IN: number of allocated lines in L2
....

The Pentium-III actually supports many more than these, but these are the basic load and store events. (cachegrind calls a load a "read" and a store a "write.") We can use these events to count the number of loads and stores that happened at each source line. We can also use L2_LINES_IN to show which pieces of code had more L2 cache misses. As mentioned earlier, we will not get an exact value. In addition, on the Pentium-III, instructions and data share the L2 cache, so any L2 misses could be the result of either instruction or data misses. oprofile shows us the combined cache misses that occurred in the L2 as a result of instruction and data misses, whereas cachegrind helpfully splits them up for us.

On the Pentium-III, there is a similar series of events to monitor the instruction cache. For the L1 instruction cache (what the Pentium-III calls the "IFU"), we can measure the number of reads (or fetches) and misses directly by using the following events:

[ezolt@localhost book]$ op_help
...
IFU_IFETCH: number of non/cachable instruction fetches
IFU_IFETCH_MISS: number of instruction fetch misses
...

We can also measure the number of instructions that were fetched from the L2 cache by using the following event:

[ezolt@localhost book]$ op_help
...
L2_IFETCH: (counter: 0, 1)
        number of L2 instruction fetches (min count: 500)
        Unit masks
        ----------
        0x08: (M)odified cache state
        0x04: (E)xclusive cache state
        0x02: (S)hared cache state
        0x01: (I)nvalid cache state
        0x0f: All cache states
...

Unfortunately, as mentioned previously, the processor shares the L2 cache between instructions and data, so there is no way to distinguish cache misses caused by data usage from those caused by instruction usage. We can use these events to approximate the same information that cachegrind provided.

If it is so difficult to extract the same information, why use oprofile at all? First, oprofile is much lower overhead (< 10 percent) and can be run on production applications. Second, oprofile measures the events that actually happen on the real processor. This is much better than a possibly inaccurate simulator that does not take into account cache usage by the operating system or other applications.

Although these events are available on the Pentium-III, they are not necessarily available on other processors. Each family of Intel and AMD processors has a different set of events that can be used to reveal different amounts of information about the performance of the memory subsystem.
op_help shows the events that can be monitored, but it might require reading Intel's or AMD's detailed processor documentation to understand what they mean.

To understand how you can use oprofile to extract cache information, compare the cache usage statistics of an application using both the virtual CPU of cachegrind and the actual CPU with oprofile. Let's run the burn program that we have been using as an example. Once again, it is an application that executes ten times the number of instructions in function a() as in function b(). It also accesses ten times the amount of data in function a() as in function b(). Here is the output of cachegrind for this demo application:

--------------------------------------------------------------------------------
Ir           I1mr I2mr Dr          D1mr        D2mr  Dw        D1mw    D2mw
--------------------------------------------------------------------------------
883,497,211  215  214  440,332,658 110,000,288 1,277 2,610,954 312,534 312,533  PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir           I1mr I2mr Dr          D1mr        D2mr  Dw        D1mw    D2mw     file:function
--------------------------------------------------------------------------------
800,900,011  2    2    400,300,003 100,000,000 989   100,004   0       0        ???:a
80,090,011   2    2    40,030,003  10,000,000  0     10,004    0       0        ???:b

In the next few examples, I run the data collection phase of oprofile multiple times. I use oprof_start to set up the events for the particular run, and then run the demo application. Because my CPU has only two counters, I have to do this multiple times, which means that different sets of events are monitored during different executions of the program. Because my application does not change how it executes from run to run, each run should produce similar results. This is not necessarily true for a more complicated application, such as a Web server or database, each of which can dramatically change how it executes based on the requests made to it. However, for the simple test application, it works just fine.

After collecting this sampling information, we use opreport to extract it. As shown in Listing 5.9, we query oprofile about the number of data memory references that were made and how many times there was a miss in the L1 data cache (DCU). As cachegrind told us, ten times the number of both memory references and L1 data misses occur in function a().

Listing 5.9.
[ezolt@wintermute test_app]$ opreport -l ./burn event:DATA_MEM_REFS,DCU_LINES_IN
CPU: PIII, speed 467.74 MHz (estimated)
Counted DATA_MEM_REFS events (all memory references, cachable and non) with a unit mask of 0x00 (No unit mask) count 233865
Counted DCU_LINES_IN events (total lines allocated in the DCU) with a unit mask of 0x00 (No unit mask) count 23386
vma      samples  %        samples  %        symbol name
08048348 3640     90.8864  5598     90.9209  a
0804839e 365       9.1136  559       9.0791  b

Now look at Listing 5.10, which shows opreport examining similar information for the instruction cache. Notice that the number of instructions executed in function a() is ten times the number executed in function b(), as we expect. However, notice that the number of L1 instruction cache misses differs from what cachegrind predicts. It is most likely that other applications and the kernel are polluting the cache and causing burn to miss in the icache. (Remember that cachegrind does not take kernel or other application cache usage into account.)
This also probably happened with the data cache, but because the number of data cache misses caused by the application was so high, the extra events were lost in the noise.

Listing 5.10.
[ezolt@wintermute test_app]$ opreport -l ./burn event:IFU_IFETCH,IFU_IFETCH_MISS
CPU: PIII, speed 467.74 MHz (estimated)
Counted IFU_IFETCH events (number of non/cachable instruction fetches) with a unit mask of 0x00 (No unit mask) count 233870
Counted IFU_IFETCH_MISS events (number of instruction fetch misses) with a unit mask of 0x00 (No unit mask) count 500
vma      samples  %        samples  %        symbol name
08048348 8876     90.9240  14       93.3333  a
0804839e 886       9.0760  1         6.6667  b

As you can see when comparing the output of cachegrind and oprofile, using oprofile to gather memory performance information is powerful because oprofile is low overhead and uses the processor hardware directly, but it can be difficult to find events that match those that you are interested in.

5.2.7. ipcs

ipcs is a tool that shows information about the system-wide interprocess communication memory. Processes can allocate system-wide shared memory, semaphores, or message queues that can be shared by multiple processes running on the system. ipcs is best used to track down which applications are allocating and using large amounts of shared memory.

5.2.7.1 Memory Performance-Related Options

ipcs is invoked with the following command line:

ipcs [-t] [-c] [-l] [-u] [-p]

If ipcs is invoked without any parameters, it gives a summary of all the shared memory on the system. This includes information about the owner and size of each shared memory segment. Table 5-6 describes options that cause ipcs to display different types of information about the shared memory in the system.

Because shared memory is used by multiple processes, it cannot be attributed to any particular process. ipcs provides enough information about the state of the system-wide shared memory to determine which processes allocated the shared memory, which processes are using it, and how often they are using it. This information proves useful when trying to reduce shared memory usage.

5.2.7.2 Example Usage

First, in Listing 5.11, we ask ipcs how much of the system memory is being used for shared memory. This is a good overall indication of the state of shared memory in the system.

Listing 5.11.
[ezolt@wintermute tmp]$ ipcs -u

------ Shared Memory Status --------
segments allocated 21
pages allocated 1585
pages resident  720
pages swapped   412
Swap performance: 0 attempts     0 successes

------ Semaphore Status --------
used arrays = 0
allocated semaphores = 0

------ Messages: Status --------
allocated queues = 0
used headers = 0
used space = 0 bytes

In this case, we can see that 21 different segments, or pieces of shared memory, have been allocated. All these segments consume a total of 1,585 pages of memory; 720 of these currently reside in physical memory and 412 have been swapped to disk.
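The segments that ipcs reports are created through the System V shared memory interface. As a rough sketch (with a hypothetical size chosen to mirror the segments shown below, and not one of the chapter's listings), a process creates and attaches a segment like this, after which the segment appears in the ipcs output with that process as the creator PID (cpid):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create a 393216-byte segment readable and writable by the owner. */
    int shmid = shmget(IPC_PRIVATE, 393216, IPC_CREAT | 0600);
    if (shmid < 0) {
        perror("shmget");
        return 1;
    }

    /* Attach it to our address space; other processes can attach the
       same shmid to share the memory. */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (char *) -1) {
        perror("shmat");
        return 1;
    }
    mem[0] = 42;                      /* use the shared memory */
    printf("created shmid %d\n", shmid);

    shmdt(mem);                       /* detach */
    /* Mark the segment for removal once every process has detached. */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}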
Next, in Listing 5.12, we ask ipcs for a general overview of all the shared memory segments in the system. This indicates who is using each memory segment. In this case, we see a list of all the shared segments. For one in particular, the one with a shared memory ID of 65538, the user (ezolt) is the owner. It has a permission of 600 (a typical UNIX permission), which in this case means that only ezolt can read and write to it. It is 393,216 bytes in size, and 2 processes are attached to it.

Listing 5.12.
[ezolt@wintermute tmp]$ ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       777        49152      1
0x00000000 32769      root       777        16384      1
0x00000000 65538      ezolt      600        393216     2          dest
0x00000000 98307      ezolt      600        393216     2          dest
0x00000000 131076     ezolt      600        393216     2          dest
0x00000000 163845     ezolt      600        393216     2          dest
0x00000000 196614     ezolt      600        393216     2          dest
0x00000000 229383     ezolt      600        393216     2          dest
0x00000000 262152     ezolt      600        393216     2          dest
0x00000000 294921     ezolt      600        393216     2          dest
0x00000000 327690     ezolt      600        393216     2          dest
0x00000000 360459     ezolt      600        393216     2          dest
0x00000000 393228     ezolt      600        393216     2          dest
0x00000000 425997     ezolt      600        393216     2          dest
0x00000000 458766     ezolt      600        393216     2          dest
0x00000000 491535     ezolt      600        393216     2          dest
0x00000000 622608     ezolt      600        393216     2          dest
0x00000000 819217     root       644        110592     2          dest
0x00000000 589842     ezolt      600        393216     2          dest
0x00000000 720916     ezolt      600        12288      2          dest
0x00000000 786454     ezolt      600        12288      2          dest

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Finally, we can figure out exactly which processes created the shared memory segments and which other processes are using them, as shown in Listing 5.13. For the segment with shmid 65538, we can see that PID 1229 created it and 11954 was the last to use it.

Listing 5.13.
[ezolt@wintermute tmp]$ ipcs -p

------ Shared Memory Creator/Last-op --------
shmid      owner      cpid       lpid
0          root       1224       11954
32769      root       1224       11954
65538      ezolt      1229       11954
98307      ezolt      1229       11954
131076     ezolt      1276       11954
163845     ezolt      1276       11954
196614     ezolt      1285       11954
229383     ezolt      1301       11954
262152     ezolt      1307       11954
294921     ezolt      1309       11954
327690     ezolt      1313       11954
360459     ezolt      1305       11954
393228     ezolt      1321       11954
425997     ezolt      1321       11954
458766     ezolt      1250       11954
491535     ezolt      1250       11954
622608     ezolt      1313       11954
819217     root       1224       11914
589842     ezolt      1432       14221
720916     ezolt      1250       11954
786454     ezolt      1313       11954

------ Message Queues PIDs --------
msqid      owner      lspid      lrpid

After we have the PID responsible for the allocation and use, we can use a command such as ps -o command <PID> to track the PID back to the process name. If shared memory usage becomes a significant portion of the system total, ipcs is a good way to track down the exact programs that are creating and using the shared memory.

5.2.8. Dynamic Languages (Java, Mono)

As with the CPU performance tools, most of the tools discussed in this chapter support analysis of static languages such as C and C++. Of the tools that we investigated, only ps, /proc, and ipcs work with dynamic languages such as Java, Mono, Python, and Perl. The cache and memory-profiling tools, such as oprofile, cachegrind, and memprof, do not. As with CPU profiling, each of these languages provides custom tools to extract information about memory usage.

For Java applications, if the java command is run with the -Xrunhprof command-line option, it profiles the application's memory usage. You can find more details at http://antprof.sourceforge.net/hprof.html or by running the java command with the -Xrunhprof:help option. For Mono applications, if the mono executable is passed the --profile flag, it also profiles the memory usage of the application. You can find more details about this at http://www.go-mono.com/performance.html. Perl and Python do not appear to have similar functionality.