2. Related Work
Few other profiling systems can monitor complete system activity with high-frequency sampling and low overhead. Only ours and Morph[Zha97] are designed to run continuously for long periods on production systems, which is essential for obtaining useful profiles of large, complex applications such as databases. In addition, we know of no other system that can analyze time-biased samples to produce accurate fine-grained information about the number of cycles taken by each instruction and the reasons for stalls; the only other tools that produce similar information are simulators, at much higher cost.
Table 1 below compares several profiling systems. The overhead column describes how much profiling slows down the target program; low overhead is defined arbitrarily as less than 20%. The scope column shows whether the profiling system is restricted to a single application (App) or can measure full system activity (Sys). The grain column indicates the range over which an individual measurement applies. For example, gprof counts procedure executions, whereas pixie can count executions of each instruction. The prof tool goes even further and reports the time spent executing each instruction, which, given the wide variation in the latencies of different instructions, is often more useful than an execution count alone. The stalls column indicates whether, and how accurately, the system can subdivide the time spent at an instruction into components such as cache-miss latency and branch-misprediction delays.
Table 1: Comparison of profiling systems.

| System                         | Overhead | Scope | Grain           | Stalls     |
|--------------------------------|----------|-------|-----------------|------------|
| pixie                          | High     | App   | inst count      | none       |
| gprof                          | High     | App   | proc count      | none       |
| jprof                          | High     | App   | proc count      | none       |
| quartz                         | High     | App   | proc count      | none       |
| MTOOL                          | High     | App   | inst count/time | inaccurate |
| SimOS                          | High     | Sys   | inst time       | accurate   |
| SpeedShop (pixie)              | High     | App   | inst count      | none       |
| VTune (dynamic)                | High     | App   | inst time       | accurate   |
| prof                           | Low      | App   | inst time       | none       |
| iprobe                         | High     | Sys   | inst time       | inaccurate |
| Morph                          | Low      | Sys   | inst time       | none       |
| VTune (sampler)                | Low      | Sys   | inst time       | inaccurate |
| SpeedShop (timer and counters) | Low      | Sys   | inst time       | inaccurate |
| DCPI                           | Low      | Sys   | inst time       | accurate   |
The systems fall into two groups. The first includes pixie[MIPS90], gprof[GraKM82], jprof[ReiS94], quartz[AndL91], MTOOL[GolH93], SimOS[RosHWG95], part of SGI's SpeedShop[Zag96], and Intel's VTune dynamic analyzer[VTune]. These systems use binary modification, compiler support, or direct simulation of programs to gather measurements. They all have high overhead and usually require significant user intervention. The slowdown is too large for continuous measurements during production use, despite techniques that reduce instrumentation overhead substantially[BalL94]. In addition, only the simulation-based systems provide accurate information about the locations and causes of stalls.
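The overhead of this first group comes from the probe code executed on every counted event. As a toy Python sketch, not the implementation of pixie or gprof (whose probes are inserted at the binary or compiler level), a per-block execution counter might look like this; all names here are hypothetical:

```python
from collections import Counter

block_counts = Counter()

def instrumented(block_id):
    """Stand-in for the counter increment an instrumentation tool
    would insert at the entry of each basic block or procedure."""
    def wrap(fn):
        def run(*args):
            block_counts[block_id] += 1  # the inserted probe: runs on every execution
            return fn(*args)
        return run
    return wrap

@instrumented("loop_body")
def loop_body(x):
    return x * x

total = sum(loop_body(i) for i in range(1000))
print(block_counts["loop_body"])  # exact execution count: 1000
```

The count is exact, but the probe executes on every iteration, which is why instrumentation-based tools slow programs down too much for continuous use in production.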
The systems in the second group use statistical sampling to collect fine-grained information on program or system behavior. Some sampling systems, including Morph[Zha97], prof[prof], and part of SpeedShop, rely on an existing source of interrupts (e.g., timer interrupts) to generate program-counter samples. This prevents them from sampling within those interrupt routines, and can also result in correlations between the sampling and other system activity. By using hardware performance counters and randomizing the interval between samples, we are able to sample activity within essentially the entire system (except for our interrupt handler itself) and to avoid correlations with any other activity. This issue is discussed further in Section 4.1.1.
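The effect of randomizing the inter-sample interval can be sketched in a few lines of Python. The simulation below is a hypothetical illustration, not DCPI code: it models some system activity with a fixed period and shows that fixed-interval samples always land at the same phase of that activity (systematically over- or under-counting it), while randomized intervals spread samples across the whole period:

```python
import random

PERIOD = 1000         # hypothetical periodic activity: an event every 1000 cycles
N_SAMPLES = 10_000
MEAN_INTERVAL = 1000  # average cycles between profiling interrupts

def phases(intervals):
    """Return the phase (cycle offset within PERIOD) at which each sample lands."""
    t, out = 0, []
    for dt in intervals:
        t += dt
        out.append(t % PERIOD)
    return out

# Fixed-interval sampling: every sample lands at the same phase.
fixed = phases([MEAN_INTERVAL] * N_SAMPLES)
assert len(set(fixed)) == 1

# Randomized intervals: sample phases spread across the whole period,
# breaking the correlation with the periodic activity.
random.seed(0)
rand = phases(random.randrange(MEAN_INTERVAL // 2, 3 * MEAN_INTERVAL // 2)
              for _ in range(N_SAMPLES))
print(len(set(rand)) > PERIOD // 2)  # phases cover much of the period: True
```

In a real profiler the random interval is loaded into a hardware cycle counter before each sampling interrupt returns, rather than simulated as above.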
Other systems that use performance counters, including iprobe[Ipr], the VTune sampler[VTune], and part of SpeedShop, share some of the characteristics of our system. However, iprobe and VTune cannot be used for continuous profiling, mostly because they consume too much memory for sample data. In addition, iprobe, the VTune sampler, and SpeedShop all fail to map the sample data accurately back to individual instructions. In contrast, our tools produce an accurate accounting of the stall cycles incurred by each instruction and the potential reason(s) for the stalls.