Continuous Profiling: Where Have All the Cycles Gone?

2. Related Work

Few other profiling systems can monitor complete system activity with high-frequency sampling and low overhead; only ours and Morph[Zha97] are designed to run continuously for long periods on production systems, which is essential for obtaining useful profiles of large, complex applications such as databases. In addition, we know of no other system that can analyze time-biased samples to produce accurate, fine-grained information about the number of cycles taken by each instruction and the reasons for stalls; the only other tools that can produce similar information use simulators, at much higher cost.

Table 1 below compares several profiling systems. The overhead column describes how much profiling slows down the target program; low overhead is defined arbitrarily as a slowdown of less than 20%. The scope column shows whether the profiling system is restricted to a single application (App) or can measure full system activity (Sys). The grain column indicates the range over which an individual measurement applies. For example, gprof counts procedure executions, whereas pixie can count executions of each instruction. Prof goes even further and reports the time spent executing each instruction, which, given the wide variations in the latencies of different instructions, is often more useful than just an execution count. The stalls column indicates whether and how well the system can subdivide the time spent at an instruction into components such as cache miss latency, branch misprediction delays, etc.

System                          Overhead  Scope  Grain            Stalls
pixie                           High      App    inst count       none
gprof                           High      App    proc count       none
jprof                           High      App    proc count       none
quartz                          High      App    proc count       none
MTOOL                           High      App    inst count/time  inaccurate
SimOS                           High      Sys    inst time        accurate
SpeedShop (pixie)               High      App    inst count       none
VTune (dynamic)                 High      App    inst time        accurate
prof                            Low       App    inst time        none
iprobe                          High      Sys    inst time        inaccurate
Morph                           Low       Sys    inst time        none
VTune (sampler)                 Low       Sys    inst time        inaccurate
SpeedShop (timer and counters)  Low       Sys    inst time        inaccurate
DCPI                            Low       Sys    inst time        accurate

Table 1: Profiling Systems

The systems fall into two groups. The first includes pixie[MIPS90], gprof[GraKM82], jprof[ReiS94], quartz[AndL91], MTOOL[GolH93], SimOS[RosHWG95], part of SGI's SpeedShop[Zag96], and Intel's VTune dynamic analyzer[VTune]. These systems use binary modification, compiler support, or direct simulation of programs to gather measurements. They all have high overhead and usually require significant user intervention. The slowdown is too large for continuous measurements during production use, despite techniques that reduce instrumentation overhead substantially[BalL94]. In addition, only the simulation-based systems provide accurate information about the locations and causes of stalls.
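To illustrate where that overhead comes from, the sketch below shows the kind of per-basic-block counting that a binary-modification tool such as pixie arranges. The names and the fixed table size are our own illustrative assumptions, not pixie's actual implementation; the point is simply that every executed block pays for an extra load, increment, and store.

    /* A minimal sketch (in C) of pixie-style basic-block counting.
     * NUM_BLOCKS and the names are illustrative assumptions, not
     * pixie's actual implementation. */
    #include <stdint.h>

    #define NUM_BLOCKS 4096                 /* assumed block count */
    static uint64_t block_count[NUM_BLOCKS];

    /* The binary rewriter inlines the equivalent of this at the
     * entry of every basic block, so each execution of a block
     * costs an extra load, add, and store. */
    static inline void count_block(unsigned block_id) {
        block_count[block_id]++;
    }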

The systems in the second group use statistical sampling to collect fine-grained information on program or system behavior. Some sampling systems, including Morph[Zha97], prof[prof], and part of SpeedShop, rely on an existing source of interrupts (e.g., timer interrupts) to generate program-counter samples. This prevents them from sampling within those interrupt routines, and can also result in correlations between the sampling and other system activity. By using hardware performance counters and randomizing the interval between samples, we are able to sample activity within essentially the entire system (except for our interrupt handler itself) and to avoid correlations with any other activity. This issue is discussed further in Section 4.1.1.
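As a concrete illustration of this randomization, the sketch below re-arms a cycle counter with a jittered overflow interval from inside the overflow handler. Here write_perf_counter() is a hypothetical stand-in for the real privileged driver interface, rand() stands in for a simple in-kernel pseudo-random generator, and the constants are illustrative rather than the system's actual settings (see Section 4).

    /* Minimal sketch (in C) of randomized inter-sample intervals.
     * write_perf_counter() is a hypothetical routine that loads
     * the counter so it overflows, and interrupts, after the given
     * number of cycles; constants are illustrative only. */
    #include <stdint.h>
    #include <stdlib.h>

    #define MEAN_INTERVAL 65536UL   /* average cycles per sample */
    #define JITTER_MASK   0x3FFFUL  /* roughly +/- 8K cycles of jitter */

    extern void write_perf_counter(uint64_t cycles_until_overflow);

    /* Called from the overflow handler after the PC sample has been
     * recorded: re-arming with a randomized interval keeps sampling
     * from locking step with periodic activity such as the clock
     * interrupt. */
    void rearm_sampling(void) {
        uint64_t interval = MEAN_INTERVAL - JITTER_MASK / 2
                            + ((uint64_t)rand() & JITTER_MASK);
        write_perf_counter(interval);
    }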

Other systems that use performance counters, including iprobe[Ipr], the VTune sampler[VTune], and part of SpeedShop, share some of the characteristics of our system. However, iprobe and VTune cannot be used for continuous profiling, largely because of the amount of memory they require for sample data. In addition, iprobe, the VTune sampler, and SpeedShop all fail to map the sample data accurately back to individual instructions. In contrast, our tools produce an accurate accounting of stall cycles incurred by each instruction and the potential reason(s) for the stalls.
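The essence of such per-instruction attribution is sketched below: each sampled PC is credited to the instruction that contains it. The flat per-image counter array and the names are our own simplification, not the system's actual data structures, which are described later in the paper; a fixed 4-byte instruction size is assumed, as on Alpha.

    /* Sketch (in C) of crediting a PC sample to an individual
     * instruction. The structure and names are illustrative
     * assumptions, not the system's actual implementation. */
    #include <stdint.h>

    typedef struct {
        uint64_t  text_base;   /* load address of the image's text */
        uint32_t  num_insts;   /* text size in instructions */
        uint64_t *counts;      /* one counter per instruction */
    } image_profile;

    static void tally_sample(image_profile *img, uint64_t pc) {
        uint64_t idx = (pc - img->text_base) / 4;  /* 4-byte insts */
        if (idx < img->num_insts)
            img->counts[idx]++;  /* credit containing instruction */
    }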


