1. Introduction
The performance of programs running on modern high-performance computer systems is often hard to understand. Processor pipelines are complex, and memory system effects have a significant impact on performance. When a single program or an entire system does not perform as well as desired or expected, it can be difficult to pinpoint the reasons. The DIGITAL Continuous Profiling Infrastructure provides an efficient and accurate way of answering such questions.
The system consists of two parts, each with novel features: a data collection subsystem that samples program counters and records them in an on-disk database, and a suite of analysis tools that analyze the stored profile information at several levels, from the fraction of CPU time consumed by each program to the number of stall cycles for each individual instruction. The information produced by the analysis tools guides users to time-critical sections of code and explains in detail the static and dynamic delays incurred by each instruction.
We faced two major challenges in designing and implementing our profiling system: efficient data collection for a very high sampling rate, and the identification and classification of processor stalls from program-counter samples. The data collection system uses periodic interrupts generated by performance counters available on DIGITAL Alpha processors to sample program counter values. (Other processors, such as Intel's Pentium Pro and SGI's R10K, also have similar hardware support.) Profiles are collected for unmodified executables, and all code is profiled, including applications, shared libraries, device drivers, and the kernel. Thousands of samples are gathered each second, allowing useful profiles to be gathered in a relatively short time. Profiling is also efficient: overhead is about 1-3% of the processor time, depending on the workload. This permits the profiling system to be run continuously on production systems and improves the quality of the profiles by minimizing the perturbation of the system induced by profiling.
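The overhead figure above implies a tight per-sample cycle budget for the interrupt handler. The following back-of-the-envelope calculation illustrates the constraint; the specific clock rate and sampling rate are hypothetical figures chosen for illustration, not numbers from this paper.

```python
# Hypothetical figures (for illustration only): suppose the profiler
# takes 5200 samples per second on a 500 MHz Alpha.  A 1% overhead
# budget then allows 5e6 cycles/sec of profiling work, i.e. roughly
# 960 cycles per sample for interrupt handling and sample buffering.
clock_hz = 500e6
samples_per_sec = 5200
overhead_fraction = 0.01

cycles_per_sample = clock_hz * overhead_fraction / samples_per_sec
print(f"budget: ~{cycles_per_sample:.0f} cycles per sample")
```

With budgets this small, the data collection path must avoid expensive work such as disk I/O or hashing large tables on every interrupt, which motivates the buffering techniques described in Section 4.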
The collected profiles contain time-biased samples of program counter values: the number of samples associated with a particular program counter value is proportional to the total time spent executing that instruction. Samples that show the relative number of cache misses, branch mispredictions, etc. incurred by individual instructions may also be collected if the processor's performance counters support such events.
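The proportionality between sample counts and execution time means that converting raw samples into per-instruction time estimates is a simple normalization. The sketch below illustrates the idea with a toy aggregation; the function name and sample values are hypothetical, not part of the system described here.

```python
from collections import Counter

def estimate_time_per_pc(samples, total_cpu_seconds):
    """Aggregate time-biased program-counter samples into per-PC time
    estimates: since the sample count at a PC is proportional to the
    time spent executing that instruction, each PC's share of the total
    samples is its share of the total CPU time."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {pc: total_cpu_seconds * n / total for pc, n in counts.items()}

# Hypothetical data: PC 0x1200 sampled three times as often as 0x1204,
# so it is credited with three quarters of the 8 seconds of CPU time.
times = estimate_time_per_pc([0x1200, 0x1200, 0x1200, 0x1204], 8.0)
```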
Some of the analysis tools use the collected samples to generate the usual histograms of time spent per image, per procedure, per source line, or per instruction. Other analysis tools use a detailed machine model and the heuristics described in Section 6 to convert time-biased samples into the average number of cycles spent executing each instruction, the number of times each instruction was executed, and possible explanations for any static or dynamic stalls. Our techniques can deduce this information entirely from the time-biased program counter profiles and the binary executable, although the other types of samples, if available, may also be used to improve the accuracy of the results.
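One intuition behind this kind of analysis (a simplified sketch, not the full heuristics of Section 6) is that all instructions in a basic block execute the same number of times, so an instruction that never stalls accumulates samples in proportion to its minimum issue latency; the instruction with the fewest samples per issue cycle therefore anchors the block's execution-frequency estimate, and dividing each instruction's samples by that frequency gives its average CPI. The function name and the numbers below are hypothetical.

```python
def estimate_frequency_and_cpi(samples, min_cycles):
    """Toy frequency/CPI heuristic for one basic block.

    samples[i]    -- time-biased sample count for instruction i
    min_cycles[i] -- minimum issue cycles for instruction i (from a
                     machine model), i.e. its CPI if it never stalls

    A non-stalling instruction satisfies samples/freq == min_cycles,
    so the smallest samples/min_cycles ratio estimates the block's
    execution frequency; each instruction's samples divided by that
    frequency is its estimated average CPI (excess over min_cycles
    suggests stalls)."""
    freq = min(s / c for s, c in zip(samples, min_cycles))
    cpis = [s / freq for s in samples]
    return freq, cpis

# Hypothetical block of three single-issue-cycle instructions: the
# third instruction's extra samples indicate it stalls for ~4 cycles.
freq, cpis = estimate_frequency_and_cpi([2, 2, 10], [1, 1, 1])
```

The real analysis must additionally cope with sampling noise, multi-issue pipelines, and blocks where every instruction stalls, which is why Section 6 describes the full set of heuristics.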
Section 3 contains several examples of the output from our tools. As discussed there, the combination of fine-grained instruction-level analysis and detailed profiling of long-running workloads has produced insights into performance that are difficult to achieve with other tools. These insights have been used to improve the performance of several major commercial applications.
The output of the analysis tools can be used directly by programmers; it can also be fed into compilers, linkers, post-linkers, and run-time optimization tools. The profiling system is freely available on the Web at http://www.research.digital.com/SRC/dcpi; it has been running on DIGITAL Alpha processors under DIGITAL Unix since September 1996, and ports are in progress to Alpha/NT and OpenVMS. Work is underway to feed the output of our tools into DIGITAL's optimizing backend [Bli92] and into the Spike/OM post-linker optimization framework [CohL96, CohGLR97]. We are also studying new kinds of profile-driven optimizations made possible by the fine-grained instruction-level profile information provided by our system.
Section 2 discusses other profiling systems. Section 3 illustrates the use of our system. Sections 4 and 5 describe the design and performance of our data collection system, highlighting the techniques used to achieve low overhead with a high sampling rate. Section 6 describes the subtle and interesting techniques used in our analysis tools, explaining how to derive each instruction's CPI, execution frequency, and explanations for stalls from the raw sample counts. Finally, Section 7 discusses future work and Section 8 summarizes our results.