Continuous Profiling: Where Have All the Cycles Gone?

3. Data Analysis Examples

Our system has been used to analyze and improve the performance of a wide range of complex commercial applications, including graphics systems, databases, industry benchmark suites, and compilers. For example, our tools pinpointed a performance problem in a commercial database system; fixing the problem reduced the response time of an SQL query from 180 to 14 hours. In another example, our tools' fine-grained instruction-level analyses identified opportunities to improve optimized code produced by DIGITAL's compiler, speeding up the mgrid SPECfp95 benchmark by 15%.

Our system includes a large suite of tools to analyze profiles at different levels of detail. In this section, we present several examples of the following tools:

Other tools annotate source and assembly code with sample counts, highlight the differences in two separate profiles for the same program, summarize where time is spent in an entire program (the percentage of cycles spent waiting for data-cache misses, etc.; see Figure 4 for an example of this kind of summary for a single procedure), translate profile data into pixie format, and produce formatted Postscript output of annotated control-flow graphs.

3.1. Procedure-Level Bottlenecks

Dcpiprof provides a high-level view of the performance of a workload. It reads a set of sample files and displays a listing of the number of samples per procedure, sorted by decreasing number of samples. (It can also list the samples by image, rather than by procedure.) Figure 1 shows the first few lines of the output of dcpiprof for a run of an X11 drawing benchmark. For example, the ffb8ZeroPolyArc routine accounts for 33.87% of the cycles for this workload. Notice that this profile includes code in the kernel (/vmunix) as well as code in shared libraries.


Total samples for event type cycles = 6095201, imiss = 1117002

The counts given below are the number of samples for each listed event type.
==============================================================================================
 cycles       %    cum%   imiss       % procedure              image
2064143  33.87%  33.87%   43443   3.89% ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%   86621   7.75% ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%   18108   1.62% miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%   26479   2.37% miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%   11954   1.07% bcopy                  /vmunix
 209835   3.44%  59.28%   12063   1.08% Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%   36170   3.24% ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%   20243   1.81% in_checksum            /vmunix
 161326   2.65%  67.78%    4891   0.44% miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%    1546   0.14% miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
Figure 1: The key procedures from an x11perf run.

3.2. Instruction-Level Bottlenecks

Dcpicalc provides a detailed view of the time spent on each instruction in a procedure. Figure 2 illustrates the output of dcpicalc for the key basic block in a McCalpin-like copy benchmark[McC95], running on an AlphaStation 500 5/333. The copy benchmark runs the following loop where n = 2000000 and the array elements are 64-bit integers:

  for (i = 0; i < n; i++)
    c[i] = a[i];
The compiler has unrolled the loop four times, resulting in four loads and stores per iteration. The code shown drives the memory system at full speed.


*** Best-case 8/13 =   0.62CPI
*** Actual  140/13 =  10.77CPI

Addr   Instruction       Samples   CPI  Culprit
    pD   p = branch mispredict
    pD   D = DTB miss
009810 ldq    t4, 0(t1)     3126   2.0cy
009814 addq   t0, 0x4, t0      0   (dual issue)
009818 ldq    t5, 8(t1)     1636   1.0cy
00981c ldq    t6, 16(t1)     390   0.5cy
009820 ldq    a0, 24(t1)    1482   1.0cy
009824 lda    t1, 32(t1)       0   (dual issue)
    dwD  d = D-cache miss
    dwD  ... 18.0cy
    dwD  w = write-buffer overflow
009828 stq    t4, 0(t2)    27766  18.0cy 9810
00982c cmpult t0, v0, t4       0   (dual issue)
009830 stq    t5, 8(t2)     1493   1.0cy
  s      s = slotting hazard
    dwD
    dwD ... 114.5cy
    dwD
009834 stq    t6, 16(t2)  174727 114.5cy 981c
  s
009838 stq    a0, 24(t2)    1548   1.0cy  
00983c lda    t2, 32(t2)       0   (dual issue)
009840 bne    t4, 0x009810  1586   1.0cy
Figure 2: Analysis of Copy Loop.

At the beginning of the basic block, dcpicalc shows summary information for the block. The first two lines display the best-case and actual cycles per instruction (CPI) for the block. The best-case scenario includes all stalls statically predictable from the instruction stream but assumes that there are no dynamic stalls (e.g., all load instructions hit in the D-cache). For the copy benchmark, we see that the actual CPI is quite high at 10.77, whereas the best theoretical CPI (if no dynamic stalls occurred) is only 0.62. This shows that dynamic stalls are the significant performance problem for this basic block.

Dcpicalc also lists the instructions in the basic block, annotated with information about the stall cycles (and program source code, if the image contains line number information). Above each assembly instruction that stalls, dcpicalc inserts bubbles to show the duration and possible cause of the stall. Each line of assembly code shows, from left to right, the instruction's address, the instruction, the number of PC samples at this instruction, the average number of cycles this instruction spent at the head of the issue queue, and the addresses of other instructions that may have caused this instruction to stall. Note that Alpha load and load-address instructions write their first operand; 3-register operators write their third operand.

Each line in the listing represents a half-cycle, so it is easy to see if instructions are being dual-issued. In the figure, we see that there are two large stalls, one for 18.0 cycles at instruction 009828, and another for 114.5 cycles at instruction 009834. The bubbles labeled dwD before the stalled stq instruction at 009828 indicate three possible reasons: a D-cache miss incurred by the ldq at 009810 (which provides the data needed by the stq), a write-buffer overflow, or a DTB miss. The stq instruction at 009834 is also stalled for the same three possible reasons. The lines labeled s indicate static stalls; in this case they are caused by the 21164 not being able to dual-issue adjacent stq instructions.

As expected, the listing shows that as the copy loop streams through the data the performance bottleneck is mostly due to memory latency. Also, the six-entry write buffer on the 21164 is not able to retire the writes fast enough to keep up with the computation. DTB miss is perhaps not a real problem since the loop walks through each page and may incur DTB misses only when crossing a page boundary. Dcpicalc will likely rule out DTB miss if given DTBMISS samples but lists it as a possibility here because our analysis is designed to make pessimistic assumptions when information is limited.

3.3. Comparing Performance

Several benchmarks that we used to analyze the performance of the data collection system showed a noticeable variance in running times across different runs. We used our tools to examine one of these benchmarks, wave5 from the sequential SPECfp95 workload, in more detail.

We ran wave5 on an AlphaStation 500 5/333 and observed running times that varied by as much as 11%. We ran dcpistats on 8 sets of sample files to isolate the procedures that had the greatest variance; dcpistats reads multiple sets of sample files and computes statistics comparing the profile data in the different sets. The output of dcpistats for wave5 is shown in Figure 3.


Number of samples of type cycles
set 1 = 860301   set 2 = 862645   set 3 = 871952   set 4 = 870780            TOTAL      7144601
set 5 = 942929   set 6 = 893154   set 7 = 890969   set 8 = 951871     

Statistics calculated using the sample counts for each procedure from 8 different sample set(s)
=============================================================================================
  range%        sum     sum%     N       mean    std-dev        min        max     procedure  
  11.32%  441040.00    6.17%     8   55130.00   21168.70   38155.00   88075.00     smooth_  
   1.44%   72385.00    1.01%     8    9048.12     368.74    8578.00    9622.00     fftb_    
   1.39%   71129.00    1.00%     8    8891.12     327.68    8467.00    9453.00     fftf_    
   0.94% 4242079.00   59.37%     8  530259.87   14097.11  515253.00  555180.00     parmvr_  
   0.68%  378622.00    5.30%     8   47327.75    1032.09   46206.00   48786.00     putb_    
   0.65%  410929.00    5.75%     8   51366.13    1161.61   50420.00   53110.00     vslv1p_  
Figure 3: Statistics across eight runs of the SPECfp benchmark wave5.

The figure shows the procedures in the wave5 program, sorted by the normalized range, i.e., the difference between the maximum and minimum sample counts for that procedure, divided by the sum of the samples. We see that the procedure smooth had a much larger range than any of the other procedures. Next, we ran dcpicalc on smooth for each profile, obtaining a summary of the fraction of cycles consumed by each type of dynamic and static stall within the procedure.

The summary for the fastest run (the profile with the fewest samples) is shown in Figure 4. The summary for the slowest run (not shown) shows that the percentages of stall cycles attributed to D-cache miss, DTB miss, and write buffer overflow increase dramatically to 44.8-44.9%, 14.0-33.9%, and 0.0-18.3% respectively. The increase is probably in part due to differences in the virtual-to-physical page mapping across the different runs--if different data items are located on pages that map to the same location in the board cache, the number of conflict misses will increase.


 *** Best-case 14686/36016 =   0.41CPI,
 *** Actual 35171/36016 =   0.98CPI
 *** 
 ***  I-cache (not ITB)   0.0% to  0.3%
 ***   ITB/I-cache miss   0.0% to  0.0%
 ***       D-cache miss  27.9% to 27.9%
 ***           DTB miss   9.2% to 18.3%
 ***       Write buffer   0.0% to  6.3%
 ***    Synchronization   0.0% to  0.0%
 *** 
 ***  Branch mispredict   0.0% to  2.6%
 ***          IMUL busy   0.0% to  0.0%
 ***          FDIV busy   0.0% to  0.0%
 ***              Other   0.0% to  0.0%
 *** 
 ***  Unexplained stall   2.3% to  2.3%
 ***   Unexplained gain  -4.3% to -4.3%
 *** ----------------------------------------
 ***   Subtotal dynamic                 44.1%
 *** 
 ***           Slotting       1.8%
 ***      Ra dependency       2.0%
 ***      Rb dependency       1.0%
 ***      Rc dependency       0.0%
 ***      FU dependency       0.0%
 *** ----------------------------------------
 ***    Subtotal static                  4.8%
 *** ----------------------------------------
 ***        Total stall                 48.9%
 ***          Execution                 51.2%
 *** Net sampling error                 -0.1%
 *** ----------------------------------------
 ***      Total tallied                100.0%
 *** (35171, 93.1% of all samples)
Figure 4: Summary of how cycles are spent in the procedure smooth for the fast run of the SPECfp95 benchmark wave5.

Beginning of paper
Abstract
1. Introduction
2. Related Work
3. Data Analysis Examples
4. Data Collection System
5. Profiling Performance
6. Data Analysis Overview
7. Future Directions
8. Conclusions
Acknowledgements
References

This paper was published in the Proceedings of the 16th ACM Symposium on Operating Systems Principles, October, 1997. Copyright 1997 by the Assocation for Computing Machinery. All rights reserved. Republished by permission.