3. Data Analysis Examples
Our system has been used to analyze and improve the performance of a wide range of complex commercial applications, including graphics systems, databases, industry benchmark suites, and compilers. For example, our tools pinpointed a performance problem in a commercial database system; fixing the problem reduced the response time of an SQL query from 180 to 14 hours. In another example, our tools' fine-grained instruction-level analyses identified opportunities to improve optimized code produced by DIGITAL's compiler, speeding up the mgrid SPECfp95 benchmark by 15%.
Our system includes a large suite of tools to analyze profiles at different levels of detail. In this section, we present several examples of the following tools:
Dcpiprof provides a high-level view of the performance of a workload. It reads a set of sample files and displays a listing of the number of samples per procedure, sorted by decreasing number of samples. (It can also list the samples by image, rather than by procedure.) Figure 1 shows the first few lines of the output of dcpiprof for a run of an X11 drawing benchmark. For example, the ffb8ZeroPolyArc routine accounts for 33.87% of the cycles for this workload. Notice that this profile includes code in the kernel (/vmunix) as well as code in shared libraries.
Total samples for event type cycles = 6095201, imiss = 1117002

The counts given below are the number of samples for each listed event type.
==============================================================================
 cycles      %   cum%   imiss     %  procedure               image
2064143  33.87% 33.87%  43443  3.89% ffb8ZeroPolyArc         /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49% 42.35%  86621  7.75% ReadRequestFromClient   /usr/shlib/X11/libos.so
 305072   5.01% 47.36%  18108  1.62% miCreateETandAET        /usr/shlib/X11/libmi.so
 271158   4.45% 51.81%  26479  2.37% miZeroArcSetup          /usr/shlib/X11/libmi.so
 245450   4.03% 55.84%  11954  1.07% bcopy                   /vmunix
 209835   3.44% 59.28%  12063  1.08% Dispatch                /usr/shlib/X11/libdix.so
 186413   3.06% 62.34%  36170  3.24% ffb8FillPolygon         /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80% 65.14%  20243  1.81% in_checksum             /vmunix
 161326   2.65% 67.78%   4891  0.44% miInsertEdgeInET        /usr/shlib/X11/libmi.so
 133768   2.19% 69.98%   1546  0.14% miX1Y1X2Y2InRegion      /usr/shlib/X11/libmi.so

Figure 1: The first few lines of dcpiprof output for an X11 drawing benchmark.
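For concreteness, the following C sketch reproduces the aggregation and sorting step behind such a listing. The record layout, the hard-coded totals (copied from a few rows of Figure 1), and the output format are illustrative only; the real tool derives the per-procedure counts by mapping every sampled PC value to its enclosing procedure and image.

    /* Sketch of a dcpiprof-style listing: sort per-procedure sample
     * counts by decreasing cycles and print percentages.  The records
     * below are hard-coded for illustration; dcpiprof accumulates
     * them from its sample files. */
    #include <stdio.h>
    #include <stdlib.h>

    struct proc_samples {
        const char *procedure;  /* procedure name */
        const char *image;      /* executable or shared library */
        long cycles;            /* cycle samples for this procedure */
        long imiss;             /* I-cache miss samples */
    };

    /* Sort by decreasing number of cycle samples. */
    static int by_cycles_desc(const void *a, const void *b)
    {
        const struct proc_samples *pa = a, *pb = b;
        return (pa->cycles < pb->cycles) - (pa->cycles > pb->cycles);
    }

    int main(void)
    {
        struct proc_samples procs[] = {
            { "ReadRequestFromClient", "/usr/shlib/X11/libos.so",
              517464, 86621 },
            { "ffb8ZeroPolyArc", "/usr/shlib/X11/lib_dec_ffb_ev5.so",
              2064143, 43443 },
            { "bcopy", "/vmunix", 245450, 11954 },
        };
        size_t i, n = sizeof procs / sizeof procs[0];
        long total = 0;
        for (i = 0; i < n; i++)
            total += procs[i].cycles;

        qsort(procs, n, sizeof procs[0], by_cycles_desc);

        double cum = 0.0;  /* running cumulative percentage */
        printf("%9s %7s %7s %8s  %-22s %s\n",
               "cycles", "%", "cum%", "imiss", "procedure", "image");
        for (i = 0; i < n; i++) {
            double pct = 100.0 * procs[i].cycles / total;
            cum += pct;
            printf("%9ld %6.2f%% %6.2f%% %8ld  %-22s %s\n",
                   procs[i].cycles, pct, cum, procs[i].imiss,
                   procs[i].procedure, procs[i].image);
        }
        return 0;
    }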
Dcpicalc provides a detailed view of the time spent on each instruction in a procedure. Figure 2 illustrates the output of dcpicalc for the key basic block in a McCalpin-like copy benchmark [McC95], running on an AlphaStation 500 5/333. The copy benchmark runs the following loop, where n = 2000000 and the array elements are 64-bit integers:

    for (i = 0; i < n; i++)
        c[i] = a[i];

The compiler has unrolled the loop four times, resulting in four loads and stores per iteration. The code shown in Figure 2 drives the memory system at full speed.
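As a sketch, that four-way unrolling is equivalent to the following source form (the function name is ours; a real compiler would also emit cleanup code if n were not a multiple of 4, which it is here):

    /* Four-way-unrolled equivalent of the copy loop: four loads and
     * four stores per iteration, matching the four ldq/stq pairs in
     * Figure 2.  Assumes n is a multiple of 4 (true for n = 2000000). */
    void copy_unrolled(long *c, const long *a, long n)
    {
        long i;
        for (i = 0; i < n; i += 4) {
            c[i]     = a[i];
            c[i + 1] = a[i + 1];
            c[i + 2] = a[i + 2];
            c[i + 3] = a[i + 3];
        }
    }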
*** Best-case   8/13 =  0.62CPI
*** Actual    140/13 = 10.77CPI

  Addr   Instruction              Samples     CPI  Culprit

                                                   pD    p = branch mispredict
                                                   pD    D = DTB miss
009810   ldq     t4, 0(t1)           3126   2.0cy
009814   addq    t0, 0x4, t0            0  (dual issue)
009818   ldq     t5, 8(t1)           1636   1.0cy
00981c   ldq     t6, 16(t1)           390   0.5cy
009820   ldq     a0, 24(t1)          1482   1.0cy
009824   lda     t1, 32(t1)             0  (dual issue)
                                                   dwD   d = D-cache miss
                                       ...         dwD   18.0cy
                                                   dwD   w = write-buffer overflow
009828   stq     t4, 0(t2)          27766  18.0cy  9810
00982c   cmpult  t0, v0, t4             0  (dual issue)
009830   stq     t5, 8(t2)           1493   1.0cy
                                                   s     s = slotting hazard
                                                   dwD
                                       ...         dwD   114.5cy
                                                   dwD
009834   stq     t6, 16(t2)        174727 114.5cy  981c  s
009838   stq     a0, 24(t2)          1548   1.0cy
00983c   lda     t2, 32(t2)             0  (dual issue)
009840   bne     t4, 0x009810        1586   1.0cy

Figure 2: The key basic block in a McCalpin-like copy benchmark, running on an AlphaStation 500 5/333.
At the beginning of the basic block, dcpicalc shows summary information for the block. The first two lines display the best-case and actual cycles per instruction (CPI) for the block. The best-case scenario includes all stalls statically predictable from the instruction stream but assumes that there are no dynamic stalls (e.g., all load instructions hit in the D-cache). For the copy benchmark, we see that the actual CPI is quite high at 10.77, whereas the best theoretical CPI (if no dynamic stalls occurred) is only 0.62. This shows that dynamic stalls are the significant performance problem for this basic block.
Dcpicalc also lists the instructions in the basic block, annotated with information about the stall cycles (and program source code, if the image contains line number information). Above each assembly instruction that stalls, dcpicalc inserts bubbles to show the duration and possible cause of the stall. Each line of assembly code shows, from left to right, the instruction's address, the instruction, the number of PC samples at this instruction, the average number of cycles this instruction spent at the head of the issue queue, and the addresses of other instructions that may have caused this instruction to stall. Note that Alpha load and load-address instructions write their first operand; 3-register operators write their third operand.
Each line in the listing represents a half-cycle, so it is easy to see if instructions are being dual-issued. In the figure, we see that there are two large stalls, one for 18.0 cycles at instruction 009828, and another for 114.5 cycles at instruction 009834. The bubbles labeled dwD before the stalled stq instruction at 009828 indicate three possible reasons: a D-cache miss incurred by the ldq at 009810 (which provides the data needed by the stq), a write-buffer overflow, or a DTB miss. The stq instruction at 009834 is also stalled for the same three possible reasons. The lines labeled s indicate static stalls; in this case they are caused by the 21164 not being able to dual-issue adjacent stq instructions.
As expected, the listing shows that as the copy loop streams through the data, the performance bottleneck is mostly due to memory latency. Also, the six-entry write buffer on the 21164 is not able to retire the writes fast enough to keep up with the computation. DTB miss is perhaps not a real problem: the loop walks sequentially through the arrays, so with 8 KB pages and 8-byte elements it can incur a DTB miss at most once every 1024 iterations, when it crosses a page boundary. Dcpicalc would likely rule out DTB miss if given DTBMISS samples, but lists it as a possibility here because our analysis is designed to make pessimistic assumptions when information is limited.
Several benchmarks that we used to analyze the performance of the data collection system showed a noticeable variance in running times across different runs. We used our tools to examine one of these benchmarks, wave5 from the sequential SPECfp95 workload, in more detail.
We ran wave5 on an AlphaStation 500 5/333 and observed running times that varied by as much as 11%. We ran dcpistats on 8 sets of sample files to isolate the procedures that had the greatest variance; dcpistats reads multiple sets of sample files and computes statistics comparing the profile data in the different sets. The output of dcpistats for wave5 is shown in Figure 3.
Number of samples of type cycles
  set 1 = 860301      set 5 = 942929
  set 2 = 862645      set 6 = 893154
  set 3 = 871952      set 7 = 890969
  set 4 = 870780      set 8 = 951871
  TOTAL   7144601

Statistics calculated using the sample counts for each procedure from 8 different sample set(s)
===============================================================================================
range%        sum   sum%  N       mean   std-dev        min        max  procedure
11.32%  441040.00  6.17%  8   55130.00  21168.70   38155.00   88075.00  smooth_
 1.44%   72385.00  1.01%  8    9048.12    368.74    8578.00    9622.00  fftb_
 1.39%   71129.00  1.00%  8    8891.12    327.68    8467.00    9453.00  fftf_
 0.94% 4242079.00 59.37%  8  530259.87  14097.11  515253.00  555180.00  parmvr_
 0.68%  378622.00  5.30%  8   47327.75   1032.09   46206.00   48786.00  putb_
 0.65%  410929.00  5.75%  8   51366.13   1161.61   50420.00   53110.00  vslv1p_

Figure 3: Dcpistats output for eight runs of the SPECfp95 benchmark wave5.
The figure shows the procedures in the wave5 program, sorted by the normalized range, i.e., the difference between the maximum and minimum sample counts for that procedure, divided by the sum of the samples (a sketch of this computation appears after this paragraph). We see that the procedure smooth had a much larger range than any of the other procedures. Next, we ran dcpicalc on smooth for each profile, obtaining a summary of the fraction of cycles consumed by each type of dynamic and static stall within the procedure.
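A minimal sketch of the normalized-range computation. The per-set counts below are hypothetical, chosen so that only their minimum, maximum, and sum match the smooth_ row of Figure 3 (the figure publishes only min, max, mean, and sum for each row):

    #include <stdio.h>

    /* Normalized range for one procedure across several sample sets:
     * (max - min) / sum, as a percentage.  This is the range% column
     * that dcpistats sorts by. */
    static double range_pct(const long counts[], int nsets)
    {
        long min = counts[0], max = counts[0], sum = 0;
        int i;
        for (i = 0; i < nsets; i++) {
            if (counts[i] < min) min = counts[i];
            if (counts[i] > max) max = counts[i];
            sum += counts[i];
        }
        return 100.0 * (double)(max - min) / (double)sum;
    }

    int main(void)
    {
        /* Hypothetical per-set cycle counts for smooth_; only the min
         * (38155), max (88075), and sum (441040) match Figure 3. */
        long counts[8] = { 38155, 48000, 52000, 55000,
                           56000, 58000, 45810, 88075 };
        printf("range%% = %.2f\n", range_pct(counts, 8));  /* 11.32 */
        return 0;
    }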
The summary for the fastest run (the profile with the fewest samples) is shown in Figure 4. The summary for the slowest run (not shown) shows that the percentages of stall cycles attributed to D-cache miss, DTB miss, and write buffer overflow increase dramatically to 44.8-44.9%, 14.0-33.9%, and 0.0-18.3% respectively. The increase is probably in part due to differences in the virtual-to-physical page mapping across the different runs: if different data items are located on pages that map to the same location in the board cache, the number of conflict misses will increase.
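A small sketch of the underlying arithmetic, assuming a direct-mapped board cache; the 2 MB size and 64-byte line size here are made-up parameters, not the AlphaStation's actual geometry. Two data items conflict whenever their physical addresses differ by a multiple of the cache size, so the operating system's choice of page mapping can create or avoid such conflicts from run to run:

    #include <stdio.h>

    /* Hypothetical direct-mapped board cache: 2 MB, 64-byte lines.
     * Addresses that differ by a multiple of CACHE_SIZE map to the
     * same line and therefore evict each other. */
    #define CACHE_SIZE (2UL * 1024 * 1024)
    #define LINE_SIZE  64UL

    static unsigned long line_index(unsigned long paddr)
    {
        return (paddr % CACHE_SIZE) / LINE_SIZE;
    }

    int main(void)
    {
        unsigned long a = 0x100000;            /* some physical page */
        unsigned long b = a + CACHE_SIZE;      /* conflicts with a   */
        unsigned long c = a + CACHE_SIZE / 2;  /* no conflict with a */
        printf("a->%lu  b->%lu  c->%lu\n",
               line_index(a), line_index(b), line_index(c));
        return 0;
    }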
*** Best-case 14686/36016 = 0.41CPI,
*** Actual    35171/36016 = 0.98CPI
***
*** I-cache (not ITB)      0.0% to  0.3%
*** ITB/I-cache miss       0.0% to  0.0%
*** D-cache miss          27.9% to 27.9%
*** DTB miss               9.2% to 18.3%
*** Write buffer           0.0% to  6.3%
*** Synchronization        0.0% to  0.0%
***
*** Branch mispredict      0.0% to  2.6%
*** IMUL busy              0.0% to  0.0%
*** FDIV busy              0.0% to  0.0%
*** Other                  0.0% to  0.0%
***
*** Unexplained stall      2.3% to  2.3%
*** Unexplained gain      -4.3% to -4.3%
*** ----------------------------------------
*** Subtotal dynamic              44.1%
***
*** Slotting                       1.8%
*** Ra dependency                  2.0%
*** Rb dependency                  1.0%
*** Rc dependency                  0.0%
*** FU dependency                  0.0%
*** ----------------------------------------
*** Subtotal static                4.8%
*** ----------------------------------------
*** Total stall                   48.9%
*** Execution                     51.2%
*** Net sampling error            -0.1%
*** ----------------------------------------
*** Total tallied                100.0%
***     (35171, 93.1% of all samples)

Figure 4: Summary of how cycles are spent in the procedure smooth for the fast run of the SPECfp95 benchmark wave5.