The goal of my summer work was to initiate analysis techniques for the ProfileMe data generated by Alpha 21264A processors. The Continuous Profiling Infrastructure was able to display raw events and aggregate event samples in simple ways, but no code existed for explaining static and dynamic stalls, calculating the cost of traps, and other important analyses.
The early part of the summer was consumed by becoming acquainted with both the large existing code base and the architecture of the 21264A. Technical documentation was certainly useful, but the best learning tool turned out to be the register-transfer-level (RTL) simulator GUI. Simple code sequences could be simulated, with some of the internal processor state displayed in graphical form. Since CPI analysis techniques are based on a solid understanding of the inner dynamics of the chip being profiled, this background work was necessary for my project.
The first problem I tackled was one of software engineering. Adding specialized analysis routines for ProfileMe data illustrated the need for an interface that determines which analyses can be performed given a particular set of samples. For example, neither of the primary data analysis methods (execution count estimation and blame assignment) was initially supported for ProfileMe samples.
The obvious follow-on to this project was supporting execution count estimation using ProfileMe samples. This was easily accomplished, since the number of ProfileMe samples for a given instruction is directly proportional to the number of times it executed! By contrast, event-based sampling had necessitated tricky heuristics for deriving the execution count from the number of samples.
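The proportionality above can be illustrated with a minimal sketch. It is not the CPI code: the function name, the dictionary layout, and the `mean_sampling_interval` parameter are all hypothetical, and it assumes every executed instruction is equally likely to be sampled, on average once per `mean_sampling_interval` instructions.

```python
def estimate_execution_counts(sample_counts, mean_sampling_interval):
    """Scale per-instruction sample counts into estimated execution counts.

    sample_counts: dict mapping instruction address -> number of samples
                   (hypothetical layout, for illustration only).
    mean_sampling_interval: assumed average number of instructions
                   between consecutive samples.
    """
    return {pc: n * mean_sampling_interval for pc, n in sample_counts.items()}


# Hypothetical data: two instructions in a hot loop body, one outside it.
samples = {0x1000: 12, 0x1004: 12, 0x1008: 3}
estimates = estimate_execution_counts(samples, mean_sampling_interval=64_000)
```

Under these assumptions the estimate is a single multiplication per instruction, which is why no heuristics are needed; the statistical error of each estimate still shrinks only as more samples are collected.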
The second phase of the project involved trying to "recover" cycles which were impossible to measure using the ProfileMe hardware as implemented. Several conditions cause the reported "retire delay" to be shorter than the actual delay. By consulting the RTL simulator and hardware specification, we were able to determine where a portion of the missing cycles was being spent.
The bulk of the summer ended up being focused on the analysis of ProfileMe data about "_trapping_" and "_aborted_" instructions. In an out-of-order processor, significant amounts of time may be spent on instructions that are ultimately discarded. Raw data from ProfileMe can identify instructions that frequently trap, but cannot convey the cost of the traps, and may not even clearly indicate the cause of the trap!
One technique we developed was an off-line algorithm for processing trap and abort samples to estimate how many aborts were due to each trapping instruction; this is a good first step toward determining the cost of each trap. The accuracy of this method still needs to be evaluated more carefully. Because of the statistical nature of the data we gather, and the complexity of run-time behavior, it is very difficult to correlate aborts with the traps that caused them. With more time and clever heuristics, I'm sure this tool will prove to be useful.
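To make the attribution problem concrete, here is one simple heuristic of this general kind. It is an illustrative sketch only and not the algorithm we developed: it assumes straight-line code, ignores control flow, and uses a hypothetical window of 80 instructions (4 bytes each) as the range over which an older trapping instruction can be blamed for a younger aborted one. All names and data are invented for the example.

```python
from collections import defaultdict


def attribute_aborts(trap_samples, abort_samples, window_instructions=80):
    """Split each instruction's abort samples among nearby older trappers.

    trap_samples / abort_samples: dict mapping address -> sample count.
    An abort at address A is blamed on trapping instructions at lower
    addresses within the window, in proportion to their trap counts.
    """
    window_bytes = window_instructions * 4  # assumes 4-byte instructions
    blame = defaultdict(float)
    for abort_pc, n_aborts in abort_samples.items():
        candidates = {
            pc: n for pc, n in trap_samples.items()
            if n > 0 and 0 < abort_pc - pc <= window_bytes
        }
        total = sum(candidates.values())
        if total == 0:
            continue  # no plausible trapping instruction; leave unattributed
        for pc, n in candidates.items():
            blame[pc] += n_aborts * n / total
    return dict(blame)


# Hypothetical samples: two trapping instructions, three aborted ones.
traps = {0x100: 30, 0x110: 10}
aborts = {0x114: 8, 0x104: 5, 0x0F0: 2}
blame = attribute_aborts(traps, aborts)
```

In this example the aborts at `0x0F0` stay unattributed because no trapping instruction precedes them, which hints at why real traces, with branches and statistical noise, make the correlation so difficult.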
A second technique focused on replay traps, where two memory operations interact dynamically in such a way that bad data may be seen by future instructions. To fix such traps, one must know the identity of both instructions involved; only one of the instructions is directly identified by the ProfileMe hardware. We came up with a novel solution for this problem.
I found SRC to be a wonderful environment, full of very friendly and brilliant people, focused on a variety of important problems, and possessing a spirit of collaboration that was refreshing. Much credit goes to Sharon Perl, for making things run so smoothly. In addition, many thanks to my hosts, Bill Weihl and Mark Vandevoorde, and the other great people with whom I got to work closely.
The MS PowerPoint presentation entitled "Using Interpretation for Profiling the Alpha 21264a" provides additional details on the above work.