Feedback-directed binary code specialization

 
Juan Navarro
Rice University
with
Sharon Smith, David Hunter
Alpha Technology Solutions

 
 
1. Introduction

The current approach to binary performance optimization is to apply all known or available optimizations at once. The problem is that individual optimizations interact in complex -- and sometimes negative -- ways, and deciding which subset to apply is a difficult task.

The use of profile-directed feedback to make that decision can help, because optimization is then guided by the program's actual behavior. In addition, run-time information that is not available to the compiler can be used to discover new optimization opportunities.

Specialization or partial evaluation is an optimization technique that eliminates unnecessary generality from an application; it can benefit from profile-directed feedback to identify that excess generality.
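
As a minimal illustration of the idea (a hypothetical example, not one of the benchmarks below), consider a generic routine that turns out to be called mostly with one argument value; the residual code left after specializing for that value drops the general-case work entirely:

    /* Generic routine: raises x to an arbitrary non-negative power n. */
    double power(double x, int n)
    {
        double r = 1.0;
        while (n-- > 0)
            r *= x;
        return r;
    }

    /* Residual code after specializing for the frequent value n == 2:
       the loop, the counter, and the accumulator all disappear. */
    double power_2(double x)
    {
        return x * x;
    }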

The goal of this project is to study the potential of specialization in isolation from other optimizations. In the long term, the results are intended to support a cost-benefit analysis of specialization, which would allow adaptation to specific workloads.
 

2. Spector

Spector is an off-line binary specializer that we built for this project. It optimizes an entire function for a particular value of one of its arguments.

The binary is first profiled to determine what the hot functions are. Then, hot functions are instrumented at the entry point to detect frequent values passed in the arguments. The instrumented program is run and the results are used to select what functions to specialize and for what argument and value.
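
The sketch below shows, at the source level, the kind of bookkeeping the entry-point instrumentation performs; the names and the table layout are illustrative only, since the real probes are inserted into the binary by Atom rather than written by hand:

    #include <stdint.h>

    #define TABLE_SIZE 1024   /* assumed large enough for the distinct values seen */

    struct value_count {
        uint64_t value;
        uint64_t count;
        int      used;
    };

    static struct value_count table[TABLE_SIZE];

    /* Called on every entry to a hot function with the value of the
       profiled argument; the most frequent value recorded here becomes
       the candidate for specialization. */
    void record_argument(uint64_t value)
    {
        unsigned i = (unsigned)(value % TABLE_SIZE);

        /* Simple open-addressing lookup (no overflow handling in this sketch). */
        while (table[i].used && table[i].value != value)
            i = (i + 1) % TABLE_SIZE;

        table[i].used  = 1;
        table[i].value = value;
        table[i].count++;
    }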

To specialize a function, Spector builds the function's control-flow graph, propagates known values through the graph by interpreting each basic block's code, removes unreachable blocks, and deletes instructions mainly by means of constant folding. Guard code is inserted to fall back to the original function when the special case does not hold. If not enough instructions are deleted, the specialization for that function is discarded. A better approach would be to try to optimize for a different argument, but Spector currently does not do that.
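
Since Spector operates on machine code, the following C fragment is only a source-level picture of the transformation, with hypothetical names; it shows known-value propagation making branches constant, the unreachable arms being removed, the remaining code folding, and the guard falling back to the original function:

    /* Original, fully general function. */
    int scale(int mode, int x)
    {
        if (mode == 0)
            return x;
        else if (mode == 1)
            return 2 * x;
        else
            return mode * x + 1;
    }

    /* Specialized for the frequent value mode == 1: with mode known,
       both comparisons are constant, the other arms are unreachable,
       and what remains folds to a single multiply. */
    static int scale_mode1(int x)
    {
        return 2 * x;
    }

    /* Guard inserted at the entry point: take the specialized path when
       the special case holds, otherwise fall back to the original code. */
    int scale_entry(int mode, int x)
    {
        if (mode == 1)
            return scale_mode1(x);
        return scale(mode, x);
    }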

Spector uses Atom to profile, instrument, and navigate the binary. Ideally, L'Atom (a Linux port and extension of Atom that supports arbitrary modifications) would be used to rewrite the binary, but it was not available when Spector was built. The temporary solution was to use the application's source code (which will not be required once L'Atom is available) to produce an assembly listing. Since there is a one-to-one correspondence between machine instructions in the binary file and assembly instructions in the assembly listing, an ed script line is generated for each modification. Perl code glues everything together.

Due to a bug in the compiler, this approach did not work for some programs, especially large ones. By the time a workaround was found, it was too late to try it.
 

3. Results

The table below shows the results for 6 small benchmark programs. The column "%" shows the percentage of cycles spent in the hottest specialized function. The column "Deleted" shows how many instructions were deleted from that function, out of the total number of instructions in the original version of the function.
 
 

Program        %     Deleted   Speedup (%)
------------------------------------------
fib          100.0    5/29       -11.1
hanoi        100.0   11/53        -1.5
linpack       87.0    6/57         0.0
sim           15.5    4/857       +1.0
fft           56.0    2/193       +1.8
dhrystone     10.8    7/53        +3.7
 


The specialized functions in fib and hanoi are recursive, and Spector specialized them for the base case. As a consequence, the overhead of the guard code is paid on every call for no benefit. This problem is easy to solve.
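
The following source-level sketch (hypothetical; the base-case value and the names are assumptions) shows why such a specialization cannot pay off: the guard merely duplicates a test the original body performs anyway, yet its cost is incurred on every call, including every recursive one:

    /* Original recursive function. */
    int fib(int n)
    {
        if (n < 2)
            return n;
        return fib(n - 1) + fib(n - 2);
    }

    /* Rough picture of the rewritten entry point after specializing for
       the base-case value n == 1: the specialized path saves essentially
       nothing, while the guard comparison is added to every call. */
    int fib_entry(int n)
    {
        if (n == 1)      /* guard for the special case         */
            return 1;    /* residual specialized code          */
        return fib(n);   /* fall back to the original function */
    }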

The noise in the experiments was negligible; therefore, even a 1% speedup, small as it is, represents a real improvement, because it is clearly above the noise.
 

4. Conclusions and future work

More experimentation is required to better assess the potential of specialization and to discover formulas or heuristics that predict whether a given specialization will yield a speedup or a slowdown. The table above shows that the number of instructions deleted is, by itself, useless as a predictor.
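
One plausible starting point (an illustration added here, not a model derived from the measurements) is a first-order comparison of the cycles saved on calls that hit the special case against the guard overhead paid on every call:

    /* Illustrative profitability test; hit_rate, cycles_saved_per_hit,
       and guard_cycles would have to come from profiling and from a
       machine model. */
    int predict_speedup(double hit_rate,             /* fraction of calls hitting the special case */
                        double cycles_saved_per_hit, /* cycles removed from a specialized call     */
                        double guard_cycles)         /* guard cycles added to every call           */
    {
        return hit_rate * cycles_saved_per_hit > guard_cycles;
    }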

Many improvements can be done, including the following:

- There is a long list of special cases that would allow more instruction deletions but are not currently detected.

- Code modification introduces new optimization opportunities. A final pass with standard optimizations such as instruction rescheduling and dead code removal would most likely pay off.

- Polyvariant specialization: specialize for more than one argument per function, or for more than one value per argument.

- Also specialize the original code for the complement of the special case, since it is never executed when the special case holds.

- Handle floating-point operations, which were not considered at this time.

- Ignoring function boundaries and specializing sets of basic blocks instead would increase specialization opportunities.