Performance Evaluation of the Piranha Memory Hierarchy

Julien Sebot
Universite Paris-Sud


 
 
1 Introduction

This summer, I explored how to improve TLB performance and the second-level cache fill policy in Piranha. Piranha is an 8-CPU on-chip multiprocessor targeted at database applications. The goal of the project is to achieve twice the On-Line Transaction Processing (OLTP) performance of contemporary processor designs, in half the design time and with one tenth of the engineering effort.

The Piranha system will be fully synthesized in an ASIC process. The Piranha processing node will include eight simple single-issue Alpha cores running at 400MHz, per-core 32kB direct-mapped first-level instruction and data caches, a shared 1MB 8-way set-associative second-level cache, memory controllers, and an interconnect subsystem that connects processing nodes together.
 

2 Benchmarks

I used SimOS-Alpha and scaled-down TPC benchmarks for the performance evaluation. SimOS is a full-system simulation tool that models hardware in enough detail to boot an operating system. SimOS integrates several processor and memory-system simulators, which I used and improved. The TPC benchmarks we used are the OLTP benchmarks TPC-B and TPC-C. These are standard benchmarks that model the transaction-processing activity of banks (TPC-B) and wholesale suppliers (TPC-C). On an architecture like Piranha, these programs spend over 45% of their execution time in the memory system.
 

3 Evaluation

One aspect of the memory system that has a significant impact on performance is the address translation cache (TLB). The design constraints of the ASIC process to which Piranha is targeted prevent us from implementing a traditional fully-associative TLB. We have studied the impact of limited associativity on TLB performance, and concluded that a 256-entry, 4-way set-associative TLB performs 4% better than a 64-entry fully-associative TLB, for the scaled-down benchmarks at our disposal. This result is not definitive, given that TLB performance is affected by the scaled-down nature of our benchmarks.
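The trade-off between capacity and associativity above can be explored with a simple trace-driven model. The following sketch (illustrative only; the replacement policy and the synthetic page trace are assumptions, not Piranha's actual TLB design or the TPC workloads) simulates a set-associative TLB with LRU replacement; a fully-associative TLB is the special case where the way count equals the entry count.

```python
from collections import OrderedDict

class SetAssocTLB:
    """Minimal set-associative TLB model with LRU replacement (a sketch
    for intuition only; Piranha's actual TLB details are not modeled)."""
    def __init__(self, entries, ways):
        self.sets = entries // ways
        self.ways = ways
        # One LRU-ordered tag dictionary per set.
        self.table = [OrderedDict() for _ in range(self.sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, vpn):
        """Look up a virtual page number; track hits and misses."""
        self.accesses += 1
        s = self.table[vpn % self.sets]
        tag = vpn // self.sets
        if tag in s:
            s.move_to_end(tag)        # hit: refresh LRU position
        else:
            self.misses += 1
            if len(s) >= self.ways:   # set full: evict the LRU way
                s.popitem(last=False)
            s[tag] = True

# Compare a 256-entry 4-way TLB against a 64-entry fully-associative
# one on a synthetic strided trace over a 200-page working set: the
# larger working set fits in the former but thrashes the latter.
trace = [(i * 7) % 200 for i in range(10_000)]
for name, tlb in (("256-entry 4-way", SetAssocTLB(256, 4)),
                  ("64-entry fully-assoc", SetAssocTLB(64, 64))):
    for vpn in trace:
        tlb.access(vpn)
    print(f"{name}: miss rate {tlb.misses / tlb.accesses:.3f}")
```

On cyclic traces larger than the fully-associative TLB, LRU evicts each page just before its reuse, so capacity can matter more than full associativity, which matches the intuition behind the result above.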

Another important area of system design that strongly affects memory system performance is the second-level cache. In Piranha, the combined first-level cache size is 512kB, and the second-level cache is 1MB. The Piranha team has chosen to implement a non-inclusive, shared, second-level cache (shared victim cache) to avoid wasting space in the second-level cache. We have determined that the performance impact of this choice ranges from 5% to 9%, and that the performance gains over a standard inclusive policy become negligible for second-level cache sizes of 2MB or greater. The intuition behind these results is that, even when inclusion is not enforced by hardware, in practice a line will often be present in both the L2 and one or more L1s. For larger caches, the penalty of enforcing inclusion decreases as the fraction of replicated (L1) data in the L2 is reduced.
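A back-of-envelope calculation makes the diminishing return concrete. The sketch below (an assumption-laden simplification: it treats strict inclusion as duplicating every L1 line in the L2, and the non-inclusive case as holding fully disjoint data) computes the best-case unique capacity of each organization for Piranha's 512kB of combined L1 capacity.

```python
def effective_capacity_kb(l1_total_kb, l2_kb, inclusive):
    """Best-case unique cache capacity in kB (simplifying assumption:
    inclusion duplicates all L1 contents in the L2; a non-inclusive
    hierarchy can hold disjoint data in L1s and L2)."""
    return l2_kb if inclusive else l1_total_kb + l2_kb

# Capacity advantage of the non-inclusive organization shrinks as the
# L2 grows relative to the fixed 512kB of combined L1 capacity.
for l2_kb in (1024, 2048, 4096):
    gain = (effective_capacity_kb(512, l2_kb, False)
            / effective_capacity_kb(512, l2_kb, True)) - 1
    print(f"L2 {l2_kb}kB: non-inclusive capacity advantage ~{gain:.0%}")
```

With a 1MB L2 the non-inclusive organization offers up to 50% more unique capacity, but only 25% at 2MB and 12.5% at 4MB, consistent with the gains becoming negligible for larger second-level caches.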

In a non-inclusive cache hierarchy, the second-level cache is responsible for deciding when an L1 cache has to write back into the L2 (i.e., the L2 fill policy). We have evaluated the performance of Piranha's current fill policy with respect to two potentially "ideal" fill policies: one that is very eager and one that is as lazy as possible in sending write backs to the L2. Both the eager and lazy policies are infeasible to implement, since the amount of L2 state that would need to be inspected would cause extra delays in satisfying processor requests. The comparison therefore aims only at determining how much our current (implementable) policy deviates from these ideal cases. The results show that the three policies never differ by more than 3% in performance, which further corroborates the effectiveness of the scheme currently being implemented in Piranha.
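The contrast between the policies can be sketched as a decision rule applied when a line is evicted from an L1. The rules below are toy formulations for illustration, not Piranha's actual implementation: "eager" installs every L1 victim in the L2, while "lazy" keeps only data that would otherwise be lost; the "middle" rule is a hypothetical implementable compromise.

```python
from dataclasses import dataclass

@dataclass
class Line:
    """An L1 victim line (toy model: only the dirty bit matters here)."""
    addr: int
    dirty: bool

def should_fill_l2(line, policy, l2_has_space, other_l1s_have_line):
    """Decide whether an evicted L1 line is written into the L2.
    The policies are illustrative assumptions, not Piranha's design."""
    if policy == "eager":
        return True  # install every L1 victim, maximizing L2 contents
    if policy == "lazy":
        # install only data that would otherwise be lost entirely
        return line.dirty and not other_l1s_have_line
    # "middle": a hypothetical implementable compromise that avoids
    # inspecting global state, filling when dirty or when space is free
    return line.dirty or l2_has_space
```

Both extremes need global knowledge (the lazy rule must know whether any other L1 holds the line; a realistic eager rule must find or make room in the L2 on every eviction), which is why inspecting that much L2 state on the critical path is infeasible, and why a simpler implementable policy landing within 3% of both extremes is a strong result.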