TR-09-5.pdf

"Soft-OLP: improving hardware cache performance through software-controlled
object-level partitioning"

Qingda Lu, Jiang Lin, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and 
P. Sadayappan 

Proceedings of the 18th International Conference on Parallel Architectures and
Compilation Techniques (PACT 2009), Raleigh, North Carolina, September 12-16, 
2009. 


Abstract

Performance degradation of memory-intensive
programs caused by the LRU policy's inability to handle weak-locality
data accesses in the last level cache is increasingly
serious for two reasons. First, the last-level cache remains in
the CPU's critical path, where only simple management mechanisms,
such as LRU, can be used, precluding some sophisticated
hardware mechanisms to address the problem. Second, the
commonly used shared cache structure of multi-core processors
has made this critical path even more performance-sensitive
due to intensive inter-thread contention for shared cache
resources. Researchers have recently made efforts to address
the problem with the LRU policy by partitioning the cache
using hardware or OS facilities guided by run-time locality
information. Such approaches often rely on special hardware
support or lack sufficient accuracy. In contrast, for a large
class of programs, the locality information can be accurately
predicted if access patterns are recognized through small
training runs at the data object level.

To achieve this goal, we present a system-software framework
referred to as Soft-OLP (Software-based Object-Level
cache Partitioning). We first collect per-object reuse distance
histograms and inter-object interference histograms via
memory-trace sampling. With several low-cost training runs,
we are able to determine the locality patterns of data objects.
For the actual runs, we categorize data objects into different
locality types and partition the cache space among data objects
with a heuristic algorithm, in order to reduce cache misses
through segregation of contending objects. The object-level
cache partitioning framework has been implemented with a
modified Linux kernel, and tested on a commodity multi-core
processor. Experimental results show that in comparison with
a standard L2 cache managed by LRU, Soft-OLP significantly
reduces the execution time by reducing L2 cache misses across
inputs for a set of single- and multi-threaded programs from
the SPEC CPU2000 benchmark suite, NAS benchmarks and a
computational kernel set.
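
To make the notion of a per-object reuse distance histogram concrete, the
following is a minimal sketch in Python, not the Soft-OLP implementation. It
assumes a trace of (object name, cache block id) pairs and measures each
object's reuse distances over that object's own accesses only, as if the
object had a private partition. It also assumes a fully associative LRU
model when estimating misses. The function names and the trace format are
hypothetical and chosen for illustration.

```python
from collections import defaultdict, OrderedDict


def per_object_reuse_histograms(trace):
    """Build one reuse distance histogram per data object.

    trace: iterable of (object_name, block_id) pairs in access order
           (hypothetical format, assumed for this sketch).
    Returns {object_name: {distance: count}}. The distance is the number of
    distinct blocks of the same object touched since the previous access to
    this block; None marks a first-time (cold) access.
    """
    stacks = defaultdict(OrderedDict)  # per-object LRU stack, most recent last
    hists = defaultdict(lambda: defaultdict(int))
    for obj, block in trace:
        stack = stacks[obj]
        if block in stack:
            keys = list(stack)
            # Blocks above this one in the LRU stack were touched since its last use.
            distance = len(keys) - 1 - keys.index(block)
            del stack[block]
        else:
            distance = None
        stack[block] = True  # move (or push) the block to the top of the stack
        hists[obj][distance] += 1
    return {obj: dict(h) for obj, h in hists.items()}


def misses_for_capacity(hist, capacity):
    """Estimate misses if the object had `capacity` blocks of LRU-managed cache.

    Under the fully associative LRU assumption, an access hits exactly when
    its reuse distance is smaller than the capacity; cold accesses always miss.
    """
    return sum(count for dist, count in hist.items()
               if dist is None or dist >= capacity)


if __name__ == "__main__":
    trace = [("A", 0), ("A", 1), ("A", 0), ("B", 7), ("B", 7), ("A", 1)]
    hists = per_object_reuse_histograms(trace)
    for name, hist in hists.items():
        print(name, hist, [misses_for_capacity(hist, c) for c in (1, 2)])
```

A partitioning heuristic could compare such per-object miss estimates across
candidate partition sizes. How Soft-OLP combines them with the inter-object
interference histograms is not specified in this abstract, so that step is
left out of the sketch.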