next up previous
Next: The Computational Context

Cache Miss Characterization and Data Locality Optimization for Imperfectly Nested Loops on Shared Memory Multiprocessors

Swarup Kumar Sahoo Rajkiran Panuganti Sriram Krishnamoorthy
P. Sadayappan
Parallel Compilers Laboratory

Dept. of Computer Science and Engineering
The Ohio State University
395 Dreese Lab., 2015 Neil Ave.
Columbus, OH 43210-1277, USA
{sahoo, panugant,krishnsr, saday}@cse.ohio-state.edu

Abstract:

This paper develops an algorithm to accurately characterize the number of cache misses for a class of compute-intensive calculations encountered in accurate quantum chemistry models of electronic structure. The proposed approach can handle imperfectly nested loop structures, symbolic loop bounds, and non-constant dependences for a constrained class of array references. It is proposed in the context of tensor contraction computations, and extends the work by Almasi et al.
citealmasi2003 and Cascaval et. al.
citecascaval2003. We illustrate the application of the approach for determination of effective tile sizes and parallelization on shared-memory parallel systems.

Introduction

The increasing gap between processor and memory performance makes data locality optimization crucially important for high-performance computing. Although considerable work has been performed on modeling cache behavior [7,12,13,21] and on frameworks for loop transformations [4,2,14,20], the use of accurate cache models that can be effectively used by compilers in performing loop transformations remains a difficult problem.

In this paper, we develop an approach to estimate the cache miss cost for a program with imperfectly-nested loops and non-constant dependences, based on the concept of stack distances. The iteration space is partitioned into sets such that all the iteration points in each set have the same stack distance. The total cache miss cost is then computed based on the knowledge of the partitioning, the stack distance for each set in the partition and the cache size. Our work is an extension of the work on stack distances by Almasi et al.
citealmasi2003 and Cascaval et al. [6]. Although their approach can handle affine array references, it is restricted to perfectly nested loops and constant dependences.

We illustrate the application of our approach to solving two important practical problems. First, we develop an effective search algorithm for determination of optimal tile sizes for imperfectly-nested loop code representative of a computational domain of particular interest to us - electronic structure models from quantum chemistry. The cache miss cost for the tiled program is modeled using our method. The property of the cache miss costs resulting from the computation is used to devise a search strategy that effectively prunes the search space to consider only the most promising tile sizes. Secondly, that code is parallelized to run on an SMP machine.

The paper is organized as follows. Section 2 discusses the context of this work, the tensor contraction engine (TCE). The prevalent models for estimating cache behavior are discussed in Section 3. Section 4 discusses the stack distance approach for cache miss estimation. Section 5 gives details of our algorithm. In Section 6, we describe an algorithm for tile size determination using our approach to stack distance characterization. Section 7 describes how our approach can be used for optimizing parallel performance on shared-memory multiprocessors. Finally, Section 8 concludes the paper.




next up previous
Next: The Computational Context
rajkiran panuganti 2005-05-12