Swarup Kumar Sahoo Rajkiran Panuganti Sriram Krishnamoorthy
P. Sadayappan
Parallel Compilers Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
395 Dreese Lab., 2015 Neil Ave.
Columbus, OH 43210-1277, USA
{sahoo, panugant,krishnsr, saday}@cse.ohio-state.edu
The increasing gap between processor and memory performance makes data locality optimization crucially important for high-performance computing. Although considerable work has been performed on modeling cache behavior [7,12,13,21] and on frameworks for loop transformations [4,2,14,20], the use of accurate cache models that can be effectively used by compilers in performing loop transformations remains a difficult problem.
In this paper, we develop an approach to estimate the cache miss cost
for a program with imperfectly-nested loops and non-constant
dependences, based on the concept of stack distances. The
iteration space is partitioned into sets such
that all the iteration points in each set have the same stack
distance. The total cache miss cost is then computed based on
the knowledge of the partitioning, the stack distance for each set in
the partition and the cache size.
Our work is an extension of the work on stack distances
by Almasi et al.
citealmasi2003 and Cascaval et al. [6]. Although their approach can handle affine array references,
it is restricted to perfectly nested loops and constant dependences.
We illustrate the application of our approach to solving two important practical problems. First, we develop an effective search algorithm for determination of optimal tile sizes for imperfectly-nested loop code representative of a computational domain of particular interest to us - electronic structure models from quantum chemistry. The cache miss cost for the tiled program is modeled using our method. The property of the cache miss costs resulting from the computation is used to devise a search strategy that effectively prunes the search space to consider only the most promising tile sizes. Secondly, that code is parallelized to run on an SMP machine.
The paper is organized as follows. Section 2 discusses the context of this work, the tensor contraction engine (TCE). The prevalent models for estimating cache behavior are discussed in Section 3. Section 4 discusses the stack distance approach for cache miss estimation. Section 5 gives details of our algorithm. In Section 6, we describe an algorithm for tile size determination using our approach to stack distance characterization. Section 7 describes how our approach can be used for optimizing parallel performance on shared-memory multiprocessors. Finally, Section 8 concludes the paper.