next up previous
Next: Optimizing Parallel Performance on Up: Cache Miss Characterization and Previous: Cache Miss Characterization and


The Computational Context

\begin{codespace}
% latex2html id marker 2579\centering\begin{minipage}{1.6in}...
...ments. Loops
$i$\ and $n$\ are fused to reduce $T$\ to a scalar.}\end{codespace}

The optimization presented in this paper has been developed in the context of the Tensor Contraction Engine (TCE) [5,10], a domain-specific compiler for ab initio quantum chemistry calculations. The TCE takes as input a high-level specification of a computation expressed as a set of tensor contraction expressions and transforms it into efficient parallel code. Several compile-time optimizations are incorporated into the TCE: algebraic transformations to minimize operation counts [18,19], loop fusion to reduce memory requirements [15,17,16], space-time trade-off optimization [8], communication minimization [9], and data locality optimization [10,11] of memory-to-cache traffic.

A tensor contraction expression is comprised of a collection of multi-dimensional summations of the product of several input arrays. As an example, consider the following contraction, used often in quantum chemistry calculations to transform a set of two-electron integrals from an atomic orbital (AO) basis to a molecular orbital (MO) basis:

$B(a,b,c,d)= \sum_{p,q,r,s} C1(s,d) \times C2(r,c) \times
C3(q,b) \times C4(p,a) \times A(p,q,r,s)$
This contraction is referred to as a four-index transform. Here, $A(p,q,r,s)$ is a four-dimensional input array and $B(a,b,c,d)$ is the transformed output array. The arrays $C1$ through $C4$ are called transformation matrices. In practice, these four arrays are identical; we identify them by different names in order to be able to distinguish them in the text.

The indices $p$, $q$, $r$, and $s$ have the same range $N$, denoting the total number of orbitals, which is equal to $O +
V$. $O$ denotes the number of occupied orbitals and $V$ denotes the number of unoccupied (virtual) orbitals. Likewise, the index ranges for $a$, $b$, $c$, and $d$ are the same, and equal to $V$. Typical values for $O$ range from 10 to 300; the number of virtual orbitals $V$ is usually between 50 and 1000.

The calculation of $B$ is done in the following four steps to reduce the number of floating point operations from $O(V^4 N^4)$ in the initial formula (8 nested loops, for $p$, $q$, $r$, $s$, $a$, $b$, $c$, and $d$) to $O(V N^4)$:

\begin{figure}{\small\begin{displaymath}
\begin{array}{l}
B(a,b,c,d) =\\
\sum_{...
...es A(p,q,r,s) \right) \right) \right)
\end{array}\end{displaymath}}
\end{figure}

This operation-minimization transformation results in the creation of three intermediate arrays:

\begin{eqnarray*}
T1(a,q,r,s) &=& \sum_{p} C4(p,a) \times A(p,q,r,s) \\
T2(a,...
...,q,r,s) \\
T3(a,b,c,s) &=& \sum_{r} C2(r,c) \times T2(a,b,r,s)
\end{eqnarray*}



For illustration purposes, we focus on the following contraction (a two-index transform):

\begin{displaymath}
B(m,n)=
\sum_{i,j} C1(m,i) \times C2(n,j)
\times A(i,j)
\end{displaymath}

The operation minimal form of the two-index transform and the corresponding intermediate array are as follows:

\begin{eqnarray*}
B(m,n) &=& \sum_{i} C1(m,i) \times (\sum_{j} C2(n,j) \times A(i,j) ) \\
T(n,i) &=& \sum_{j} C2(n,j) \times A(i,j)
\end{eqnarray*}



In general, the intermediate arrays can be too large to fit into main memory. Fig. 1 shows the computation of the array $B$ and illustrates how memory requirements for the computation of $B$ may be reduced using loop fusion. In Fig. 1(b), each loop nest is abbreviated as a single ``For'' loop with a sequence of indices. Fig. 1(c) illustrates the result of loop fusion. Note that all loops in each of the two loop nests in Fig. 1(a) are fully permutable and there are no fusion-preventing dependences between the loops. Hence, the common loops $i$ and $n$, shown underlined, can be fused. After loop fusion, the storage requirements for $T$ can be reduced, because there is no longer a need for an explicit dimension of $T$ corresponding to any loop indices that are fused between the producer of $T$ and the consumer of $T$ -- storage elements can be reused over sequential iterations of fused loops. In this example, $T$ can be contracted to a scalar as shown in Fig. 1(c).

The output of fusion step on the TCE synthesis process is thus a sequence of imperfectly nested loops with potentially symbolic loop bounds. Note that the array reference expressions are individual loop indices. Non-constant dependences result from the absence of a loop index in a array reference. In this paper, we evaluate the cache miss cost for this class of computations. The application of our model to tiling and parallelization are also done in this context.

Models of Cache Behavior

In this section, we shall discuss some of the prevalent cache behavior models and compare their characteristics.

One of the most common models of cache behavior is reuse distance. Reuse distance refers to the number of iterations between two successive references to the same element of an array. If the number of intervening iterations is small, there is a high likelihood that the element will be retained in cache between the references, i.e. the reuse in the program translates into locality of reference. Thus, the smaller the reuse distance, the greater the expected cache locality, and the lower the cache miss cost. Computing reuse distances and guiding loop transformations based on it has been addressed [1]. But improvements in reuse distance may not necessarily translate to improvements in cache miss cost.

Capacity misses have been used as a measure of modeling cache behavior in the context of a domain-specific compiler for determining optimal tile sizes [10]. It involves determining the total number of distinct memory accesses within a set of loop iterations. The total capacity miss cost is evaluated as the number of iterations of a loop nest within which the number of distinct memory locations accessed exceeds the cache size. The imperfectly nested loop case is handled similarly. The program is transformed such that the program reuse is moved within the innermost loops, thus reducing the number of capacity misses. This metric does not consider interferences between array references. Also, though the total number of memory locations accessed may exceed the cache size, some of the array references might still exhibit good locality.

One of the more accurate methods for analyzing cache behavior is based the concept of stack distances [6]. Stack distance refers to the number of distinct memory accesses between two consecutive references to the same location. It captures the memory access pattern of a program, including interference between array references, from a memory trace. A compile-time characterization of the stack distances in a program can directly be used to determine the number of cache misses, assuming of a fully associative cache with LRU replacement policy.

Determining the stack distances for an arbitrary program is extremely difficult. In general, the program might have to be executed to obtain the memory trace. But, for programs that have regular access patterns, the stack distances can potentially be estimated at compile-time, providing an accurate measure of the cache miss cost. We base our model on the stack distances approach.

Though there has been work on more accurate analysis of cache miss cost, taking into account associativity and the resulting conflict misses [7,12,13], they have not been amenable to use in a model-driven optimizing compiler.

Illustration of Cache Miss Prediction using Stack Distance Computation We are using a stack distance based approach for cache miss prediction. To predict cache misses at compile time, we need to compute the stack distances for each of the references. We split the iteration space of each reference into several components to effectively compute the stack distances. In each of the these components, all reference instances will have same incoming dependences and hence will have the same stack distance. So we need to compute the stack distance for each of the components of each reference separately. The number of distinct memory locations accessed between the source and target of the reuse will give the stack distance of each component. After we determine the stack distances, it is straight forward to calculate the number of cache misses. All the references for which stack distance is greater than cache size will be cache misses and all other references will be cache hits. These stack distance expressions can also be used to predict good tile sizes. Our algorithm can handle symbolic loop bounds and imperfectly nested loops where array index expressions are loop indices. We shall first describe our overall algorithm using the matrix multiplication code fragment in the next section. Matrix Multiplication Example In this section, we present the example of matrix multiplication to illustrate the working of our overall algorithm. figure*[tbh]



minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
for i $1 \rightarrow \frac{N}{T_i}$
for j $1 \rightarrow \frac{N}{T_j}$
for k $1 \rightarrow \frac{N}{T_k}$
for $i_T 1\rightarrow T_i $
for $j_T 1\rightarrow T_j $
for $k_T 1\rightarrow T_k $
$A[(i*T_i+ i_T)][(j*T_j+ j_T)] += $
$B[(i*T_i+i_T)][(k*T_k+ k_T)] * C[(k*T_k+ k_T)][(j*T_j+ j_T)]; $
end for
end for
end for
end for
end for
end for

Consider the tiled matrix multiplication code as shown in Figure  2. First we identify all the reuses and partition the iteration space into the components such that all instances of a particular reference in it have same stack distance. We then compute stack distance i.e. the number of distinct array references between the source and the target for each reuse.

Once all the reuses are identified, our algorithm proceeds as follows:
Step 1: Partitioning the iteration Space: In this case, the iteration space can be partitioned into nine elementary partitions. In three of the elementary components, the instances of A, B, and C respectively have no incoming dependence i.e. they have no reuse. The instances corresponding to $\vec{I}$ = (a,b,1,c,d,1) of A, $\vec{I}$ = (a,1,b,c,1,d) of B and $\vec{I}$ = (1,a,b,1,c,d) of C, where a,b,c,d are any indices within the limits of the corresponding loop bounds, have no reuses. In the other six elementary components, the reuses are present. The source and the target iteration vectors for these partitions are shown in the Table  1. Step 2: Compute the Stack Distance for each component of each reference : In our example, there are six reuses. We determine the dependence span (the iterations starting from source to target of the reuse) for each of these reuses and identify the array elements accessed in those iterations. Let cost of an array with respect to a reuse be the number of distinct memory locations of that array accessed from source iteration vector to the target iteration vector of the reuse. Now the algorithm computes the cost of each array. The algorithm then computes the sum of costs of each array accessed by the iterations within the dependence span of the reuse. The number of array elements accessed is a function of the source and target iteration of the reuse. For this example, the source and target of all the reuses are shown in the Table 1. Note that the these are computed symbolically and it represents all instances of the array reference targeted by the dependence. The instances corresponding to $\vec{I}$ = (a,b,1,c,d,1) of A, $\vec{I}$ = (a,1,b,c,1,d) of B and $\vec{I}$ = (1,a,b,1,c,d) of C, where a,b,c,d are any indices within the limits of the corresponding loop bounds, have no incoming dependences. Hence the stack distances for these instances would be $\infty$. For all the other reuses, we calculate the cost for each array in the dependence span and then sum these costs for all arrays that are accessed between the reuse. Notice that the cost for each array is computed symbolically. The stack distances computed for this example is shown in table  1. Now using the stack distance expressions, we will evaluate the optimal tile size as described in the section  6.

Table 1: Stack Distance Computation for Matrix Multiplication
{ Source $\rightarrow$ Target } #references Stack Distance
{((a,b,1,d,e,1)} $N^2$ $\infty$
{(a,b,c,d,e,x)$\rightarrow$(a,b,c,d,e,x+1)} $N^3 - \frac{N^3}{N_k} $ 3
{(a,b,x,d,e,N)$\rightarrow$(a,b,x+1,d,e,1)} $N^2*(\frac{N}{N_k} - 1)$ A: $N_i$*$N_j$ ,B: $N_k$+ $N_i$*$N_k$
{ (a,1,b,c,1,d)} $N^2$ $\infty$
{(a,b,c,d,x,e)$\rightarrow$(a,b,c,d,x+1,e)} $N^3 - \frac{N^3}{N_j} $ A:2, B:$N_k$, C:$N_k$
{(a,x,c,d,N,e)$\rightarrow$(a,x+1,c,d,1,e)} $N^2*(\frac{N}{N_j} - 1)$ A:2*$N_i$*$N_j$, B:N*$N_i$, C:$N_j$*$N_k$+N*$N_j$
{(1,a,b,1,c,d)} $N^2$ $\infty$
{(a,b,c,x,d,e)$\rightarrow$(a,b,c,x+1,d,e)} $N^3 - \frac{N^3}{N_i} $ A:$N_j$+2,B:2*$N_k$,C:$N_j$*$N_k$
{(x,b,c,N,d,e)$\rightarrow$(x+1,b,c,1,d,e)} $N^2*(\frac{N}{N_i} - 1)$ A:N*$N_i$, B:$N_i$*$N_j$,C:N*N


Algorithm Description In this section, we formally describe our approach to partition the iteration space and estimate the number of cache misses using the stack distance approach.

Partitioning the Iteration Space figure*[tbh]



minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
Function Partition(node, ref, index, current_child_no)
{
if ( node has a left sibling )
{
let n be the the immediate left sibling of the current node.
m = rightmost leaf node of the subtree rooted at n.
source = m.Ref
target = ref
ComputeIterations(source,target,index,left)
output((source.StmtNo,source.Iters)$\rightarrow$(target.StmtNo, target.Iters)
Partition(m, m.Ref, m.Ref.LoopDepth, 0)
return
}
else {
if (ref.Appears[index] = false)
let m = rightmost leaf node of the subtree rooted at node.
source = m.Ref
target = ref
ComputeIterations(source,target,index,right)
output((source.StmtNo,source.Iters)$\rightarrow$(target.StmtNo,target.Iters)
}
if (n != "Root" ) Partition(node.Parent, ref, index-1,node.SeqNo)
else
{
output( All iteration instances of ref with non-appearing indices as 1)
// Note these instances will have no reuse.
//Stack distances corresponding to these iteration instances will be $\infty$
}
}


Function ComputeIterations(ref1, ref2, index, dir)
{
ref1.Iters[index] = "x"
if(dir = left) ref2.Iters[index] = "x"
else ref2.Iters[index] = "x+1"

for i = 1$\rightarrow$ index-1
ref1.Iters[i] = ref2.Iters[i] = " $a_{i}$ "
end

for i = index+1 $\rightarrow$ ref1.LoopDepth
if (Appears[i]) ref1.Iters[i] = " $a_{i}$ "
else ref1.Iters[i] = " $ N_{i} $ "
end

for i = index+1 $\rightarrow$ ref2.LoopDepth
if (Appears[i]) ref2.Iters[i] = " $a_{i}$ "
else ref2.Iters[i] = "1"
end
}

We first partition the iteration space into elementary components such that all reference instances in the partition have the same stack distance. To compute all the partitions, first we create a tree representing the complete loop structure where each node is a tuple as described below.
A node of a tree is a tuple of five elements: node
$<$ (NoChild, Children[size], Parent, SeqNo,Ref) $>$
where, NoChild represents the number of children of the current node,Children is an array of pointers to the child nodes, Parent is a pointer to the parent node, SeqNo is the index of the current node among the siblings, Ref is an array of the reference tuples, each consisting of four elements described below. This element of the tuple is defined only for the leaf nodes. The tuple reference consists of four elements: Reference
$<$ (StmtNo, ArrayName, Iters[n], Appears[n], LoopDepth) $>$
where, StmtNo is the sequence number of the statement containing the reference, ArrayName is the name of the array which is being referenced, Iters is array of the indices of all the enclosing loops, Appears is a boolean array indicating if the particular index appears in the array reference, and LoopDepth is the number of enclosing loops.
We then create a tree representing the loop structure for each array by deleting all the loop nests which does not contain any reference to that array. The StmtNo will however be the same as that of the original loop for identification. We formulate an algorithm to partition the iteration space such that all reference instances of an array in a partition will have same incoming dependences. The function outputs the source and target statement numbers and their iteration instances corresponding to each partition.

The partitioning algorithm works as follows. First we begin with the last reference in the loop nest. We then traverse upward in the tree level by level. At each level, if the reuse is from the same reference then the algorithm outputs the required iteration instances. Otherwise if the reuse is from a reference in some other statement, it computes the corresponding iteration instances. This is done for all the references in the imperfectly nested loop nest in a recursive fashion. The details are provided in the algorithm shown in the Figure  3. The algorithm begins with a call Partition (n, n.Ref, n.Ref.LoopDepth, 0), where n is the rightmost leaf node and outputs the partitions. Cache Misses Estimation Using Stack Distances figure*[tbh]



minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
Function StackDist( Array A, $<S,\vec{i}>, <S,\vec{j}>$ )
{
if (All loop indices appear in the reference to array A)
{
Cost = $(\sum_{k=1}^{n}(j_k - i_k)* (\prod_{l=1}^{k-1} N_l ) )$
return
}

let m be the first non-appearing loop index in reference to array A,
let the iteration vector, $\vec{i}$ be represented as ( $\vec{i_{prefix}},i_m,\vec{i_{postfix}}$)

if ( $ \vec{i_{prefix}} < \vec{j_{prefix}} $)
{
Cost = StackDist(A, $<S,(i_m , \vec{i_{postfix}})>,<S,\vec{N}>$)
+ StackDist(A, $ <S,(\vec{i_{prefix}}+1,1,...1)> , <S,(\vec{j_{prefix}}-1,N,...,N)>$)
+ StackDist(A, $ <S, \vec{1}>,<S,(j_m,\vec{j_{postfix}})> $)
}

if ( $\vec{i_{prefix}} = \vec{j_{prefix}}$ )
{
if ( $i_m < j_m - 1$ ) then Cost = cost of 1 complete iteration of m loop
else if ( $i_m = j_m$ ) then StackDist(A, $ <S,(\vec{i_{postfix}})> , <A, (\vec{j_{postfix}})>$)
else
{
if ($i_m = j_m-1$)
{
let k be the second non-appearing loop index
// Note that if no such k exists, then $\vec{i_{prefix}'} = \vec{i_{postfix}}$ and $(i_k , \vec{i_{postfix}'})$ = NIL

let $\vec{i_{postfix}}$ be represented as ( $\vec{i_{prefix}'}, i_k, \vec{i_{postfix}'} $)
and let the $\vec{j_{postfix}'}$ be represented as ( $\vec{j_{prefix}'},j_m,\vec{j_{postfix}'}$)

if ( $\vec{i_{prefix}'} > \vec{j_{prefix}'}$) then
{
Cost = StackDist( A, $<S,\vec{i_{postfix}}>, <S,\vec{N}> $)
+ StackDist(A, $<S,\vec{j_{postfix}}>, <S,\vec{N}> $)
}
else
{
if ($ i_k = N $ and $j_k = 1 $) then
Cost = StackDist( $A, <S, (i_m,\vec{i_{prefix}'},\vec{i_{postfix}'})>,(<j_m, \vec{j_{prefix}'},\vec{j_{postfix}'})>$)
else Cost = cost of 1 complete iteration of m
}
}
}
}
else { Cost = 0 }
}

figure*[tbh]


minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
Function StackDistImperfect( A, $<S_1,\vec{i}>,<S_2, \vec{j}>$ )
{

let $\vec{i_{com}},\vec{j_{com}}$ represent the set of indices common to $\vec{i}$ and $\vec{j} $
let $\vec{i_c}$ and $\vec{j_c}$ represent the last index in the common indices set.
let $\vec{i}$ be represented as ($\vec{i_{com}}$, $\vec{i_{uncom}}$).

if ($\vec{i_{com}}$ != $\vec{j_{com}} $)
{
Cost = $(\sum_{k=1}^{c} (j_{k} - i_{k}) (\prod_{l=1}^{k-1} N_l)) $ * cost of 1 iteration of c loop
}
else {
if ( Appear( $\vec{i_{uncom}}$) $<$ Appear( $\vec{j_{uncom}}$) ) Cost = cost of 1 iteration of c loop
else
{
Cost = StackDist(A,$<S_1$,Appear( $\vec{i_{com}})>,<S_2,\vec{N}>$)
+ StackDist(A, $<S_1,\vec{1}>$,$<S_2$, Appear( $\vec{j_{uncom}}$))
}
}
}


Function Appear($\vec{i}$) {
/* This function sets(or removes) the indices to appropriate values depending on the dependence and
whether it is a source or a target iteration vector */

foreach index j in $\vec{i}$,
if (j is a appearing index )
{ retain j }
else { /* j is non-appearing index */
if (j = 1 or j = N) then
{
remove the index from $\vec{i}$ }
else
{
(if ($\vec{i}$ corresponds to the source)
{
set rest of the appearing indices to N and remove all the non-appearing indices
return
}
else
{
set rest of the appearing indices to 1 and remove all the non-appearing indices
return
}
}
}
end foreach
}

figure[tbh]


minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
S1.
FOR mT, nT, mI, nI
S2.


B[mT+mI,nT+nI] = 0
S3.
FOR iT, nT
S4.


FOR iI, nI
S5.



T[iI,nI] = 0
S6.


FOR jT, iI, nI, jI
S7.



T[iI,nI] += A[iT+iI,jT+jI]















* C2[nT+nI,jT+jI]
S8.


FOR mT, iI, nI, mI
S9.



B[mT+mI,nT+nI] += T[iI,nI]















* C1[mT+mI,iT+iI]

Once the iteration space is partitioned into components corresponding to the same stack distance expressions, we compute the stack distances for each of the components. Let $cost$ of one iteration of a loop be defined as the total number of distinct memory locations accessed in one complete iteration of that loop. Let us first describe the algorithm to compute the cost of each array with respect to a reuse (i.e. the number of distinct memory locations of that array accessed from source iteration vector to the target iteration vector of the reuse) in case of perfectly nested loop nest. The stack distance of a reuse is the sum of the costs corresponding to all the arrays accessed between the source iteration vector and target iteration vector. If all the loop indices in the source and target iteration vector appear in the reference to the the array, then cost of the array is the number of iteration points between the source and the target. However, if some of the loop indices in the iteration vector do not appear in the array reference, we find out the first non-appearing loop index. We then consider two cases depending on whether the loop indices before this non-appearing loop index are same or not. Note that if a loop index does not appear in the array access function, then all iterations of this loop will access the same elements of the array. The details of the algorithm are presented in the Figure  4.

Now let us consider the case of reuse within an imperfectly nested loop structure between two statements, $S7$ and $S9$ as shown in the figure  6. The cost in this case is calculated using the algorithm as shown in Figure  5.

The algorithm shown in the Figure  5 needs modification to handle the cases where we have auxiliary branches in the tree in addition to the source and target branch as shown in the parse tree of Figure 7 for the reuse of array T between statements $S7$ and $S9$. Let us compute the stack distance corresponding to the reuse for the statement S from iteration $\vec{i}$ to $\vec{j} $ as shown. Let $\vec{i_{prefix}}$ be the set of common indices with the auxiliary branch. If ( $\vec{i_{prefix}} < \vec{j_{prefix}}) $ then the auxiliary branch is traversed completely. Hence, we set all the lower limits of the non-appearing uncommon indices to 1 if the branch appears after the current branch and set all the upper limits of the non-appearing uncommon indices to N if the auxiliary branch appears before the current branch. We then apply the above described algorithm to compute the stack distance.

The above algorithm computes the cost for the arrays which appear in the statements $S_1$ and $S_2$. However, if an array reference does not appear in the statements $S_1$ or $S_2$ when considering the reuse between the iteration $\vec{i}$ of $S_1$ and iteration $\vec{j} $ of $S_2$. The array however, is referenced by the statements in the auxiliary branches. We would like to compute the number of distinct references made to this array as it is included in the expression for stack distance computation. There are three ways in which the statements referencing this particular array appears with respect to the statement S, (the statement which is the target of the reuse). a) The statement referencing the array appears before the statement S, b) The statement referencing the array appears after the statement S, and c) The statement referencing the array appears both before and after the statement S. We handle these three cases as follows.

Case a) Let $\vec{i_{prefix}}$ be the set of common indices between the branch containing the current statement and the auxiliary branch corresponding to the first statement referencing the array. In this case, the source is set to ($i_{prefix}$+1,1,....,1) and the target is set to ($j_{prefix}$,N,...,N)

Case b) Let $\vec{i_{prefix}}$ be the set of common indices between the branch containing the current statement and the auxiliary branch corresponding to the last statement referencing the array. In this case, the source is set to ($i_{prefix}$,1,....,1) and the target is set to ($j_{prefix}$-1,N,...,N)

Case c) In this case, we set the source to ($i_{prefix}$,1,....,1) and ($j_{prefix}$,N,...,N), where $i_{prefix}$ and $j_{prefix}$ are defined in the case a) and b) respectively. Note that the dimension of the vector $\vec{i_{prefix}}$ and the vector $\vec{j_{prefix}}$ may be different.

With source and target as described above, we compute the cost corresponding to these arrays using the algorithm described above. The stack distance is again the sum of the costs corresponding to all the arrays accessed between the source iteration vector and the target iteration vector.

We used our algorithms to compute the stack distance expressions for tiled matrix multiplication and the two index transform at compile time. These expressions were then used to predict the number of cache misses. These results were compared against the the results obtained from the SimpleScalar Tool Set. We used fully associative caches for doing the simulations using the sim-cache cache simulator of the SimpleScalar Tool Set. The results of the cache miss prediction and simulation from sim-cache are shown in Table 2 and Table 3 and show that the model is very accurate.

Figure 7: Parse Tree for the tiled 2-index transform.
\begin{figure}\epsfig{clip=, width=0.99\linewidth, file=Figures/parse2.eps}\end{figure}


Table 2: Cache miss prediction results for the two-index transform
Loop Bounds (I,J,M,N) Tile Sizes($T_{i}$, $T_{j}$, $T_{m}$, $T_{n}$) Cache Size #Predicted misses #Actual misses
(256,256,256,256) (64,128,128,64) 256KB 589824 597634
(256,256,256,256) (64,128,128,64) 256KB 1114112 1109625
(512,512,512,512) (128,128,128,128) 256KB 6029312 6280096
(256,256,256,256) (64,64,64,64) 64KB 1638400 1693237
(256,256,256,256) (128,64,64,128) 64KB 34406400 34462209
(512,256,256,512) (128,64,64,128) 64KB 136970240 137751583



Table 3: Cache miss prediction results for matrix multiplication
Loop Bounds (N) Tile Sizes($T_{i}$, $T_{j}$, $T_{k}$) Cache Size #Predicted misses #Actual misses
(512) (32 32 32) 64KB 8650752 8645485
(512) (64 64 64) 64KB 6291456 6238845
(512) (128,128,128) 64KB 136314880 136319615
(256) (32 64 32) 16KB 1310720 1312382
(256) (64,64,64) 16KB 17301504 17303166
(256) (32 64 128) 16KB 17170432 17172096


Tile Size Search Algorithm We use an intelligent search strategy to find the best tile size without searching all the tile sizes exhaustively. Several properties of the stack distances are used to prune the tile size search space.

There are two types of reuse, inter-tile reuse where reuse occurs between iterations in two different tiles and intra-tile reuse where reuse occurs between iterations within a tile. The following two facts are used in pruning the search space:

1. The stack distances of inter-tile reuses are greater than stack distances of intra-tile reuse. Few of the inter-tile reuse between consecutive tiles may have lesser stack distances. But for simplifying the discussion we do not consider them. Actually they are used in the algorithm for good tile size generation.

2. As the tile size is increased, number of intra-tile reuses which corresponds to interior points of the tile monotonically increases and number of inter-tile reuses which corresponds to boundary points monotonically decreases. This means some of the inter-tile reuses becomes intra-tile reuses with increase in tile sizes.

Due to these two properties, teh number of cache misses follows a specific pattern as the tile size is increased. The transitions of number of cache misses can be divided into four phases.

Phase 1: There will be a few compulsory cache misses corresponding to some inter-tile stack distances which are more than the cache size for any tile size. Even if the tile sizes are very less these misses will occur. This phase corresponds to the tile sizes from minimum values to the sizes where some additional inter-tile stack distances exceeds the cache size. As the number of inter-tile reuses decreases with increase in tile size in this phase, some of the cache misses now becomes hits. Hence, In this phase number of misses will gradually decrease.

Phase 2: In this phase, tile sizes spans from tile sizes of Phase 1 till tile sizes where all the inter-tile stack distances exceeds cache size. As tile size increases in this phase, at one point some inter-tile stack distance will exceed cache size and at this tile size, number of cache misses will abruptly increase. After this number of misses will continue to decrease till some other stack distance exceeds cache size. This is again due to the fact that number of inter-tile reuses decreases with increase in tile size and thereby converting some of the misses into hits.

Phase 3: In this phase, tile sizes starts from tile sizes of Phase 2 till tile sizes where all the intra-tile stack distances exceeds cache size (Few stack distances may never exceed cache size). With increase in tile sizes in this phase, some inter-tile stack distance will exceed cache size and at this point, number of cache misses will suddenly increase. After this number of misses will continue to decrease till some other stack distance exceeds cache size. The reason being some of inter-tile reuses becomes intra-tile reuses with increase in tile size and thereby converting some of the misses into hits. But none of the hits will become misses with increase in tile size as all the inter-tile reuses will be misses in this phase.

Phase 4: This phase corresponds to the tile sizes from the previous phase till the loop bounds. Even if the tile sizes are very large, some of the accesses will always be hits as the corresponding intra-tile stack distances will be constants. But as described in earlier phases, number of misses will gradually decrease with increase in tile size.

From the above discussion we observe, when certain stack distance exceeds cache size number of cache misses abruptly increases. But as we increase tile sizes without any more stack distance exceeding the cache size, number of cache misses monotonically decreases. Thus, we need to consider tile sizes just before the point where some stack distance exceeds cache size. Other tile sizes need not be considered as there will be larger tile sizes such that no additional stack distances exceeds cache size and hence will have lower number of cache misses.

In our algorithm, first we consider different tile sizes in steps of some constant value along all dimensions. We go through all these tile sizes and select those tile sizes which cannot be increased in any dimension any further without some additional stack distance exceeding cache size. The best tile size will be very close to some of the selected tile sizes. We then do a finer search around the selected points. We collect the surrounding tiles of each selected tile and select those tile sizes which are better candidates to have lesser cache misses. We then repeat the finer search procedure using reduced step sizes for incrementing tile sizes. Finally, we remove the redundant tiles (tiles such that better tiles are already in the list) from the list.

We represent tiles of different sizes by a tuple Block:$<$TileSize, IncSize, StackDist$>$. The first member TileSize is an array of integers specifying size of the tile along each dimension. IncSize is an array of integers which gives the increment along each dimension for larger tiles. StackDist is a list of stack distances which is less than Cache Size for this tile size.

The initial set of tile sizes we consider can be represented as


\begin{displaymath}\bigodot_{d=1}^{Ndim} (1, N, Inc_d) = \end{displaymath}


\begin{displaymath}(1, N, Inc_1) \bigodot ... \bigodot (1, N, Inc_{Ndim})\end{displaymath}

where, $Inc_d$ represents tile size increment along $d^{th}$ dimension, (1, N, $Inc_d$) is the set of values from 1 till N in steps of $Inc_d$ and the operator $\bigodot$ represents the cross product of all sets.

figure*[tbh]



minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
Function TileSizeSelect( Ndim, N )
{
/* Initial Search */
Let BlkList = List of Blocks corresponding to tile sizes from the set $ \bigodot_{d=1}^{Ndim} (1, N, Inc_d) $.

Foreach blk in BlkList
flag = true
For d = 1 $\rightarrow$ Ndim
if (some Stack Distance does not exceed Cache Size after incrementing tile size along $d^{th}$ dimension)
then flag = false
end For
if( flag = true )
then add blk to NewBlkList
end Foreach

BlkList = NewblkList

/* Finer Search around selected tile sizes */
For i = 1 $\rightarrow$ STEPS
Foreach blk in BlkList
Let TempList = List of Blocks corresponding to tile sizes from the set
$ \bigodot_{d=1}^{Ndim} (Blk.TileSize[d]-Blk.IncSize[d]/2, Blk.TileSize[d]+Blk.IncSize[d]/2, Blk.IncSize[d] ) $.
/* TempList contains blocks surrounding the current block */
Foreach Newblk in TempList
For d = 1 $\rightarrow$ Ndim
NewBlk.IncSize[d] = NewBlk.IncSize[d]/2
if (No additional Stack Distances in NewBlk exceeds Cache Size than Blk and is not already selected)
then add NewBlk to NewBlkList
end Foreach
end Foreach
BlkList = BlkList $\bigcup$ NewBlkList
end For

Eliminate redundant tile sizes from BlkList
/* Tile sizes is said to be redundant if a larger tile exists in the list such that no
additional stack distance in it exceeds cache size than the original Tile size */
}

The overall algorithm for searching is shown in Figure  8

An advantage of the search algorithm is that it can be applied even when loop bounds are not known at compile time. When the loop bounds are unknown, we consider only the stack distance expressions which does not involve loop bounds to predict the best tile sizes. During runtime when the loop bounds are actually known, the selected tile sizes can be searched to determine the best tile size. In practice with large loop bounds, this procedure will give very good tile sizes even with unknown loop bound.

We performed experiments to compute the best tile sizes for the two-index transform computation when loop bounds are not known. We used a cache size of 64KB with same loop bound along all dimensions and used the stack distances which does not involve loop bounds to predict the best tile sizes. In the search procedure we searched tile sizes upto 512 and the algorithm came up with the tile sizes of (64,16,16,128). We then ran the algorithm with several different loop bounds and when the loop bounds are large, the algorithm returns the same tile size as that of the unknown loop bound case. Only when loop bounds are very small such that everything fits in the cache, does the algorithm return different sizes for optimal tiling. The results are tabulated in Table 4.

In practice, the best tile sizes will be independent of the loop bounds when they are quite large. The reason is that most of the reuses are due to intra-tile reuses and the intra-tile reuse stack distances do not depend on loop Bounds. Hence in practice, we can expect to get very good results even if we consider only the stack distance expressions which do not involve loop bounds.


Table 4: Cache Miss prediction With known and unknown loop bounds
Loop Bound (N) Best Tile Size with known loop bounds Best Tile size with unknown loop Bound
1024 (64,16,16,128)  
512 (64,16,16,128)  
256 (64,16,16,128) (64,16,16,128)
128 (64,16,16,128)  
64 (64,64,64)  
32 (32,32,32)  



next up previous
Next: Optimizing Parallel Performance on Up: Cache Miss Characterization and Previous: Cache Miss Characterization and
rajkiran panuganti 2005-05-12