The optimization presented in this paper has been developed in the context of the Tensor Contraction Engine (TCE) [5,10], a domain-specific compiler for ab initio quantum chemistry calculations. The TCE takes as input a high-level specification of a computation expressed as a set of tensor contraction expressions and transforms it into efficient parallel code. Several compile-time optimizations are incorporated into the TCE: algebraic transformations to minimize operation counts [18,19], loop fusion to reduce memory requirements [15,17,16], space-time trade-off optimization [8], communication minimization [9], and data locality optimization [10,11] of memory-to-cache traffic.
A tensor contraction expression is comprised of a collection of multi-dimensional summations of the product of several input arrays. As an example, consider the following contraction, used often in quantum chemistry calculations to transform a set of two-electron integrals from an atomic orbital (AO) basis to a molecular orbital (MO) basis:
The indices
,
,
, and
have the same range
,
denoting the total number of orbitals, which is equal to
.
denotes the number of occupied orbitals and
denotes
the number of unoccupied (virtual) orbitals. Likewise, the
index ranges for
,
,
, and
are the same, and equal
to
. Typical values for
range from 10 to 300; the number
of virtual orbitals
is usually between 50 and 1000.
The calculation of
is done in the following four steps to reduce
the number of floating point operations from
in the
initial formula (8 nested loops, for
,
,
,
,
,
,
, and
) to
:
This operation-minimization transformation results in the creation of three intermediate arrays:
For illustration purposes, we focus on the following contraction
(a two-index transform):
The output of fusion step on the TCE synthesis process is thus a sequence of imperfectly nested loops with potentially symbolic loop bounds. Note that the array reference expressions are individual loop indices. Non-constant dependences result from the absence of a loop index in a array reference. In this paper, we evaluate the cache miss cost for this class of computations. The application of our model to tiling and parallelization are also done in this context.
In this section, we shall discuss some of the prevalent cache behavior models and compare their characteristics.
One of the most common models of cache behavior is reuse distance. Reuse distance refers to the number of iterations between two successive references to the same element of an array. If the number of intervening iterations is small, there is a high likelihood that the element will be retained in cache between the references, i.e. the reuse in the program translates into locality of reference. Thus, the smaller the reuse distance, the greater the expected cache locality, and the lower the cache miss cost. Computing reuse distances and guiding loop transformations based on it has been addressed [1]. But improvements in reuse distance may not necessarily translate to improvements in cache miss cost.
Capacity misses have been used as a measure of modeling cache behavior in the context of a domain-specific compiler for determining optimal tile sizes [10]. It involves determining the total number of distinct memory accesses within a set of loop iterations. The total capacity miss cost is evaluated as the number of iterations of a loop nest within which the number of distinct memory locations accessed exceeds the cache size. The imperfectly nested loop case is handled similarly. The program is transformed such that the program reuse is moved within the innermost loops, thus reducing the number of capacity misses. This metric does not consider interferences between array references. Also, though the total number of memory locations accessed may exceed the cache size, some of the array references might still exhibit good locality.
One of the more accurate methods for analyzing cache behavior is based the concept of stack distances [6]. Stack distance refers to the number of distinct memory accesses between two consecutive references to the same location. It captures the memory access pattern of a program, including interference between array references, from a memory trace. A compile-time characterization of the stack distances in a program can directly be used to determine the number of cache misses, assuming of a fully associative cache with LRU replacement policy.
Determining the stack distances for an arbitrary program is extremely difficult. In general, the program might have to be executed to obtain the memory trace. But, for programs that have regular access patterns, the stack distances can potentially be estimated at compile-time, providing an accurate measure of the cache miss cost. We base our model on the stack distances approach.
Though there has been work on more accurate analysis of cache miss cost, taking into account associativity and the resulting conflict misses [7,12,13], they have not been amenable to use in a model-driven optimizing compiler.
Illustration of Cache Miss Prediction using Stack Distance Computation We are using a stack distance based approach for cache miss prediction. To predict cache misses at compile time, we need to compute the stack distances for each of the references. We split the iteration space of each reference into several components to effectively compute the stack distances. In each of the these components, all reference instances will have same incoming dependences and hence will have the same stack distance. So we need to compute the stack distance for each of the components of each reference separately. The number of distinct memory locations accessed between the source and target of the reuse will give the stack distance of each component. After we determine the stack distances, it is straight forward to calculate the number of cache misses. All the references for which stack distance is greater than cache size will be cache misses and all other references will be cache hits. These stack distance expressions can also be used to predict good tile sizes. Our algorithm can handle symbolic loop bounds and imperfectly nested loops where array index expressions are loop indices. We shall first describe our overall algorithm using the matrix multiplication code fragment in the next section. Matrix Multiplication Example In this section, we present the example of matrix multiplication to illustrate the working of our overall algorithm. figure*[tbh]
Once all the reuses are identified, our algorithm proceeds as
follows:
Step 1: Partitioning the iteration Space: In this
case, the iteration space can be partitioned into nine elementary
partitions. In three of the elementary components, the instances of
A, B, and C respectively have no incoming dependence i.e. they have
no reuse. The instances corresponding to
= (a,b,1,c,d,1)
of A,
= (a,1,b,c,1,d) of B and
= (1,a,b,1,c,d)
of C, where a,b,c,d are any indices within the limits of the
corresponding loop bounds, have no reuses. In the other six
elementary components, the reuses are present. The source and the
target iteration vectors for these partitions are shown in the Table
1.
Step 2: Compute the Stack Distance for each component
of each reference : In our example, there are six reuses. We
determine the dependence span (the iterations starting from source
to target of the reuse) for each of these reuses and identify the
array elements accessed in those iterations. Let cost of an
array with respect to a reuse be the number of distinct memory
locations of that array accessed from source iteration vector to the
target iteration vector of the reuse. Now the algorithm computes the
cost of each array. The algorithm then computes the sum of costs of
each array accessed by the iterations within the dependence span of
the reuse.
The number of array elements accessed is a function of the source
and target iteration of the reuse. For this example, the source and
target of all the reuses are shown in the Table 1.
Note that the these are computed symbolically and it represents all
instances of the array reference targeted by the dependence. The
instances corresponding to
= (a,b,1,c,d,1) of A,
= (a,1,b,c,1,d) of B and
= (1,a,b,1,c,d) of C, where
a,b,c,d are any indices within the limits of the corresponding loop
bounds, have no incoming dependences. Hence the stack distances for
these instances would be
. For all the other reuses, we
calculate the cost for each array in the dependence span and then
sum these costs for all arrays that are accessed between the reuse.
Notice that the cost for each array is computed symbolically. The
stack distances computed for this example is shown in table
1. Now using the stack distance expressions, we
will evaluate the optimal tile size as described in the section
6.
Partitioning the Iteration Space figure*[tbh]
We first partition the iteration space into elementary components
such that all reference instances in the partition have the same
stack distance. To compute all the partitions, first we create a
tree representing the complete loop structure where each node is a
tuple as described below.
A node of a tree is a tuple of five elements: node
(NoChild,
Children[size], Parent, SeqNo,Ref)
where, NoChild
represents the number of children of the current node,Children
is an array of pointers to the child nodes, Parent is a
pointer to the parent node, SeqNo is the index of the current
node among the siblings, Ref is an array of the reference tuples,
each consisting of four elements described below. This element of
the tuple is defined only for the leaf nodes. The tuple reference
consists of four elements:
Reference
(StmtNo, ArrayName, Iters[n], Appears[n], LoopDepth)
where, StmtNo is the sequence number of the statement
containing the reference, ArrayName is the name of the array
which is being referenced, Iters is array of the indices of
all the enclosing loops, Appears is a boolean array indicating
if the particular index appears in the array reference, and LoopDepth is the number of enclosing loops.
We then create a tree representing the loop structure for each array
by deleting all the loop nests which does not contain any reference
to that array. The StmtNo will however be the same as that of
the original loop for identification. We formulate an algorithm to
partition the iteration space such that all reference instances of
an array in a partition will have same incoming dependences. The
function outputs the source and target statement numbers and their
iteration instances corresponding to each partition.
The partitioning algorithm works as follows. First we begin with the last reference in the loop nest. We then traverse upward in the tree level by level. At each level, if the reuse is from the same reference then the algorithm outputs the required iteration instances. Otherwise if the reuse is from a reference in some other statement, it computes the corresponding iteration instances. This is done for all the references in the imperfectly nested loop nest in a recursive fashion. The details are provided in the algorithm shown in the Figure 3. The algorithm begins with a call Partition (n, n.Ref, n.Ref.LoopDepth, 0), where n is the rightmost leaf node and outputs the partitions. Cache Misses Estimation Using Stack Distances figure*[tbh]
Now let us consider the case of reuse within an imperfectly nested
loop structure between two statements,
and
as shown in the
figure 6. The cost in this case is calculated using
the algorithm as shown in Figure 5.
The algorithm shown in the Figure 5 needs
modification to handle the cases where we have auxiliary branches in
the tree in addition to the source and target branch as shown in the
parse tree of Figure 7 for the reuse of
array T between statements
and
. Let us compute the stack
distance corresponding to the reuse for the statement S from
iteration
to
as shown. Let
be
the set of common indices with the auxiliary branch. If
(
then the auxiliary branch
is traversed completely. Hence, we set all the lower limits of the
non-appearing uncommon indices to 1 if the branch appears after the
current branch and set all the upper limits of the non-appearing
uncommon indices to N if the auxiliary branch appears before the
current branch. We then apply the above described algorithm to
compute the stack distance.
The above algorithm computes the cost for the arrays which appear in
the statements
and
. However, if an array reference does
not appear in the statements
or
when considering the
reuse between the iteration
of
and iteration
of
. The array however, is referenced by the
statements in the auxiliary branches. We would like to compute the
number of distinct references made to this array as it is included
in the expression for stack distance computation. There are three
ways in which the statements referencing this particular array
appears with respect to the statement S, (the statement which is the
target of the reuse). a) The statement referencing the array appears
before the statement S, b) The statement referencing the array
appears after the statement S, and c) The statement referencing the
array appears both before and after the statement S. We handle these three cases as follows.
Case a) Let
be the set of common indices between
the branch containing the current statement and the auxiliary branch
corresponding to the first statement referencing the array. In this
case, the source is set to (
+1,1,....,1) and the target
is set to (
,N,...,N)
Case b) Let
be the set of common indices between
the branch containing the current statement and the auxiliary branch
corresponding to the last statement referencing the array. In this
case, the source is set to (
,1,....,1) and the target
is set to (
-1,N,...,N)
Case c) In this case, we set the source to (
,1,....,1)
and (
,N,...,N), where
and
are
defined in the case a) and b) respectively. Note that the dimension
of the vector
and the vector
may be different.
With source and target as described above, we compute the cost corresponding to these arrays using the algorithm described above. The stack distance is again the sum of the costs corresponding to all the arrays accessed between the source iteration vector and the target iteration vector.
We used our algorithms to compute the stack distance expressions for tiled matrix multiplication and the two index transform at compile time. These expressions were then used to predict the number of cache misses. These results were compared against the the results obtained from the SimpleScalar Tool Set. We used fully associative caches for doing the simulations using the sim-cache cache simulator of the SimpleScalar Tool Set. The results of the cache miss prediction and simulation from sim-cache are shown in Table 2 and Table 3 and show that the model is very accurate.
|
|
There are two types of reuse, inter-tile reuse where reuse occurs between iterations in two different tiles and intra-tile reuse where reuse occurs between iterations within a tile. The following two facts are used in pruning the search space:
1. The stack distances of inter-tile reuses are greater than stack distances of intra-tile reuse. Few of the inter-tile reuse between consecutive tiles may have lesser stack distances. But for simplifying the discussion we do not consider them. Actually they are used in the algorithm for good tile size generation.
2. As the tile size is increased, number of intra-tile reuses which corresponds to interior points of the tile monotonically increases and number of inter-tile reuses which corresponds to boundary points monotonically decreases. This means some of the inter-tile reuses becomes intra-tile reuses with increase in tile sizes.
Due to these two properties, teh number of cache misses follows a specific pattern as the tile size is increased. The transitions of number of cache misses can be divided into four phases.
Phase 1: There will be a few compulsory cache misses corresponding to some inter-tile stack distances which are more than the cache size for any tile size. Even if the tile sizes are very less these misses will occur. This phase corresponds to the tile sizes from minimum values to the sizes where some additional inter-tile stack distances exceeds the cache size. As the number of inter-tile reuses decreases with increase in tile size in this phase, some of the cache misses now becomes hits. Hence, In this phase number of misses will gradually decrease.
Phase 2: In this phase, tile sizes spans from tile sizes of Phase 1 till tile sizes where all the inter-tile stack distances exceeds cache size. As tile size increases in this phase, at one point some inter-tile stack distance will exceed cache size and at this tile size, number of cache misses will abruptly increase. After this number of misses will continue to decrease till some other stack distance exceeds cache size. This is again due to the fact that number of inter-tile reuses decreases with increase in tile size and thereby converting some of the misses into hits.
Phase 3: In this phase, tile sizes starts from tile sizes of Phase 2 till tile sizes where all the intra-tile stack distances exceeds cache size (Few stack distances may never exceed cache size). With increase in tile sizes in this phase, some inter-tile stack distance will exceed cache size and at this point, number of cache misses will suddenly increase. After this number of misses will continue to decrease till some other stack distance exceeds cache size. The reason being some of inter-tile reuses becomes intra-tile reuses with increase in tile size and thereby converting some of the misses into hits. But none of the hits will become misses with increase in tile size as all the inter-tile reuses will be misses in this phase.
Phase 4: This phase corresponds to the tile sizes from the previous phase till the loop bounds. Even if the tile sizes are very large, some of the accesses will always be hits as the corresponding intra-tile stack distances will be constants. But as described in earlier phases, number of misses will gradually decrease with increase in tile size.
From the above discussion we observe, when certain stack distance exceeds cache size number of cache misses abruptly increases. But as we increase tile sizes without any more stack distance exceeding the cache size, number of cache misses monotonically decreases. Thus, we need to consider tile sizes just before the point where some stack distance exceeds cache size. Other tile sizes need not be considered as there will be larger tile sizes such that no additional stack distances exceeds cache size and hence will have lower number of cache misses.
In our algorithm, first we consider different tile sizes in steps of some constant value along all dimensions. We go through all these tile sizes and select those tile sizes which cannot be increased in any dimension any further without some additional stack distance exceeding cache size. The best tile size will be very close to some of the selected tile sizes. We then do a finer search around the selected points. We collect the surrounding tiles of each selected tile and select those tile sizes which are better candidates to have lesser cache misses. We then repeat the finer search procedure using reduced step sizes for incrementing tile sizes. Finally, we remove the redundant tiles (tiles such that better tiles are already in the list) from the list.
We represent tiles of different sizes by a tuple Block:
TileSize, IncSize, StackDist
. The first member TileSize
is an array of integers specifying size of the tile along each
dimension. IncSize is an array of integers which gives the increment
along each dimension for larger tiles. StackDist is a list of stack
distances which is less than Cache Size for this tile size.
The initial set of tile sizes we consider can be represented as
where,
represents tile size increment along
dimension, (1, N,
) is the set of values from 1 till N in
steps of
and
the operator
represents the cross product of all sets.
figure*[tbh]
The overall algorithm for searching is shown in Figure 8
An advantage of the search algorithm is that it can be applied even when loop bounds are not known at compile time. When the loop bounds are unknown, we consider only the stack distance expressions which does not involve loop bounds to predict the best tile sizes. During runtime when the loop bounds are actually known, the selected tile sizes can be searched to determine the best tile size. In practice with large loop bounds, this procedure will give very good tile sizes even with unknown loop bound.
We performed experiments to compute the best tile sizes for the two-index transform computation when loop bounds are not known. We used a cache size of 64KB with same loop bound along all dimensions and used the stack distances which does not involve loop bounds to predict the best tile sizes. In the search procedure we searched tile sizes upto 512 and the algorithm came up with the tile sizes of (64,16,16,128). We then ran the algorithm with several different loop bounds and when the loop bounds are large, the algorithm returns the same tile size as that of the unknown loop bound case. Only when loop bounds are very small such that everything fits in the cache, does the algorithm return different sizes for optimal tiling. The results are tabulated in Table 4.
In practice, the best tile sizes will be independent of the loop bounds when they are quite large. The reason is that most of the reuses are due to intra-tile reuses and the intra-tile reuse stack distances do not depend on loop Bounds. Hence in practice, we can expect to get very good results even if we consider only the stack distance expressions which do not involve loop bounds.