Next: Experimental Results
Up: Cache Miss Characterization and
Previous: The Computational Context
Optimizing Parallel Performance on Shared-Memory Systems
In this section, we show how the approach
developed in this paper can be applied in deriving loop
transformations for efficient execution on shared-memory parallel
systems. For the class of imperfectly nested loop structures that
arise in the context of the TCE, the common loops that enclose
multiple imperfectly nested loops are always parallel loops without
any loop-carried dependences. Therefore, these loops can be
partitioned across processors for synchronization-free parallel
execution of the iterations. Thus ample, easily identifiable
parallellism is available in the computations. The primary
limitation to scalability is the need to access shared data from
global shared memory. The computational structures arising in
accurate electronic structure calculations such as the coupled
cluster models often involve computations on arrays that are much
larger than cache, and quite often require out-of-core treatment
since they are too large to fit in physical memory.
Given an imperfectly nested loop structure generated from
a tensor contraction expression, the problem of effectively
executing it on a shared-memory system involves choice of
loops to distribute across processors, and choice of tile
sizes, so as to minimize the overhead of global memory access.
The cost of accessing shared memory due to cache misses is difficult
to model precisely, and lies somewhere between the following
two limit models:
- If the total memory bus bandwidth is the limiting factor,
two processors cannot
simultaneously access two elements in the main memory.
In this case the total cost will be proportional to the total number of
cache misses, or the sum of the number of cache misses for each processor.
- In the case of a very high memory bus bandwidth, all the processors can
simultaneously access elements in the main memory without affecting
each other's access time.
In this case the cost will be proportional to maximum among the
memory access costs of the processors.
In reality the cost will be somewhere between those two extreme limits.
For a perfectly load balanced work distribution, the two models above
will have the same optimal solution. Therefore the approach we pursue
for the shared-memory optimization is to consider the subset of the
iteration space that each
processor executes, and solve the sequential tile size optimization
problem for that subset of the iteration space. The parallel code uses the
tile sizes generated from such an analysis.
We use a simple example to illustrate the approach we pursue.
Figure 9:
One-dimensional partitioning.
|
[width=.25]parallel_oneloop.eps
|
Consider the following code for matrix-matrix product, where the
outer loop is partitioned across processors.
figure[tbh]
minipage
tabbing
xxxxxxxxxxxxxxxxxxxxxxxx
FOR I = MY_ID*NI/P + 1, (MY_ID + 1)*NI/P
FOR J = 1, NJ
FOR K = 1, NK
C(I,K) = C(I,K) + A(I,J)*B(J,K)
ENDFOR
ENDFOR
ENDFOR
where MY_ID is the processor number, running from 0 to
, and
is the loop to be split.
The portions of the arrays accessed by each processor are as shown
in Figure 9.
From the point of view of each processor, the problem of minimizing
its memory access cost is equivalent to
the serial problem involving the original array
and the corresponding
slices of
and
.
Subsections
Next: Experimental Results
Up: Cache Miss Characterization and
Previous: The Computational Context
rajkiran panuganti
2005-05-12