next up previous
Next: Experimental Results Up: Cache Miss Characterization and Previous: The Computational Context


Optimizing Parallel Performance on Shared-Memory Systems

In this section, we show how the approach developed in this paper can be applied in deriving loop transformations for efficient execution on shared-memory parallel systems. For the class of imperfectly nested loop structures that arise in the context of the TCE, the common loops that enclose multiple imperfectly nested loops are always parallel loops without any loop-carried dependences. Therefore, these loops can be partitioned across processors for synchronization-free parallel execution of the iterations. Thus ample, easily identifiable parallellism is available in the computations. The primary limitation to scalability is the need to access shared data from global shared memory. The computational structures arising in accurate electronic structure calculations such as the coupled cluster models often involve computations on arrays that are much larger than cache, and quite often require out-of-core treatment since they are too large to fit in physical memory.

Given an imperfectly nested loop structure generated from a tensor contraction expression, the problem of effectively executing it on a shared-memory system involves choice of loops to distribute across processors, and choice of tile sizes, so as to minimize the overhead of global memory access.

The cost of accessing shared memory due to cache misses is difficult to model precisely, and lies somewhere between the following two limit models:

  1. If the total memory bus bandwidth is the limiting factor, two processors cannot simultaneously access two elements in the main memory. In this case the total cost will be proportional to the total number of cache misses, or the sum of the number of cache misses for each processor.
  2. In the case of a very high memory bus bandwidth, all the processors can simultaneously access elements in the main memory without affecting each other's access time. In this case the cost will be proportional to maximum among the memory access costs of the processors.
In reality the cost will be somewhere between those two extreme limits. For a perfectly load balanced work distribution, the two models above will have the same optimal solution. Therefore the approach we pursue for the shared-memory optimization is to consider the subset of the iteration space that each processor executes, and solve the sequential tile size optimization problem for that subset of the iteration space. The parallel code uses the tile sizes generated from such an analysis.

We use a simple example to illustrate the approach we pursue.

Figure 9: One-dimensional partitioning.
[width=.25]parallel_oneloop.eps

Consider the following code for matrix-matrix product, where the outer loop is partitioned across processors.

figure[tbh]



minipage tabbing xxxxxxxxxxxxxxxxxxxxxxxx
FOR I = MY_ID*NI/P + 1, (MY_ID + 1)*NI/P

FOR J = 1, NJ



FOR K = 1, NK




C(I,K) = C(I,K) + A(I,J)*B(J,K)



ENDFOR

ENDFOR
ENDFOR

where MY_ID is the processor number, running from 0 to $P-1$, and $I$ is the loop to be split.

The portions of the arrays accessed by each processor are as shown in Figure 9. From the point of view of each processor, the problem of minimizing its memory access cost is equivalent to the serial problem involving the original array $B$ and the corresponding slices of $A$ and $C$.



Subsections
next up previous
Next: Experimental Results Up: Cache Miss Characterization and Previous: The Computational Context
rajkiran panuganti 2005-05-12