PLUTO - An automatic parallelizer and locality optimizer for multicores
PLUTO is an automatic parallelization tool that is based on the polyhedral model. The polyhedral model is a geometrical representation for programs that utilizes machinery from Linear Algebra and Linear Programming for analysis and high-level transformations. Pluto transforms C programs from source to source for coarse-grained parallelism and locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling and fusion, but not limited to it. OpenMP parallel code for multicores can be automatically generated from sequential C program sections. Outer, inner, or pipelined parallelization is achieved (purely with OpenMP pragrams), besides register tiling and making the code amenable for auto vectorization. Though the tool is fully automatic (C to OpenMP C), it provides a number of options (both command-line and through optional files) to tune aspects like tile sizes, unroll factors, and the outer level loop fusion structure. CLooG is used for code generation. A beta release can be downloaded below. Parallelized/optimized codes for some examples are also available separately below.
WARNING: The current frontend used by Pluto (LooPo's frontend) only accepts a very small subset of C -- sequences of arbitrarily nested loops with strictly affine array accesses. The polyhedral model is by no means restricted to such programs. Current state-of-the-art in the polyhedral model can accept any non-recursive program (assuming procedures in the portion of the program being optimized are either inlined or are pure function calls). Conditionals, accesses, and loop bounds need not be affine; they can be non-affine and data-dependent (dynamic); While loops are not a problem either. The Clan frontend will support extracting polyhedral representations from such general programs and will be integrated into Pluto in future. A pre-release with Clan/Candl (0.5.0) is available below.
Download
Pluto 0.4.2 (BETA), README, ChangeLog
All libraries that Pluto depends on (PipLib, PolyLib, and
CLooG) are included and are built automatically.
So nothing needs to be downloaded or installed separately.
All released versions to date
Quick Install
- tar zxvf pluto-0.4.2.tar.gz
- cd pluto-0.4.2/
- ./configure [--enable-debug]
- make
$ ./polycc test/seidel.c --tile --parallel
Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27
(PLUTO) Affine transformations (<var coeff's> <const>)
T(S1): (t, t+i, 2t+i+j)
3 4
1 0 0 0
1 1 0 0
2 1 1 0
t0 --> fwd_dep loop (band 0)
t1 --> fwd_dep loop (band 0)
t2 --> fwd_dep loop (band 0)
[Pluto] Outermost tilable band: t0--t2
[Pluto] After tiling:
t0 --> serial tLoop (band 0)
t1 --> parallel tLoop (band 0)
t2 --> fwd_dep tLoop (band 0)
t3 --> fwd_dep loop (band 0)
t4 --> fwd_dep loop (band 0)
t5 --> fwd_dep loop (band 0)
[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c
icc -O3 -openmp -lm seidel.par.c -o par -DTIME
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
|
|
Trying the examples
Original and transformed codes for comparing performance (same
as the ones in the reports/publications)
can be
browsed online here, or
downloaded
Experiments with these are in OSU-CISRC-10/07-TR70 (PLUTO: A
Practical Automatic Polyhedral Program Optimization System,
Bondhugula et al.) For exact reproduction of results, please use
ICC to compile all codes on an Intel Core 2 Quad Q6600. Similar
relative gains should be seen with GCC (version >= 4.1/4.2 have
OpenMP support) or on other multicores. To check what polyhedral
transformations were applied, please see the appropriate .cloog
files in the directories.
-
All examples can be run readily. In any particular example's
directory, run 'make' to build binaries and 'make clean' to
remove them. Timers are included and running time of the
kernel is printed out.
-
In any example directory:
<name>.c is the original input code, orig is the corresponding binary
<name>.tiled.c is the transformed locality tiled code (no parallelization), tiled is the corresponding binary.
<name>.par.c is the parallelized-cum-locality optimized code, par is the corresponding binary. This is usually the best code from Pluto.
<name>.par2d.c is the parallelized-cum-tiled code with two degrees of parallelism (outer, inner, or pipelined), where applicable, par2d is the corresponding binary.
<name>.opt.c is the code generated just with the transformation applied (i.e., no actual tiled or parallelized code) - this is usually not the code you want to run, but just to understand some details that are tough to see in the parallelized/tiled code
After doing a 'make', to run the original code:
$ ./orig
To run the pluto parallelized version (on a multicore with say 4 threads):
$ export OMP_NUM_THREADS=4; ./par
To run the pluto tiled version (non-parallelized, local tiled)
$ ./tiled
-
To change the compiler options or flags (it's currently set to
GCC), edit common.mk in the top-level directory. GCC
4.1/4.2 have OpenMP support and the necessary flags are
already set. The '-parallel' flag is used to generate
'orig_par' (original code auto-parallelized with ICC).
- To change problem sizes, edit decls.h.
- To test correctness of codes, run 'make test'
- CLooG input files corresponding to all transformed codes can also be found in the directories. In addition, for some examples, transformations computed by some previous frameworks (Lim/Lam affine partitioning and scheduling-based approaches) were forced by editing the CLooG file (that was automatically generated by PLUTO) and short-circuiting the tool chain.
Performance
All results are on an Intel Core2 Quad Q6600 on Linux x86-64 (2.6.18) with ICC 10.1 used to compile original and transformed/parallelized codes.
Imperfect Jacobi stencil
|
|
2-d FDTD
|
|
LU decomposition
|
|
GEMVER, Doitgen
|
|
DGEMM
|
|
ATLAS was tuned with GCC 4.1. kSelMM perf numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM, kGenMM perf on 2, 3 cores was interpolated (linearly) from perf on 1 and 4 cores.
Please email me reg. anything on Pluto.
Related Projects
- GRAPHITE: The polyhedral model in GCC (wiki)
- Polyhedral model extensions
- Clan, Candl, Cloog

