PLUTO - An automatic parallelizer and locality optimizer for multicores
PLUTO is an automatic parallelization tool based on the polyhedral model. It transforms from source to source and can optimize regular programs (sequences of possibly imperfectly nested loops) for coarse-grained parallelism and locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling (but not limited to it). OpenMP parallel code for multicores can be automatically generated from sequential C program sections that are readily accepted in the polyhedral model (for now). A few LooPo modules for the frontend, CLooG as the code generator, PipLib and Polylib libraries (everything included). A beta release can be downloaded below. Parallelized/optimized codes for some examples are also available separately below.
Download
Pluto 0.3.0 (BETA) (updated Jul 9 2008), README, ChangeLog
All libraries that Pluto depends on (PipLib, PolyLib, and
CLooG) are now included and are configured and built automatically
as part of the Pluto build. So nothing needs to be downloaded or
installed separately for Pluto versions >= 0.2.0.
Older versions
A much faster 0.4.0 with documentation will be available by Aug
20th.
Quick Install
- tar zxvf pluto-0.3.0.tgz
- cd pluto-0.3.0/
- ./configure [--enable-debug]
- make
$ ./polycc test/seidel.c --tile --parallel
Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27
(PLUTO) Affine transformations (<var coeff's> <const>)
S1:
3 4
1 0 0 0
1 1 0 0
2 1 1 0
c0 --> fwd_dep/pipelined_parallel (band 0)
c1 --> fwd_dep/pipelined_parallel (band 0)
c2 --> fwd_dep/pipelined_parallel (band 0)
Outermost permutable band: 0 to 2
After tiling:
c0 --> tile_schedule (band 0)
c1 --> parallel (band 0)
c2 --> fwd_dep/pipelined_parallel (band 0)
c3 --> fwd_dep/pipelined_parallel (band 0)
c4 --> fwd_dep/pipelined_parallel (band 0)
c5 --> fwd_dep/pipelined_parallel (band 0)
[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c
icc -O3 -openmp -lm seidel.par.c -o par -DTIME
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
|
|
Trying the examples
Original and transformed codes for comparing performance (same
as the ones in the reports/publications)
can be
browsed online here, or
downloaded
Experiments with these are in OSU-CISRC-10/07-TR70 (PLUTO: A
Practical Automatic Polyhedral Program Optimization System,
Bondhugula et al.) For exact reproduction of results, please use
ICC to compile all codes on an Intel Core 2 Quad Q6600. Similar
relative gains should be seen with GCC (version >= 4.1/4.2 have
OpenMP support) or on other multicores. To check what polyhedral
transformations were applied, please see the appropriate .cloog
files in the directories.
-
All examples can be run readily. In any particular example's
directory, run 'make' to build binaries and 'make clean' to
remove them. Timers are included and running time of the
kernel is printed out.
-
In any example directory:
<name>.c is the original input code, orig is the corresponding binary
<name>.tiled.c is the transformed locality tiled code (no parallelization), tiled is the corresponding binary.
<name>.par.c is the parallelized-cum-locality optimized code, par is the corresponding binary. This is usually the best code from Pluto.
<name>.par2d.c is the parallelized-cum-tiled code with two degrees of parallelism (pipelined or outer/inner), where applicable, par2d is the corresponding binary.
<name>.opt.c is the code generated just with the transformation applied (i.e., no actual tiled or parallelized code) - this is usually not the code you want to run, but just to understand some details that are tough to see in the parallelized/tiled code
After doing a 'make', to run the original code:
$ ./orig
To run the pluto parallelized version (on a multicore with say 4 threads):
$ export OMP_NUM_THREADS=4; ./par
To run the pluto tiled version (non-parallelized, local tiled)
$ ./tiled
-
To change the compiler options or flags (it's currently set to
GCC), edit common.mk in the top-level directory. GCC
4.1/4.2 have OpenMP support and the necessary flags are
already set. The '-parallel' flag is used to generate
'orig_par' (original code auto-parallelized with ICC).
- To change problem sizes, edit decls.h.
- To test correctness of codes, run 'make test'
- CLooG input files corresponding to all transformed codes can also be found in the directories. In addition, for some examples, transformations computed by some previous frameworks (Lim/Lam affine partitioning and scheduling-based approaches) were forced by editing the CLooG file (that was automatically generated by PLUTO) and short-circuiting the tool chain.
Performance
All results are on an Intel Core2 Quad Q6600 on Linux x86-64 (2.6.18) with ICC 10.1 used to compile original and transformed/parallelized codes.
Imperfect Jacobi stencil
|
|
2-d FDTD
|
|
LU decomposition
|
|
DGEMM
|
|
ATLAS was tuned with GCC 4.1. kSelMM perf numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM, kGenMM perf on 2, 3 cores was interpolated (linearly) from perf on 1 and 4 cores.
Please email me reg. anything on Pluto,
or post anonymously at these forums.
For bugs: http://sourceforge.net/forum/forum.php?forum_id=788152
For help: http://sourceforge.net/forum/forum.php?forum_id=788151
For open discussion: http://sourceforge.net/forum/forum.php?forum_id=788150
Related Projects
- GRAPHITE: The Polyhedral model in GCC (wiki)
- Polyhedral model extensions

