OSU logo

PLUTO - An automatic parallelizer and locality optimizer for multicores


PLUTO is an automatic parallelization tool based on the polyhedral model. It transforms from source to source and can optimize regular programs (sequences of possibly imperfectly nested loops) for coarse-grained parallelism and locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling (but not limited to it). OpenMP parallel code for multicores can be automatically generated from sequential C program sections that are readily accepted in the polyhedral model (for now). A few LooPo modules for the frontend, CLooG as the code generator, PipLib and Polylib libraries (everything included). A beta release can be downloaded below. Parallelized/optimized codes for some examples are also available separately below.

Download

Pluto 0.3.0 (BETA) (updated Jul 9 2008), README, ChangeLog

All libraries that Pluto depends on (PipLib, PolyLib, and CLooG) are now included and are configured and built automatically as part of the Pluto build. So nothing needs to be downloaded or installed separately for Pluto versions >= 0.2.0.
Older versions

A much faster 0.4.0 with documentation will be available by Aug 20th.

Quick Install

  1. tar zxvf pluto-0.3.0.tgz
  2. cd pluto-0.3.0/
  3. ./configure [--enable-debug]
  4. make
$ ./polycc test/seidel.c --tile --parallel

Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27

(PLUTO) Affine transformations (<var coeff's> <const>)

S1:
3 4
 1  0  0  0
 1  1  0  0
 2  1  1  0

c0 --> fwd_dep/pipelined_parallel (band 0)
c1 --> fwd_dep/pipelined_parallel (band 0)
c2 --> fwd_dep/pipelined_parallel (band 0)

Outermost permutable band: 0 to 2

After tiling:
c0 --> tile_schedule (band 0)
c1 --> parallel (band 0)
c2 --> fwd_dep/pipelined_parallel (band 0)
c3 --> fwd_dep/pipelined_parallel (band 0)
c4 --> fwd_dep/pipelined_parallel (band 0)
c5 --> fwd_dep/pipelined_parallel (band 0)

[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c

icc -O3 -openmp -lm seidel.par.c -o par  -DTIME
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
            
2-d 
                        Gauss Seidel

Trying the examples

Original and transformed codes for comparing performance (same as the ones in the reports/publications) can be

browsed online here, or
downloaded

Experiments with these are in OSU-CISRC-10/07-TR70 (PLUTO: A Practical Automatic Polyhedral Program Optimization System, Bondhugula et al.) For exact reproduction of results, please use ICC to compile all codes on an Intel Core 2 Quad Q6600. Similar relative gains should be seen with GCC (version >= 4.1/4.2 have OpenMP support) or on other multicores. To check what polyhedral transformations were applied, please see the appropriate .cloog files in the directories.


  1. All examples can be run readily. In any particular example's directory, run 'make' to build binaries and 'make clean' to remove them. Timers are included and running time of the kernel is printed out.

  2. In any example directory:
    <name>.c is the original input code, orig is the corresponding binary
    <name>.tiled.c is the transformed locality tiled code (no parallelization), tiled is the corresponding binary.
    <name>.par.c is the parallelized-cum-locality optimized code, par is the corresponding binary. This is usually the best code from Pluto.
    <name>.par2d.c is the parallelized-cum-tiled code with two degrees of parallelism (pipelined or outer/inner), where applicable, par2d is the corresponding binary.
    <name>.opt.c is the code generated just with the transformation applied (i.e., no actual tiled or parallelized code) - this is usually not the code you want to run, but just to understand some details that are tough to see in the parallelized/tiled code

    After doing a 'make', to run the original code:
    $ ./orig
    To run the pluto parallelized version (on a multicore with say 4 threads):
    $ export OMP_NUM_THREADS=4; ./par
    To run the pluto tiled version (non-parallelized, local tiled)
    $ ./tiled

  3. To change the compiler options or flags (it's currently set to GCC), edit common.mk in the top-level directory. GCC 4.1/4.2 have OpenMP support and the necessary flags are already set. The '-parallel' flag is used to generate 'orig_par' (original code auto-parallelized with ICC).

  4. To change problem sizes, edit decls.h.

  5. To test correctness of codes, run 'make test'

  6. CLooG input files corresponding to all transformed codes can also be found in the directories. In addition, for some examples, transformations computed by some previous frameworks (Lim/Lam affine partitioning and scheduling-based approaches) were forced by editing the CLooG file (that was automatically generated by PLUTO) and short-circuiting the tool chain.

Performance

All results are on an Intel Core2 Quad Q6600 on Linux x86-64 (2.6.18) with ICC 10.1 used to compile original and transformed/parallelized codes.

Imperfect Jacobi stencil

Jacobi (single core) Jacobi (multiple cores)

2-d FDTD

LU LU

LU decomposition

LU LU

DGEMM

LU LU

ATLAS was tuned with GCC 4.1. kSelMM perf numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM, kGenMM perf on 2, 3 cores was interpolated (linearly) from perf on 1 and 4 cores.


Please email me reg. anything on Pluto, or post anonymously at these forums.
For bugs: http://sourceforge.net/forum/forum.php?forum_id=788152
For help: http://sourceforge.net/forum/forum.php?forum_id=788151
For open discussion: http://sourceforge.net/forum/forum.php?forum_id=788150


Related Projects

Up to my research page

Up to my home page