OSU logo

PLUTO - An automatic parallelizer and locality optimizer for multicores


PLUTO is an automatic parallelization tool that is based on the polyhedral model. The polyhedral model is a geometrical representation for programs that utilizes machinery from Linear Algebra and Linear Programming for analysis and high-level transformations. Pluto transforms C programs from source to source for coarse-grained parallelism and locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling and fusion, but not limited to it. OpenMP parallel code for multicores can be automatically generated from sequential C program sections. Outer, inner, or pipelined parallelization is achieved (purely with OpenMP pragrams), besides register tiling and making the code amenable for auto vectorization. Though the tool is fully automatic (C to OpenMP C), it provides a number of options (both command-line and through optional files) to tune aspects like tile sizes, unroll factors, and the outer level loop fusion structure. CLooG is used for code generation. A beta release can be downloaded below. Parallelized/optimized codes for some examples are also available separately below.


WARNING: The current frontend used by Pluto (LooPo's frontend) only accepts a very small subset of C -- sequences of arbitrarily nested loops with strictly affine array accesses. The polyhedral model is by no means restricted to such programs. Current state-of-the-art in the polyhedral model can accept any non-recursive program (assuming procedures in the portion of the program being optimized are either inlined or are pure function calls). Conditionals, accesses, and loop bounds need not be affine; they can be non-affine and data-dependent (dynamic); While loops are not a problem either. The Clan frontend will support extracting polyhedral representations from such general programs and will be integrated into Pluto in future. A pre-release with Clan/Candl (0.5.0) is available below.

Download

Pluto 0.4.2 (BETA), README, ChangeLog

Pluto 0.5.0-pre

All libraries that Pluto depends on (PipLib, PolyLib, and CLooG) are included and are built automatically. So nothing needs to be downloaded or installed separately.
All released versions to date

Quick Install

  1. tar zxvf pluto-0.4.2.tar.gz
  2. cd pluto-0.4.2/
  3. ./configure [--enable-debug]
  4. make
$ ./polycc test/seidel.c --tile --parallel

Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27

(PLUTO) Affine transformations (<var coeff's> <const>)

T(S1): (t, t+i, 2t+i+j)
3 4
 1  0  0  0
 1  1  0  0
 2  1  1  0

t0 --> fwd_dep  loop   (band 0)
t1 --> fwd_dep  loop   (band 0)
t2 --> fwd_dep  loop   (band 0)

[Pluto] Outermost tilable band: t0--t2
[Pluto] After tiling:
t0 --> serial   tLoop  (band 0)
t1 --> parallel tLoop  (band 0)
t2 --> fwd_dep  tLoop  (band 0)
t3 --> fwd_dep  loop   (band 0)
t4 --> fwd_dep  loop   (band 0)
t5 --> fwd_dep  loop   (band 0)

[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c

icc -O3 -openmp -lm seidel.par.c -o par  -DTIME
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
            
2-d 
                        Gauss Seidel

Trying the examples

Original and transformed codes for comparing performance (same as the ones in the reports/publications) can be

browsed online here, or
downloaded

Experiments with these are in OSU-CISRC-10/07-TR70 (PLUTO: A Practical Automatic Polyhedral Program Optimization System, Bondhugula et al.) For exact reproduction of results, please use ICC to compile all codes on an Intel Core 2 Quad Q6600. Similar relative gains should be seen with GCC (version >= 4.1/4.2 have OpenMP support) or on other multicores. To check what polyhedral transformations were applied, please see the appropriate .cloog files in the directories.


  1. All examples can be run readily. In any particular example's directory, run 'make' to build binaries and 'make clean' to remove them. Timers are included and running time of the kernel is printed out.

  2. In any example directory:
    <name>.c is the original input code, orig is the corresponding binary
    <name>.tiled.c is the transformed locality tiled code (no parallelization), tiled is the corresponding binary.
    <name>.par.c is the parallelized-cum-locality optimized code, par is the corresponding binary. This is usually the best code from Pluto.
    <name>.par2d.c is the parallelized-cum-tiled code with two degrees of parallelism (outer, inner, or pipelined), where applicable, par2d is the corresponding binary.
    <name>.opt.c is the code generated just with the transformation applied (i.e., no actual tiled or parallelized code) - this is usually not the code you want to run, but just to understand some details that are tough to see in the parallelized/tiled code

    After doing a 'make', to run the original code:
    $ ./orig
    To run the pluto parallelized version (on a multicore with say 4 threads):
    $ export OMP_NUM_THREADS=4; ./par
    To run the pluto tiled version (non-parallelized, local tiled)
    $ ./tiled

  3. To change the compiler options or flags (it's currently set to GCC), edit common.mk in the top-level directory. GCC 4.1/4.2 have OpenMP support and the necessary flags are already set. The '-parallel' flag is used to generate 'orig_par' (original code auto-parallelized with ICC).

  4. To change problem sizes, edit decls.h.

  5. To test correctness of codes, run 'make test'

  6. CLooG input files corresponding to all transformed codes can also be found in the directories. In addition, for some examples, transformations computed by some previous frameworks (Lim/Lam affine partitioning and scheduling-based approaches) were forced by editing the CLooG file (that was automatically generated by PLUTO) and short-circuiting the tool chain.

Performance

All results are on an Intel Core2 Quad Q6600 on Linux x86-64 (2.6.18) with ICC 10.1 used to compile original and transformed/parallelized codes.

Imperfect Jacobi stencil

Jacobi (single core) Jacobi (multiple cores)

2-d FDTD

LU LU

LU decomposition

LU LU

GEMVER, Doitgen

GEMVER Doitgen

DGEMM

LU LU

ATLAS was tuned with GCC 4.1. kSelMM perf numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM, kGenMM perf on 2, 3 cores was interpolated (linearly) from perf on 1 and 4 cores.


Please email me reg. anything on Pluto.


Related Projects

Up to my research page

Up to my home page