#
# PLuto README
#
# Uday Bondhugula
#

INSTALLING PLUTO

Requirements: A reasonable Linux distribution. Pluto has been tested on
x86 and x86-64 machines running Fedora Core {4,5,7,8,9}, Ubuntu, and
RedHat Enterprise Server 5.x. Solaris should also be fine if you have a
good set of utilities installed.

Pluto comes with all the libraries it depends on. The configuration
system (autoconf/automake) builds everything automatically, so nothing
needs to be downloaded or installed separately. Besides the LooPo
modules, the included libraries are Cloog, PipLib and Polylib.


BUILDING PLUTO

Just run 'install' from Pluto's top-level directory:

    $ tar zxvf pluto-0.3.0.tar.gz
    $ cd pluto-0.3.0/
    $ ./install
    $ make test

OR

    $ tar zxvf pluto-0.3.0.tar.gz
    $ cd pluto-0.3.0/
    $ ./configure [--enable-debug]
    $ make
    $ make test

If you do not have ICC, uncomment line 7 and comment out line 8 of
examples/common.mk.

'polycc' is the wrapper script around src/pluto (the core transformer)
and all other components. It runs all of these in sequence on an input
C program (with the section to parallelize/optimize marked) and is what
a user should run. The output is OpenMP parallel C code that can be
readily compiled and run on shared-memory parallel machines such as
multicores.


TRYING A NEW CODE

- Use /* pluto start () */ and /* pluto end */ around the section of
  code you want to parallelize/optimize. The parameter list inside the
  parentheses is a list of program parameters (typically symbols
  appearing in loop bounds) separated by commas, like
  /* pluto start (M,N) */

- Then, just run

    ./polycc --parallel --tile

  The transformation is also printed out, and test.par.c will have the
  parallelized code. If you want to see the intermediate files, such as
  the .cloog file and the dependence message file, use the --debug
  option. See the section on command-line options for the whole range
  of options.

- Tile sizes can be specified in a file 'tile.sizes'; otherwise default
  sizes will be set.
  Default tile sizes are usually good enough to give a significant
  improvement. See doc/DOC.txt on how to specify the sizes.

For running a good number of experiments on a code, it is best to use
the setup created for the example codes in the examples/ directory:

- Copy one of the sample directories and edit Makefile (SRC = ), util.h
  and decls.h appropriately (put your problem size declarations in
  decls.h).

- Run 'make'. This builds all the executables: 'orig' is the original
  code, 'tiled' is the tiled code, 'par' is the OpenMP parallelized +
  locality optimized code, and 'par2d' is the code with two degrees of
  parallelism whenever they exist. Alternatively, run 'make tiled',
  'make par', 'make orig', or 'make opt'.

- Run 'make test' to test for correctness.


COMMAND-LINE options with polycc

    --tile [--l2tile]
        Tile code; in addition, --l2tile will tile once more for the L2
        cache. By default, both of them are disabled. Tile sizes can be
        forced if needed from a file 'tile.sizes' (see below);
        otherwise, tile sizes are set automatically using a heuristic.

    --parallel [--multipipe]
        Parallelize code using OpenMP (usually only makes sense when
        used with --tile).

    --parallelize
        Same as --parallel.

    --multipipe
        Enable extraction of multiple degrees of parallelism (up to 2
        as of now). Disabled by default: without this option, only one
        degree of outer parallelism or coarse-grained pipelined
        parallelism is extracted. When --multipipe is used, the
        generated file has the suffix ".par2d.c".

    --smartfuse [default]
        This is the default fusion heuristic. It tries to fuse between
        SCCs of the same dimensionality.

    --nofuse
        Separate all strongly-connected components (SCCs) in the
        dependence graph to start with, i.e., no fusion across SCCs,
        at any level inside.

    --maxfuse
        This is geared towards maximal fusion (but is not actually
        maximal fusion). Optimization is done across SCCs.

    --[no]unroll
        Automatically identify and unroll up to two loops. Not enabled
        by default.
    --ufactor=
        Unrolling or unroll-jam factor.

    --[no]prevector
        Perform post-transformations to make the code amenable to
        vectorization. Enabled by default.

    --rar
        Consider RAR dependences for optimization (increases running
        time by a little). Disabled by default.

    --debug
        Verbose information to give some insights into the algorithm.
        Intermediate files are not deleted (such as the
        program-readable statement domains, dependences, pretty-printed
        dependences, the .cloog file, etc.). For the format of these
        files, refer to doc/DOC.txt for a pointer.

    --verbose
        Higher level of output. ILP formulation constraints are
        pretty-printed dependence-wise, along with the solution
        hyperplanes at each level.

Besides these, the 'tile.sizes' and '.fst' files allow the user to force
certain things. See doc/DOC.txt for these. Other options will only make
sense to power users. See comments in src/pluto.h for details.


TRYING ANY INCLUDED EXAMPLE CODE

Let us say we are trying the 2-d Gauss-Seidel example. Running
'make par' generates seidel.par.c from seidel.c and also compiles it to
produce 'par'. Likewise, use 'make tiled' for 'tiled' and 'make orig'
for 'orig'.

    $ cd examples/seidel

seidel.orig.c: This is the original code (the kernel in this code is
passed to the LooPo frontend for optimization through the tool).

seidel.opt.c: This is the transformed code without tiling (this is not
of much use, except for seeing the benefits of fusion in some cases).

seidel.tiled.c: This is the pluto tiled code generated by the tool. It
should be used for single-core execution (not parallelized).

seidel.par.c: This is the pluto parallelized + locality tiled code. It
has OpenMP pragmas and the code is L1 tiled, or L1 and L2 tiled.

seidel.par2d.c: In this case, we have two degrees of pipelined
parallelism, so the .par2d.c is the code with nested parallel OpenMP
pragmas.
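As a rough illustration of what a kernel handed to the tool looks like,
here is a minimal sketch of a 2-d Gauss-Seidel style loop nest wrapped
in the pluto start/end markers. It is not the seidel.c shipped in
examples/: the array name 'a', the sizes N and T, and the helper
functions init_grid() and grid_sum() are assumptions made only for this
sketch. The markers are ordinary C comments, so the file still compiles
as plain C.

```c
/* Hypothetical problem sizes; in the examples/ setup such declarations
 * would go in decls.h. */
#define N 8   /* grid is N x N (assumed value) */
#define T 2   /* number of time steps (assumed value) */

static double a[N][N];

/* Fill the whole grid with one value (helper for this sketch only). */
void init_grid(double value)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = value;
}

/* 9-point in-place averaging sweep over the interior of the grid.
 * The marked region is the part meant to be parallelized/optimized;
 * T and N are listed as the program parameters in the loop bounds. */
void seidel(void)
{
    int t, i, j;

/* pluto start (T,N) */
    for (t = 0; t <= T - 1; t++)
        for (i = 1; i <= N - 2; i++)
            for (j = 1; j <= N - 2; j++)
                a[i][j] = (a[i-1][j-1] + a[i-1][j] + a[i-1][j+1]
                         + a[i][j-1]   + a[i][j]   + a[i][j+1]
                         + a[i+1][j-1] + a[i+1][j] + a[i+1][j+1]) / 9.0;
/* pluto end */
}

/* Sum of all grid cells, handy as a cheap checksum when comparing the
 * original and transformed versions (helper for this sketch only). */
double grid_sum(void)
{
    int i, j;
    double s = 0.0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Calling init_grid(), then seidel(), then grid_sum() from a small driver
gives a value that can be compared between the original and generated
codes as a quick sanity check.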
- To change the compiler or any of the flags, edit the top section of
  examples/common.mk.

- To enable L2 tiling, edit the PLC_FLAGS variable in Makefile.

- To manually specify tile sizes, create tile.sizes (see lu/ for an
  example). Put the tile size for each dimension on a separate line,
  with the L1 tile size followed by the L2 tile size.

- orig (orig_par is the icc auto-parallelized one), tiled, par and
  par2d are the corresponding executables. They already have timers;
  you just have to run them and they will print the execution time as
  well.

  So, to run the pluto parallelized version:

    $ export OMP_NUM_THREADS=4; ./par

  To run the ICC auto-parallelized version:

    $ export OMP_NUM_THREADS=4; ./orig_par

  To run the original unparallelized code (compiled with icc -fast):

    $ ./orig

  To run the pluto tiled version (non-parallelized, locality tiled):

    $ ./tiled

- 'make clean' in the particular example's directory removes all the
  executables as well as the generated codes.


MORE INFO

* For specifying custom tile sizes through the 'tile.sizes' file, see
  doc/DOC.txt

* For specifying a custom fusion structure through the '.fst' file, see
  doc/DOC.txt

* See LOOPO_CHANGES for the list of LooPo modules included and the
  changes made to those

* See cloog-0.14.1/PLUTO_CHANGES for minor changes made to Cloog's
  configure.in

* See piplib-1.3.6/PLUTO_CHANGES for minor changes made to Cloog's
  configure.in

* See doc/DOC.txt for an overview of the system and more details

* WARNING: The current state of the pluto code makes it unsuitable for
  being worked on by anyone other than me. However, it is under active
  development, and you can expect the code to get cleaner and much more
  readable by release 0.3.x. Please contact me if you want to build
  things on top of it in the meanwhile.


CONTACT

Please send all bug reports and comments to Uday Bondhugula