Overview
Loop tiling, as one of the most important compiler optimization techniques, is beneficial for both parallel machines and uniprocessors. Efficient generation of multi-level tiled code is essential to maximize data reuse in deep memory hierarchies. Tiled loops with parameterized tile sizes (not compile time constants) enable runtime optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all statements are contained inside the innermost loop of a loop nest. Previous solutions to tiling for imperfect loop nests have been limited to the case where tile sizes are fixed. PrimeTile provides an effective way to generate efficient parameterized multi-level tiled code for imperfectly nested loops. The generated tiled code contains loops that iterate over full rectangular tiles, making them suitable for compiler optimizations such as register tiling.
Download
- PrimeTile 0.2.0 (beta) (updated: March 25th, 2009), ChangeLog
- PrimeTile 0.3.0 (beta -- prerelease) (for latest improvements, including register tiling support), ChangeLog
Software Requirements
-
Python (widely available in any Linux/Unix distribution)
PrimeTile has been tested successfully with Python 2.5.1 (and any newer versions) on various Linux distributions. - GMP (GNU Multiple Precision Arithmetic Library -- a portable library written in C for arbitrary-precision arithmetic on integers, rational numbers, and floating-point numbers)
- PrimeTile consists of two major components: an adapted version of CLooG and Orio. Both are already included in the PrimeTile package. Nothing needs to be downloaded separately.
Installation
To install PrimeTile, we only need to build the CLooG code that is already included in the package.
- tar -xvzf primetile-X.X.X.tar.gz
- cd primetile-X.X.X
- cd cloog
- ./configure --without-polylib --with-isl=bundled --with-gmp-prefix=/path/to/gmp/installation
- make
- ./check
Note that the GMP installation directory must be explicitly specified during configuration. The last step is used for ensuring whether the adapted version of CLooG has been installed properly.
Command Line Options
'primetile' is the script wrapper around all PrimeTile's components.
% ./primetile --help
Description: compile script for PrimeTile
Usage: primetile [options] <ifile>
<ifile> Input file containing the context, the statement domains, and the scattering functions
Options for code generation:
-t <level> | --tilelevel=<level> The number of tiling levels (0: no tiling)
(default setting: 1)
-b <level> | --boundarylevel=<level> The largest tiling level used for tiling the boundary tiles
(0: no boundary-tile tiling, -1: tilelevel-1)
(default setting: -1)
-f <depth> | --first=<depth> The first loop depth to start tiling (-1: infinity)
(default setting: 1)
-l <depth> | --last=<depth> The last loop depth to stop tiling (-1: infinity)
(default setting: -1)
General options:
-o <file> | --output=<file> Place the output to <file>
-h | --help Display this usage message
-q | --quiet Don't print any details of the running program
Using PrimeTile
In this section, we illustrate how to write the input file and how to use PrimeTile. PrimeTile takes as input a file that contains a description of the loop nests to be tiled. The output of PrimeTile is a parametric multi-level tiled C code.
Let us consider the 1D Jacobi code example and its corresponding iteration space (for T=6 and N=6) given below. The blue arrows denote loop-carried data dependences among statement instances. The iteration space of 1D Jacobi is two dimensional and it contains two different statements (that is, S1 and S2).
for (t=0; t<=T-1; t++) {
for (i=1; i<=N-2; i++)
B[i] = 0.33333*(A[i-1]+A[i]+A[i+1]); /* S1 */
for (i=1; i<=N-2; i++)
A[i] = B[i]; /* S2 */
}
|
|
A set of affine constraints can be used to describe each statement's iteration space (called a statement domain). In the 1D Jacobi example, the statement domain of S1 can be expressed using the following set of affine constraints, where t,i are the loop iterators, and T,N are the global parameters. Note that the statement domain of S2 is exactly the same as that of S1.
0<=t<=T-1 1<=i<=N-2
Additionally, we also consider a partial knowledge of the values of the global parameters (called the context), expressed as affine constraints as well, such as the following:
T>=1 N>=3
The affine constraints of both the statement domains and the context are what we will provide to PrimeTile as input. Note that PrimeTile's input must be written accordingly to CLooG's input syntax, which is described in depth in the CLooG's documentation. The domains and the context of 1D Jacobi written in CLooG's input syntax (and saved in a file named '1d-jacobi.cloog') are provided in the following.
% more 1d-jacobi.cloog # ------------- LANGUAGE ------------- # Language is C c # ------------- CONTEXT ------------- 2 4 # 2 lines and 4 columns # eq/in T N 1 1 1 0 -1 # 1*T + 0*N -1*1 >= 0, i.e., T>=1 1 0 1 -3 # 0*T + 1*N -3*1 >= 0, i.e., N>=3 1 # We want to set manually the parameter names T N # ------------- STATEMENT DOMAINS ------------- # Number of statements 2 # First domain 1 # One domain 4 6 # 4 lines and 6 columns # eq/in t i T N 1 1 0 1 0 0 -1 # i>=1 1 0 -1 0 1 -2 # N-2>=i 1 1 0 0 0 0 # t>=0 1 -1 0 1 0 -1 # T-1>=t 0 0 0 # Other options # Second domain 1 # One domain 4 6 # 4 line and 6 columns # eq/in t i T N 1 1 0 1 0 0 -1 # i>=1 1 0 -1 0 1 -2 # N-2>=i 1 1 0 0 0 0 # t>=0 1 -1 0 1 0 -1 # T-1>=t 0 0 0 # Other options 1 # We want to set manually the loop iterator names t i # Other options 0
Using the command given below, we can generate the corresponding unoptimized loop structure (i.e., untiled code) using PrimeTile. However, the generated untiled code (stored in file '1d-jacobi.primetile.c') will give incorrect results. This incorrectness is because the loops of statements S1 and S2 are completely fused and this fusion changes the order of statement execution, resulting in some data dependence violations. For instance, the fact that S2(0,0) is executed before S1(0,1) clearly violates the data dependence vector (0,-1).
% ./primetile --tilelevel=0 1d-jacobi.cloog; more 1d-jacobi.primetile.c
int t,i;
for (t=0; t<=T-1; t+=1) {
for (i=1; i<=N-2; i+=1) {
S1(t,i);
S2(t,i);
}
}
To fix this correctness problem, we can enforce the loop scanning order to respect the data dependences. This can be done by adding scattering functions in the input file, as seen below.
% more 1d-jacobi.cloog # ------------- LANGUAGE ------------- # language: C c # ------------- CONTEXT ------------- # Context 2 4 # 2 lines and 4 columns # eq/in T N 1 1 1 0 -1 # 1*T + 0*N -1*1 >= 0, i.e., T>=1 1 0 1 -3 # 0*T + 1*N -3*1 >= 0, i.e., N>=3 1 # We want to set manually the parameter names T N # ------------- STATEMENT DOMAINS ------------- # Number of statements 2 # First domain 1 # One domain 4 6 # 4 lines and 6 columns # eq/in t i T N 1 1 0 1 0 0 -1 # i>=1 1 0 -1 0 1 -2 # N-2>=i 1 1 0 0 0 0 # t>=0 1 -1 0 1 0 -1 # T-1>=t 0 0 0 # Other options # Second domain 1 # One domain 4 6 # 4 line and 6 columns # eq/in t i T N 1 1 0 1 0 0 -1 # i>=1 1 0 -1 0 1 -2 # N-2>=i 1 1 0 0 0 0 # t>=0 1 -1 0 1 0 -1 # T-1>=t 0 0 0 # Other options 1 # We want to set manually the loop iterator names t i # ------------- SCATTERING FUNCTIONS ------------- # Number of scattering functions 2 # First scattering function 3 9 # 3 lines and 9 columns # eq/in _t _i _j t i T N 1 0 1 0 0 -1 0 0 0 0 # _t = t 0 0 1 0 0 0 0 0 0 # _i = 0 0 0 0 1 0 -1 0 0 0 # _j = i # Second scattering function 3 9 # 3 lines and 9 columns # eq/in _t _i _j t i T N 1 0 1 0 0 -1 0 0 0 0 # _t = t 0 0 1 0 0 0 0 0 -1 # _i = 1 0 0 0 1 0 -1 0 0 0 # _j = i # We want to set manually the scattering dimension names 3 _t _i _j
With such additions of scattering functions, we now have a generated code (shown below) that is correct and matches with the original 1D Jacobi code.
% ./primetile --tilelevel=0 1d-jacobi.cloog; more 1d-jacobi.primetile.c
int _t,_j;
for (_t=0; _t<=T-1; _t+=1) {
for (_j=1; _j<=N-2; _j+=1) {
S1(_t,_j);
}
for (_j=1; _j<=N-2; _j+=1) {
S2(_t,_j);
}
}
Important: In order for PrimeTile to generate a correct tiled code, it is very important for users to ensure that the input program of PrimeTile is rectangularly tileable. This means that all data dependences are lexicographically non-negative in all space dimensions (that is, there is no backward data dependences).
If the iteration space is not originally rectangularly tileable, users will need to transform the input program to make rectangular tiling valid. Transforming the input program for rectangular tileability can be done by explicitly specifying the scanning order of the iteration space using scattering functions. The 1D Jacobi example is not tileable because it has dependence vectors containing negative components, i.e., (0,-1) and (1,-1). The skewing transformation presented below can be used to eliminate these backward dependences. As seen in the following figure, the iteration space of 1D Jacobi code after skewing no longer has backward dependences and, hence, rectangular tiling now becomes valid.
|
Affine transformation (skewing):
S1: (t',i') = (t,2*t+i) S2: (t',i') = (t,2*t+i+1) |
|
There are tools that, for a given set of loop nests, can automatically generate scattering functions that satisfy the tileability condition. One example of such tools is Pluto, a state-of-the-art tiling-based parallelizer and locality optimizer for multicore processors. The .cloog file that PrimeTile needs as an input can be readily obtained by running Pluto with the --debug option (and the generated .opt.cloog file will become PrimeTile's input).
The following input program corresponds to a rectangularly tileable 1D Jacobi loop code, skewed using the affine transformations produced by Pluto (displayed in the figure above).
% more 1d-jacobi.cloog # ------------- LANGUAGE ------------- # language: C c # ------------- CONTEXT ------------- # Context 2 4 # 2 lines and 4 columns # eq/in T N 1 1 1 0 -1 # 1*T + 0*N -1*1 >= 0, i.e., T>=1 1 0 1 -3 # 0*T + 1*N -3*1 >= 0, i.e., N>=3 1 # We want to set manually the parameter names T N # ------------- STATEMENT DOMAINS ------------- # Number of statements 2 # First domain 1 # One domain 4 6 # 4 lines and 6 columns # eq/in t i T N 1 1 0 1 0 0 -1 # i>=1 1 0 -1 0 1 -2 # N-2>=i 1 1 0 0 0 0 # t>=0 1 -1 0 1 0 -1 # T-1>=t 0 0 0 # Other options # Second domain 1 # One domain 4 6 # 4 line and 6 columns # eq/in t i T N 1 1 0 1 0 0 -1 # i>=1 1 0 -1 0 1 -2 # N-2>=i 1 1 0 0 0 0 # t>=0 1 -1 0 1 0 -1 # T-1>=t 0 0 0 # Other options 1 # We want to set manually the loop iterator names t i # ------------- SCATTERING FUNCTIONS ------------- # Number of scattering functions 2 # First scattering function 3 9 # 3 lines and 9 columns # eq/in _t _i _j t i T N 1 0 1 0 0 -1 0 0 0 0 # _t = t 0 0 1 0 -2 -1 0 0 0 # _i = 2*t+i 0 0 0 1 0 0 0 0 0 # _j = 0 # Second scattering function 3 9 # 3 lines and 9 columns # eq/in _t _i _j t i T N 1 0 1 0 0 -1 0 0 0 0 # _t = t 0 0 1 0 -2 -1 0 0 -1 # _i = 2*t+i+1 0 0 0 1 0 0 0 0 -1 # _j = 1 # We want to set manually the scattering dimension names 3 _t _i _j
A more readable version of the tileable untiled code of 1D Jacobi is shown in the following, generated using PrimeTile.
% ./primetile --tilelevel=0 1d-jacobi.cloog; more 1d-jacobi.primetile.c
int _t,_i;
if (N>=4) {
for (_t=0; _t<=T-1; _t+=1) {
S1(_t,1);
for (_i=2*_t+2; _i<=2*_t+N-2; _i+=1) {
S1(_t,-2*_t+_i);
S2(_t,-2*_t+_i-1);
}
S2(_t,N-2);
}
}
if (N==3) {
for (_t=0; _t<=T-1; _t+=1) {
S1(_t,1);
S2(_t,1);
}
}
We can optionally provide PrimeTile with the macro definitions of all statements. In the 1D Jacobi example, we specify the details of statements S1 and S2 in '1d-jacobi.stmts.c' whose content is as follows.
% more 1d-jacobi.stmts.c #define S1(t,i) B[i]=0.33333*(A[1+i]+A[i]+A[i-1]) #define S2(t,i) A[i]=B[i]
So, to tile the 1D Jacobi code:
% ./primetile 1d-jacobi.cloog; more 1d-jacobi.primetile.c
int _t,_i,_tt1,_it1,lb1,ub1;
if (N>=4) {
for (_tt1=0; _tt1<=T-1-(T1_t-(1)); _tt1+=T1_t) {
lb1=2*(_tt1+(T1_t-(1)))+2;
ub1=2*(_tt1)+N-2;
if (lb1<ub1) {
for (_t=_tt1; _t<=_tt1+(T1_t-(1)); _t+=1) {
B[1]=0.33333*(A[1+1]+A[1]+A[1-1]);
for (_i=2*_t+2; _i<=lb1-(1); _i+=1) {
B[-2*_t+_i]=0.33333*(A[1+-2*_t+_i]+A[-2*_t+_i]+A[-2*_t+_i-1]);
A[-2*_t+_i-1]=B[-2*_t+_i-1];
}
}
for (_it1=lb1; _it1<=ub1-(T1_i-(1)); _it1+=T1_i) {
/* start of full tiles: _t,_i */
for (_t=_tt1; _t<=_tt1+(T1_t-(1)); _t+=1) {
for (_i=_it1; _i<=_it1+(T1_i-(1)); _i+=1) {
B[-2*_t+_i]=0.33333*(A[1+-2*_t+_i]+A[-2*_t+_i]+A[-2*_t+_i-1]);
A[-2*_t+_i-1]=B[-2*_t+_i-1];
}
}
/* end of full tiles: _t,_i */
}
for (_t=_tt1; _t<=_tt1+(T1_t-(1)); _t+=1) {
for (_i=_it1; _i<=2*_t+N-2; _i+=1) {
B[-2*_t+_i]=0.33333*(A[1+-2*_t+_i]+A[-2*_t+_i]+A[-2*_t+_i-1]);
A[-2*_t+_i-1]=B[-2*_t+_i-1];
}
A[N-2]=B[N-2];
}
} else {
for (_t=_tt1; _t<=_tt1+(T1_t-(1)); _t+=1) {
B[1]=0.33333*(A[1+1]+A[1]+A[1-1]);
for (_i=2*_t+2; _i<=2*_t+N-2; _i+=1) {
B[-2*_t+_i]=0.33333*(A[1+-2*_t+_i]+A[-2*_t+_i]+A[-2*_t+_i-1]);
A[-2*_t+_i-1]=B[-2*_t+_i-1];
}
A[N-2]=B[N-2];
}
}
}
for (_t=_tt1; _t<=T-1; _t+=1) {
B[1]=0.33333*(A[1+1]+A[1]+A[1-1]);
for (_i=2*_t+2; _i<=2*_t+N-2; _i+=1) {
B[-2*_t+_i]=0.33333*(A[1+-2*_t+_i]+A[-2*_t+_i]+A[-2*_t+_i-1]);
A[-2*_t+_i-1]=B[-2*_t+_i-1];
}
A[N-2]=B[N-2];
}
}
if (N==3) {
for (_tt1=0; _tt1<=T-1-(T1_t-(1)); _tt1+=T1_t) {
/* start of full tiles: _t */
for (_t=_tt1; _t<=_tt1+(T1_t-(1)); _t+=1) {
B[1]=0.33333*(A[1+1]+A[1]+A[1-1]);
A[1]=B[1];
}
/* end of full tiles: _t */
}
for (_t=_tt1; _t<=T-1; _t+=1) {
B[1]=0.33333*(A[1+1]+A[1]+A[1-1]);
A[1]=B[1];
}
}
In the output tiled code, the tile sizes are kept as symbolic parameters. Users can either declare the tile sizes as integer variables or define them as macros. The variable names of the tile sizes are in the fixed form of T{N}{i}, where N is the tiling level and i is the corresponding loop iterator name. For example, T2_i and T1_i are the level-2 tile size and the level-1 tile size for loop _i, respectively. Note that it is user's responsibility to make sure that a higher-level tile size is always a multiple of its corresponding lower-level tile size (e.g., T2_i % T1_i = 0).
If the file containing the macro definitions of all statements is not provided by users, PrimeTile will generate the tiled loops with statements in unexpanded forms (e.g., S1(_t,_i) and S2(_t,_i)). Also, loops that enumerate full rectangular tiles are enclosed in a segment annotated with /* start of full tiles */ and /* end of full tiles */. The identification of these full-tile loops can facilitate other compiler optimizations such as register tiling, vectorization, and software pipelining.
Several other examples of PrimeTile input files are provided in the 'examples' directory.
Steps in Code Generation By PrimeTile
As a pre-processing step, Pluto can be used for the extraction of polyhedral representation of the input loop iteration space, and for the generation of affine schedules that satisfy rectangular tileability condition. An adapted version of CLooG is used as PrimeTile's front-end to generate the rectangularly tileable loop code with preserved embedding information for all statements. The parser, the AST classes and the corresponding code generator available in Orio are used. The output of the parser feeds the tiling transformation module. The tiling module takes as input the loop ASTs extracted from CLooG output, and recursively transforms the ASTs for tiling. Parametric multi-level C code is finally generated by the code generator.
Adapted Version of CLooG
PrimeTile requires an adapted version of CLooG that can generate
loop code containing redundant "one-time" loops so that all
statements in every loop nest have as many surrounding loops as the
dimensionality of the corresponding iteration space.
The source code of the adapted CLooG can be downloaded via the following link.
Modified CLooG (developmental version 0.14.0-132)
To build the modified version of CLooG:
- tar -xvzf cloog.tar.gz
- cd cloog
- ./configure --without-polylib --with-isl=bundled --with-gmp-prefix=/path/to/gmp/installation
- make
- ./check
The GMP installation directory must be specified during configuration. The last step is to ensure that the installation succeeds.
Papers
-
Parametric Multi-Level Tiling of Imperfectly Nested Loops [ PDF | Extended version ]
Albert Hartono, Muthu Manikandan Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan.
ACM International Conference on Supercomputing (ICS), June 2009, Yorktown Heights, New York. -
Parameterized Tiling Revisited
Muthu Manikandan Baskaran, Albert Hartono, Thomas Henretty, J. Ramanujam, and P. Sadayappan.
IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Apr 2010, Toronto, Canada. (accepted) -
Parametric Tiled Loop Generation for Effective Parallel Execution on Multicore Processors
Albert Hartono, Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan.
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Apr 2010, Atlanta, Georgia. (submitted)
Funding
This work was supported in part by the National Science Foundation through awards 0403342, 0508245, 0509442, 0509467, 0541409, 0811457, and 0811781, and a State of Ohio Development Fund.
Related Work
Pluto -- a state-of-the-art polyhedral-based parallelizer and locality optimizer for multicore processors
CLooG -- a loop code generator for scanning Z-polyhedra
HiTLOG -- a tiled loop generator for perfectly nested loops

