Parametrized tiling: Accurate approximations and analysis of global dependences

Aspects of parametrized tiling as applied to algorithms whose computational domain can be represented as a convex polyhedron are studied. A method for constructing approximations to a set of tiles is developed, and necessary and sufficient conditions for their accuracy are stated. Formulas for determining intertile vectors are derived. A formal representation of iteration sets generating such vectors is obtained in the form of polyhedra with explicitly expressed boundaries.


INTRODUCTION
Tiling is widely used in the design of efficient software products.The goal of tiling is to increase the graininess of an algorithm.Specifically, the set of algorithmic operations is partitioned into groups (called tiles), each regarded as a computation grain or macro operation.Due to tiling, the efficiency of using multilevel computer memory can be improved significantly and data exchange operations in parallel applications for distributed memory computer systems can be optimized.
Tiling is applied to algorithms that can be represented in the form of loop nests.Overall, there are two basic approaches to the application of tiling.The first is based on coordinate tiling (see [2,[4][5][6][7]), which is a narrow special case of tiling and suffers considerable applicability limitations.Specifically, coordinate tiling can be applied to an algorithm if its loops are completely or partially permutable.If this condition is not satisfied, the algorithm is preliminarily brought to a form acceptable for coordinate tiling by applying an affine transformation.Thus, whereas coordinate tiling directly simplifies the transformation of the algorithmic software code, the arising additional problem of searching for and executing a preliminary transformation of the algorithm can considerably complicate this tiling approach in applications.
The second approach is based on a more general form of tiling than the coordinate one [1,3,[8][9][10][11].This tiling technique has a considerably wider range of applicability and, as a consequence, does not require a preliminary affine transformation of the algorithm.However, the complexity of transforming the code of the algorithm increases.
Most tiling related studies address algorithms intended for multicore systems, i.e., shared memory sys tems (see [2, 4-7, 9, 11]).The basic task is to optimize the tile shape and size in terms of access to com puter memory and to construct a finite algorithmic code.
When tiling is applied to algorithms intended for distributed memory systems, additional aspects asso ciated with allowance for and management of interprocessor communications have to be considered (see [1,3,10]).The tiling parameters have to be chosen so as to optimize not only access to memory but also interprocessor communications.As a result, additional problems arise that are associated with identifying intertile vectors and, hence, intertile communications, determining the amount of communications, and others.
The tiling problem for distributed memory systems has been less studied than that for shared memory systems, although the former is no less important, since distributed memory systems are extensively used for massively parallel computations.
In this paper, we study some aspects of tiling as applied to algorithms intended for distributed memory systems.It is assumed that the computational domain of an algorithm can be represented as a convex poly hedron.Tiling transforms the original loop nests into ones with a larger (as a rule, doubled) number of nested loops.Formally, the set of loops can be divided into two groups: intertile loops specifying the order of algorithmic operations executed at the level of tiles and intratile loops describing the order of operations executed within a single tile.In the general case, the generation of intertile loops is a nontrivial problem, since the boundaries of the range of intertile loop parameters or the boundaries of an approximation of this range, for example, by a convex polyhedron have to be described explicitly.In the latter case, approx imations have to be constructed so as to minimize the number of "empty tiles." In addition to explicitly specifying the boundaries of intertile loops, we study the problem of determin ing global intertile vectors (vectors of dependences between tiles) and address the problem of analyzing and managing intertile communication.
In [2,[4][5][6][9][10][11] the above problems were solved under substantial constraints.More specially, the case of coordinate tiling was considered in [2,[4][5][6].In [9,10] tiling was applied to algorithms with a sim ple computational domain (a rectangular box in [9] and the intersection of special boxes in [10]).In [11] approximations to a set of tiles were presented and sufficient conditions for their accuracy as applied to algorithms whose computational domain can be represented by a convex polyhedron were found.
In this paper, a method for constructing approximations to a set of tiles is developed and necessary and sufficient conditions for their accuracy are given.Formulas for determining global intertile vectors are derived, and a formal representation of sets of iterations generating such vectors is obtained in the form of polyhedra with explicitly expressed boundaries.
1. TILING Tiling is a transformation of a numerical algorithm aimed at an increase in its graininess by uniting algorithmic operations in large computation grains-tiles.In tiling, the graininess is increased by covering the computational domain with same type n dimensional boxes (tiles).In what follows, we assume that the computational domain (index set) V of an algorithm is an n dimensional convex polyhedron.In a Car tesian coordinate system in the arithmetic space , V is defined as where is an integer matrix and is an integer vector; both are parameters determining the computational domain.Each point of V is associated with an algorithmic operation.
Let g(J) be a vector function on the set By the value of the expression , we mean a vector made up of the least values of the coordinates of g(J) on V. Note that the least values can be attained at different .The expression is defined in an analogous manner.
The arithmetic space and, hence, the computational domain V of the algorithm are partitioned into tiles by applying a nonlinear mapping : The mapping f is determined by three parameters , , and and makes it possible to identify tiles with the vector .Below, these parameters are specified without analyzing whether the corresponding tiling is well defined.We assume that the tiling parameters are chosen so that the well definedness of the tiling holds.
The matrix is composed rowwise of vectors (k = 1, 2, …, n) with relatively prime coordinates, i.e., of normal vectors to hyperplanes passing through integer points of .The vectors (k = 1, 2, …, n) determine the shape of a tile.In this paper, H is assumed to be a lower triangular matrix with 1's on the main diagonal.
The diagonal matrix determines the size of a tile, where is the number of different parallel hyperplanes with a normal vector (k = 1, 2, …, n) that pass through all integer points of a single tile.Additionally, we define , , and To use tiling in applications, the sets and are defined as where the boundary coordinates of the points and the coordinates of the points are explicitly defined in terms of the tiling parameters, the parameters determining the computational domain, and the coordinates of the corresponding points with smaller indices.In the general case, when the computational domain is given by (1.1), the corresponding definitions (1.3) and (1.4) do not contain explicit expressions for these boundaries.
Dropping the condition and thus admitting redundant computations, we use the same tiling parameters to construct larger sets of tiles that approximate the set , in which the range boundaries for the global coordinates of points are explicitly expressed in terms of the til ing parameters , the parameters of V, and the global coordinates of these points with smaller indi ces.Following the terminology of [2], these extensions are called approximations of a tile set.The accuracy of an approximation is estimated by the number of empty tiles contained in .The fewer ) ) . .
, , the empty tiles in , the more accurately it approximates the set .Obviously, an approximation is accurate (i.e., ) if it does not contain empty tiles.
In many computational algorithms, the computational domain is a convex polyhedron of form (1.1), where the dependence of the range boundaries for the coordinates of each point of this polyhedron on the integer vector I characterizing polyhedron faces and on the local coordinates of this point with smaller indices are described by explicitly defined functions.
In what follows, we assume that computational domain (1.1) of the algorithm can be represented as where the functions and satisfy the inequality for all admissible values of the parameters and for i = 1, 2, …, n and k = 1, 2, …, n.Below is a cri terion for the accuracy of an approximation in the case of computational domains representable in the form of (1.5).
, , exists a tile that does not contain any points of V.This means that, for any point , includ ing the above point , the following strict inequality holds for some index : Since the strict inequality holds if either , which is equivalent to the inequality , or , which is equivalent to the inequality .
Note that, if for any tile , there is a point such that (1.7) then inequalities (1.6) hold as well.Thus, inequalities (1.7) are a sufficient condition for the accuracy of the approximation .
2. BOUNDING METHODS In bounding methods, the computational domain of an algorithm is enlarged and accurate approxima tions to the set of tiles covering the enlarged domain are constructed.We consider two bounding methods [2].In the first, the computational domain is bounded by a box.In the second bounding method, the original polyhedral computational domain V is enlarged by parallel translations of some of its faces.

If the chosen tiling parameters
are such that, for any tile , there is a point such that then the approximation is accurate (i.e., ).
Proof.Formula (2.1) is well defined, which follows from definition (1.4) of a set of tiles covering : The accuracy of the approximation follows from Proposition 1, since (2.2) is a special case of formula (1.7), which is a sufficient condition for the accuracy of an approximation.The proposition is proved.
Obviously, if , then the box bounding method may lead to redundant computations because of the potential presence of empty tiles.
Consider the second bounding method.The polyhedral computational domain V given by (1.1) is enlarged by shifting some of its faces determined by the vector parameter I so that, in view the chosen tiling parameters , the enlarged domain contains the initial vertices of all tiles having a non empty intersection with V.
Proposition 3. Suppose that the computational domain of an algorithm can be represented as (1.5).Then, for fixed tiling parameters , an approximating tile set is defined as Given fixed tiling parameters , if for any tile and any particular value k = 1, 2, …, n, there is a point such that then the approximation is accurate.
Proof.Suppose that a tile with an initial vertex has a nonempty intersection with the compu tational domain given by (1.1).This means that the tile contains points among which there is one at which the greatest value is attained.Taking into account the defini tion = for the set of points belonging to , we have The inequality implies , which yields the definition of the set and formula (2.3).
To complete the proof, we note that inequalities (2.4) are derived by a simple transformation of ine qualities (1.7).The proposition is proved.
, , , , , ), 1,2, , , . . , ( ) Note that each of the above approximations and can contain empty tiles.Considering the intersection of these approximating sets (for the same parameters ), we can expect that the approx imation , includes fewer or no empty tiles.

GLOBAL TILE DEPENDENCES
Below, for algorithms with homogeneous dependences, we study global intertile vectors.An algorithm with homogeneous dependences is characterized by a computational domain V and a set Φ of dependence vectors.Given an algorithm with homogeneous dependences, the existence of a vec tor means that, for any pair of points the operations assigned to depend infor mationally on the results of the operations assigned to J.For tiling, the existence of information depen dence between algorithmic operations leads to a dependence between tiles.Dependences between tiles are characterized by global intertile vectors.Definition 2. A vector is a global intertile vector if there are points and such that , , , .
Thus, any pair of points from different tiles determines the existence of a global informa tion dependence between the corresponding tiles.The set of all global intertile vectors in an algorithm is denoted by . The presence of a vector means that there exists at least one pair of information ally dependent tiles , It should be noted that, in contrast to the homogeneity of the orig inal algorithm in terms of operations, in the general case, the algorithm is not homogeneous in terms of tiles.
Obviously, there is a direct relation between the values of the intertile vectors and the values of the global intertile vectors .Depending on chosen tiling parameters, each vector can gen erate a number of global intertile vectors.
Proposition 4. Given an algorithm with homogeneous dependences, let be a fixed dependence vector.Then, under tiling with parameters H, R, and , the kth coordinate of the generated global intertile vector takes at most two values, which are determined by the double inequality . ( Proof.Suppose that the dependence vector generates a global intertile vector ; i.e., there are points and such that This means that the vectors satisfy the inequalities and which are equivalent to the double inequality ≤ .This inequality implies the estimate -≤ , which means that, for any index k = 1, 2, …, n, The proposition is proved.Proposition 5. Given an algorithm with homogeneous dependences, let be a fixed dependence vector.Then, under tiling with parameters H, R, and , for any vector with and . ) ,

…, n, and for any tile
, there exists a point such that .Proof.Let the assumption of the proposition be satisfied.Fix a vector and a tile .To prove the proposition, we need to show that there exists a point satisfying two inequalities or, equivalently, the inequality For arbitrary and , the following estimates are easy to prove: By the definition of the vector , the kth coordinate of the vector is defined as Taking into account the above estimates, the coordinates of the vector are estimated as and if respectively.Then inequality (3.2) can be represented as a system of n inequalities of the form or Consider the extreme points of the tile .Any point satisfies the equality .As a parameter λ, we use the vector with coordinates and if respectively.Then the coordinates of the vector are given by These relations imply that inequality (3.2) holds for the extreme point of the tile .Proposition 5 is proved. , , ) . .

SET OF ITERATIONS GENERATING HOMOGENEOUS DEPENDENCES BETWEEN TILE OPERATIONS
The application of tiling to an algorithm generates global intertile dependences characterized by global intertile vectors .When tiling is used in the design of parallel applications intended for distributed memory systems, the existence of intertile dependences means that corresponding communications are required when a parallel program is executed.In this context, an important task is to determine the set of tile points whose operations are informationally related to those assigned to points of other tiles.
For each tile , a vector , and the corresponding generated vector , we define the following sets: , which is the set of points of that belong to the computational domain of the algorithm and are infor mationally dependent on the points of the computational domain that belong to the tile ; and , which is the set of points in whose operations directly influence the result produced by the oper ations of the points of the tile .
According to the above definitions of and , it is easy to show that (4.1) In view of relation (4.1), these sets can be formally expressed in terms of each other.By applying the fol lowing proposition, can be determined in explicit form.
Proposition 6. Suppose that the computational domain of an algorithm can be represented in the form of (1.5) .
, , , , ), 1,2, , . ) or, equivalently, by the inequality These inequalities, which determine the range boundaries for the coordinates of points of the desired domain, imply representation (4.2) of the set , as required.The proposition is proved.
Note that, by using relation (4.1) and Proposition 6, the set can be determined in a similar manner.Then the approximating sets are given by , , , .
The approximation is accurate, since, for the tiling parameters chosen and for any tile , there is a point at which double inequality (2.4) a J J a J J 2,0, ( 1)  The results obtained in this work can be used to develop and optimize both sequential and parallel algo rithms intended for distributed memory computer systems.
5. EXAMPLEConsider the following forward sweep algorithm for Gaussian elimination as applied to systems of lin ear algebraic equations:do = 1, N -1 do = + 1, N do = + 1, N + 1 : a( , ) = a( , ) -a( , )enddo enddo enddo In this example, n = 3 and the set of iterations corresponding to the operator can be represented asIn this example, we apply tiling with the parameters , ) holds.Inequalities(2.4)aresatisfied by any point with the first coordinate given by .To illustrate Proposition 6 with this example, we consider the true dependence gen erated by the intertile vector .Proposition 4 implies that, with this choice of tiling parameters, two