Some Papers
CALMANT: A Systematic Method for the Execution of Hypercube Algorithms
in Multiprocessor Systems
Authors: Luis Díaz de Cerio, Miguel Valero-García, Antonio González, Dolors Royo
Reference: Computacion y Sistmeas Vol.4, No.4, pp.298-305, April 2001.
Abstract: In this work we present the CALMANT (CC-cube Algorithms on Meshes and Tori) method as a systematic method for the execution of a certain type of algorithms
that are denominated CC-cube algorithms, on meshes and tori of several dimensions.
It is very frequent to find CC-cube algorithms in literature (FFT, complete exchange, some methods for the single value decomposition and eigenvalue computation, etc.)
but the direct application of these algorithms does not allow to exploit efficiently the bandwidth that the interconnection network offers in meshes and tori.
The CALMANT method allows us to reorganize the computation and communication of the CC-cube algorithms so that the efficiency increases remarkably.
The importance of this method not only lies in the improvement of the efficiency but also it can be applied in a systematic way on different types of architectures.
CALMANT: Un Método Sistemático para la Ejecución de Algoritmos Hipercubo
en Sistemas Multiprocesador
Authors: Luis Díaz de Cerio, Miguel Valero-García, Antonio
González, Dolors Royo
Reference: Computacion y Sistmeas Vol.4, No.4, pp.289-297, April 2001.
Abstract: En este trabajo presentamos el método CALMANT (CC-cube Algorithms on
Meshes
And Tori) como un método sistemático para la ejecución de un cierto tipo de
algoritmos, que denominamos algoritmos CC-cubo, sobre mallas y toros de varias
dimensiones. Es muy frecuente encontrar algoritmos CC-cubo en la literatura
(FFT, complete exchange, cálculo de valores y vectores propios, etc.) pero
la aplicación directa de estos algoritmos no permite explotar eficientemente
el ancho de banda
que nos ofrecen las redes de interconexión en malla y toro. El método CALMAN
T
permite reorganizar los cálculos y las comunicaciones de los algoritmos CC-cubo
de manera que la eficiencia aumente notablemente. La importancia de este método
no sólo radica en el aumento de la eficiencia sino que además puede ser
aplicado de forma sistemática sobre diferentes tipos de arquitectura.
Active Yellow Pages: A Pipelined Resource Management Architecture for Wide-Area Network-Computing
Authors: Dolors Royo, Nirav H. Kapadia, Jose A. B. Fortes, Luis Díaz de Cerio
Reference: In proceedings of the 10th High Performance Distributed
Computing, pp.147-157, August 2001.
Abstract: This paper describes a novel, pipelined resource management
architecture for computational grids. The design is based on two key realizations.
One is that resource management involves a sequence of tasks that is best handled by a pipeline.
As shown in the paper, this approach results in a highly-scalable architecture for decentralized scheduling.
The other realization is that static aggregation of resources for improved schedulling is inadequate in wide-area computing environments because the needs of users and jobs change with both, location and time.
The described architecture addresses this problem by dynamically aggregating resources in a manner that continously optimizes system response.
This is accomplished by way of an active yellow pages directory that allows aggregation constraints to be (re)defined on the fly.
Active Yellow Pages (ActYP): a Resource Management System for PUNCH Educational Laboratories
Authors: Dolors Royo, Luis Díaz de Cerio
Reference: In proceedings of the European Perspectives on Computer-Supported Collaborative Learning,
pp.688-689,March 2001.
Complete Exchange Algorithms for Meshes and Tori Using a Systematic Approach
Authors: Luis Díaz de Cerio, Miguel Valero-García, Antonio González
Reference: In proceedings of the Europar00, pp.591-594, August 2000.
Abstract: Frequently, algorithms for a given multicomputer architecture cannot be used (or are not efficient) for a different architecture.
The proposed method allows the systematic design of complete exchange
algorithms for meshes and tori and it can be extended to some other
architectures that may be interesting in the future.
Divide-and-Conquer Algorithms on Two-Dimensional Meshes
Authors: Miguel Valero-García, Antonio González, Luis Díaz de Cerio and Dolors Royo
Reference: In proceedings of the Europar98, pp.1051-1056, September 1998.
Abstract: The Reflecting and Growing mappings have been proposed to map parallel
divide-and-conquer algorithms onto two - dimensional meshes. The performance properties of these
mappings have been analysed under the assumption that the parallel algorithm is initiated always
at the same fixed node of the mesh. In this scenario, the Reflecting mapping is optimal for meshes
with wormhole routing and the Growing mapping is very close to the optimal for meshes with
store-and-forward routing. In this paper we consider a more general scenario in which the parallel
divide-and-conquer algorithm can be started from an arbitrary node of the mesh.
We propose and approach that is simpler than both the Reflecting and Growing mappings, is also optimal
for wormhole meshes and better than the Growing mapping for store-and-forward meshes.
A Method For Exploiting Communication/Computation Overlap in Hypercubes
Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González
Reference: Parallel Computing, Vol.24, No.2, pp.221-245, June 1998.
Abstract: This paper presents a method to derive efficient algorithms for hypercubes.
The method exploits two features of the underlying hardware: a) the parallelism provided by the multiple
communication links of each node and b) the possibility of overlapping computations and communications which is a
feature of machines supporting an asynchronous communication protocol. The method can be applied to a generic
class of hypercube algorithms whose distinguishing features are quite frequent in common algorithms for
hypercubes. Many examples of this class of algorithms are found in the literature for different problems. The
paper shows the efficiency of the method for two case studies. The results show that the reduction in
communication overhead is very significant in many cases. They also show that the algorithms produced by our
method are always very close to the optimum in terms of execution time.
CALMANT: Un Método Sistemático para la Ejecución de Algoritmos con
Topología Hipercubo en Mallas y Toros
Authors: Luis Díaz de Cerio, Miguel Valero-García, Antonio
González and Dolors Royo
Reference: In proceedings of the VIII Jornadas de Paralelismo,
pp.231-240, September 1997.
Abstract: En este trabajo presentamos el método CALMANT (CC-cube
Algorithms on Meshes And Tori) como un método sistemático para la ejecución de
un cierto tipo de algoritmos que denominamos algoritmos CC-cubo sobre mallas y
toros de varias dimensiones. Es muy frecuente encontrar algoritmos CC-cubo en
la literatura (FFT, complete exchange, cálculo de valores y vectores propios,
etc.) pero la aplicación directa de estos algoritmos no permite explotar
eficientemente el ancho de banda que nos ofrecen las redes de interconexión en
malla y toro. El método CALMANT permite reorganizar los cálculos y las
comunicaciones de los algoritmos CC-cubo de manera que la eficiencia aumente
notablemente. La importancia de este método no sólo radica en el aumento de la
eficiencia sino que además puede ser aplicado de forma sistemática sobre
diferentes tipos de arquitectura.
Communication Pipelining in Hypercubes
Authors: Luis Díaz de Cerio, Antonio González and Miguel
Valero-García
Reference: Parallel Processing Letters, Vol.6, No.4, pp.507-523, December 1996.
Abstract: Many parallel algorithms exhibit a hypercube communication
topology. Such algorithms can easily be executed on a multicomputer with a
hypercube interconnection topology. However, in most cases these parallel
algorithms only make use of a small fraction of the interconnection bandwidth
offered by the multicomputer. In particular, each processor of a hypercube
multicomputer is connected to d different neighbors by d different
links. Nevertheless, hypercube algorithms usually do not use more than one
of these d links at the same time. This paper presents a technique called
communication pipelining that enables a more efficient use of the
interconnection network and, in consequence, a significant reduction in
the execution time. The paper presents analytical models that allow for an
optimal choice of the degree of pipelining for each problem and a given
architecture. To illustrate the performance of the communication
pipelining technique, its application to the FFT computation is presented
as an example. It is shown that an optimal choice of the degree of
pipelining can achieve a reduction by a factor of d in the communication
overhead of the algorithm.
Overlapping Communication and Computation in Hypercubes
Authors: Luis Díaz de Cerio, Miguel Valero-García and
Antonio González
Reference: Second International Euro-Par Conference, pp.253-257, 26-29
August, 1996.
Abstract: This paper presents a method to derive efficient algorithms
for hypercubes. The method exploits two features of the underlying hardware:
a) the parallelism provided by the multiple communication links of each node
and b) the possibility of overlapping computations and communications, which is
a feature of machines supporting an asynchronous communication protocol. The
method can be applied to a generic class of hypercube algorithms. Many examples
of this class of algorithms are found in the literature for different problems.
The paper shows the efficiency of the method using two of these problems as an
example: FFT and Vector Add. The results show that the reduction in
communication overhead is very significant in many cases and the algorithms
produced by our method are always very close to the optimum in terms of
execution time.
Executing Algorithms with Hypercube Topology on Torus Multicomputers
Authors: Antonio González, Miguel Valero-García and Luis Díaz de Cerio
Reference: IEEE Transactions on Parallel and Distributed Systems, Vol.6,
No.8, pp.803-814, August 1995.
Abstract: Many parallel algorithms use hypercubes as the communication
topology among their processes. When such algorithms are executed on hypercube
multicomputers the communication cost is kept minimum since processes can be
allocated to processors in such a way that only communication between neighbor
processors is required. However, the scalability of hypercube multicomputers
is constrained by the fact that the interconnection cost per node increases
with the total number of nodes. From scalability point of view, meshes and
toruses are more interesting classes of interconnection topologies. This paper
focuses on the execution of algorithms with hypercube communication topology
on multicomputers with mesh or torus interconnection topologies. The proposed
approach is based on looking at different embeddings of hypercube graphs onto
mesh or torus graphs. The paper concentrates on toruses since an already known
embedding, which is called standard embedding, is optimal for meshes. In this
paper, an embedding of hypercubes onto toruses of any given dimension is
proposed. This novel embedding is called xor embedding. The paper presents a
set of performance figures for both the standard and the xor embeddings and
shows that the latter outperforms the former for any torus. In addition, it is
proven that for a one-dimensional torus (a ring) the xor embedding is optimal
in the sense that it minimizes the execution time of a class of parallel
algorithms with hypercube topology. This class of algorithms is frequently
found in real applications, such as FFT and some class of sorting algorithms.
A Study of the Communication Cost of the FFT on Torus Multicomputers
Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González
Reference: IEEE First International Conference on Algorithms And Architectures for
Parallel Processing (ICA^3PP), pp.131-140, 19-21 April, 1995.
Abstract: In this paper, the computation of a one-dimensional FFT on a
c-dimensional torus multicomputer is analyzed. Different approaches are
proposed which differ in the way they use the interconnection network. The
first approach is based on the multidimensional index mapping technique for
the FFT computation. The second approach starts from a hypercube algorithm and
then embeds the hypercube onto the torus. The third approach reduces the
communication cost of the hypercube algorithm by pipelining the communication
operations. A novel methodology to pipeline the communication operations on a
torus is proposed. Analytical models are presented to compare the different
approaches. This comparison study shows that the best approach depends on the
number of dimensions of the torus and the communication start-up and transfer
times. The analytical models allow us to select the most efficient approach
for the available machine.
Efficient FFT on Torus Multicomputers: A Performance Study
Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González
Reference: Proceedings of the 2nd Austrian-Hungarian Workshop on Transputer
Applications, pp.233-242, September 29-October 1, 1994.
Abstrac: In this paper, the problem of computing a one-dimensional FFT
on a c-dimensional torus multicomputer is focused. Different approaches are
proposed which differ in the way they use the interconnection network of the
torus. One of the approaches is based on the multidimensional index mapping
technique for FFT computation. A second approach is based on embedding on the
torus a hypercube algorithm for computing the radix-2 Cooley-Tukey FFT. The
third approach reduces the communication cost of the hypercube algorithm
through the communication pipelining technique. Analytical models are
presented to compare the different approaches. Finally, some performance
estimates are given to illustrate the comparison.