Some Research Reports
Embedding Hypercubes onto Rings and Toruses
Authors: Antonio González, Miguel Valero-García and Luis Díaz de Cerio
Reference: UPC-DAC-93-01
Abstract: Many parallel algorithms use hypercubes as the communication
topology among processes. When such algorithms are executed on hypercube
multicomputers the communication cost is kept minimum since processes can be
allocated to processors in such a way that only communication between neighbor
processors is required. However, the scalability of hypercube multicomputers
is constrained by the fact that the interconnection cost per node increases
with the total number of nodes. From the point of view of scalability, meshes
and toruses are a more interesting class of interconnection topologies. This
paper focuses on the execution of algorithms with hypercube communication
topology on multicomputers with mesh or torus interconnection topologies. The
proposed approach is based on looking at different embeddings of hypercube
graphs onto mesh or torus graphs. The paper concentrates on toruses since an
already known embedding, called standard embedding is optimal for meshes. In
this paper we propose an embedding of hypercubes onto toruses of any given
dimension. We will refer to this embedding as xor embedding. The paper
presents a set of performance figures for both embeddings and shows that the
xor embedding outperforms the standard embedding for any torus. In addition,
we prove that for a one-dimensional torus (a ring) the xor embedding is
optimal in the sense that it minimizes the execution time of a class of
parallel algorithms with hypercube topology. This class of algorithms is
frequently found in real applications, such as FFT and some class of sorting
algorithms.
Modelado de primitivas Blas y evaluación de la descomposición QR sobre un procesador del Convex C-3400
Authors:Luis Díaz de Cerio, Miguel Valero-García
Reference: UPC-CEPBA-93-07
Abstract: En este trabajo se propone un método de modelado de
primitivas BLAS sobre la arquitectura vectorial de un procesador del CONVEX
C-3400. El objetivo es modelar y evaluar el rendimiento de algoritmos
expresados en términos de dichas primitivas. El estudio está
dirigido al caso particular de algoritmos para la descomposión QR
codificados en cada nivel diferente de primitivas BLAS. El método de
modelado de las primitivas BLAS está basado en medidas experimentales
y un ajuste por mínimos cuadrados. De esta manera evitamos excesivas
dependencias con la arquitectura. El modelado de los algoritmos de la
descomposición QR esta basado a su vez en los modelos de las primitivas
BLAS previamente desarrollados. La evaluación del rendimiento de los
diferentes algoritmos se lleva a cabo a través de resultados
experimentales y modelados. La finalidad de esta evaluación es comparar
los diferentes algoritmos entre si, dar muestra de la validez del método
desarrollado, y determinar el algoritmo óptimo para la
descomposición QR sobre un procesador del Convex C-3400.
A Study of the Communication Cost of the FFT on Torus Multicomputers
Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González
Reference: UPC-DAC-94-25
Abstract: In this paper, the computation of a one-dimensional FFT on a
c-dimensional torus multicomputer is analyzed. Different approaches are
proposed which differ in the way they use the interconnection network. The
first approach is based on the multidimensional index mapping technique for
the FFT computation. The second approach starts from a hypercube algorithm and
then embeds the hypercube onto the torus. The third approach reduces the
communication cost of the hypercube algorithm by pipelining the communication
operations. A novel methodology to pipeline the communication operations on a
torus is proposed. Analytical models are presented to compare the different
approaches. This comparison study shows that the best approach depends on the
dimensionality of the torus and the communication start-up and transfer times.
The analytical models allow us to select the most efficient approach for the
available machine.
Communication Pipelining in Hypercubes
Authors: Luis Díaz de Cerio, Antonio González and Miguel Valero-García
Reference: UPC-DAC-95-14
Abstract: Many parallel algorithms exhibit a hypercube communication
topology. Such algorithms can easily be executed on a multicomputer with a
hypercube interconnection topology. However, in most cases these parallel
algorithms only make use of a small fraction of the interconnection bandwidth
offered by the multicomputer. In particular, each processor of a hypercube
multicomputer is connected to d different neighbors by d different links.
Nevertheless, hypercube algorithms usually do not use more than one of these d
links at the same time. This paper presents a technique called communication
pipelining that enables a more efficient use of the interconnection network
and, in consequence, a significant reduction in the execution time. This
technique is based on a transformation of the original algorithm. The
resulting equivalent code makes use of several links of each node
simultaneously. Given a particular problem and a particular architecture, the
degree of pipelining to be applied is a design parameter that must be decided
when transforming the original algorithm. The paper presents analytical models
that allow for an optimal choice of the degree of pipelining for each problem
and a given architecture. To illustrate the performance of the communication
pipelining technique, its application to the FFT computation is presented as
an example. It is shown that an optimal choice of the degree of pipelining can
achieve a reduction by a factor of d in the communication overhead of the
algorithm.
Overlapping Communication and Computation in Hypercubes
Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González
Reference: UPC-DAC-96-02
Abstract: This paper presents a method to derive efficient algorithms
for hypercubes. The method exploits two features of the underlying hardware:
a) the parallelism provided by the multiple communication links of each node
and b) the possibility of overlapping computations and communications which is
a feature of machines supporting an asynchronous communication protocol. The
method can be applied to a generic class of hypercube algorithms whose
distinguishing features are quite frequent in common algorithms for hypercubes.
Many examples of this class of algorithms are found in the literature for
different problems. The paper shows the efficiency of the method using two of
these problems as an example: FFT and Vector Add. The results show that the
reduction in communication overhead is very significant in many cases. They
also show that the algorithms produced by our method are always very close to
the optimum in terms of execution time.
A Systematic Approach to Develop Efficient Complete Exchange Algorithms for Meshes and Tori
Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González
Reference: UPC-DAC-97-29
Abstract: Many authors have considered the design of complete exchange
algorithms for a variety of multicomputer models, including hypercubes,
multidimensional meshes and tori with different port and message switching
models. Frequently, algorithms for a given multicomputer architecture cannot be
used (or are not efficient) for a different architecture. This paper presents a
method which allows the systematic design of complete exchange algorithms for a
wide range of multicomputer architectures, including the cases usually
considered in the literature and some other architectures that may be
interesting in the future. Performance figures obtained by analytical models
show that algorithms obtained through the proposed method are efficient for
almost all the multicomputer models under consideration and outperform the best
known algorithms for a significant range of the problem and system parameters.
Divide-and-Conquer Algorithms on Two-Dimensional Meshes
Authors:Miguel Valero-García, Antonio González, Luis Díaz de Cerio, Dolors Royo
Reference: UPC-DAC-97-30
Abstract: The Reflecting and Growing mappings have been proposed to map
parallel divide-and-conquer algorithms onto two-dimensional meshes. The
performance properties of these mappings have been analysed under the
assumption that the parallel algorithm is initiated always at the same fixed
node of the mesh. In this scenario, the Reflecting mapping is optimal for
meshes with wormhole routing and the Growing mapping is very close to the
optimal for meshes with store-and-forward routing. In this paper we consider a
more general scenario in which the parallel divide-and-conquer algorithm can be
started from an arbitrary node of the mesh. It is shown that, in this new
scenario, while the Reflecting mapping is still optimal for a wormhole mesh,
the performance of the Growing mapping in store-and-forward meshes decreases
very quickly when the mesh size increases. An alternative approach is proposed
which, besides being simpler than both the Reflecting and Growing mappings, is
also optimal for wormhole meshes and better than the Growing mapping for
store-and-forward meshes.
CALMANT: Un Método Sistemático para la Ejecución de Algoritmos con Topología Hipercubo en Mallas y Toros
Authors:Luis Díaz de Cerio, Miguel Valero-García, Antonio González, Dolors Royo
Reference: UPC-DAC-97-36
Abstract: En este trabajo presentamos el método CALMANT (CC-cube
Algorithms on Meshes And Tori) como un método sistemático para la ejecución de
un cierto tipo de algoritmos que denominamos algoritmos CC-cubo sobre mallas y
toros de varias dimensiones. Es muy frecuente encontrar algoritmos CC-cubo en
la literatura (FFT, complete exchange, cálculo de valores y vectores propios,
etc.) pero la aplicación directa de estos algoritmos no permite explotar
eficientemente el ancho de banda que nos ofrecen las redes de interconexión en
malla y toro. El método CALMANT permite reorganizar los cálculos y las
comunicaciones de los algoritmos CC-cubo de manera que la eficiencia aumente
notablemente. La importancia de este método no sólo radica en el aumento de la
eficiencia sino que además puede ser aplicado de forma sistemática sobre
diferentes tipos de arquitectura.
Hypercube Algorithms on Mesh Connected Multicomputers
Authors:Luis Díaz de Cerio, Miguel Valero-García, Antonio González
Reference: UPC-DAC-2000-1
Abstract: A new methodology named CALMANT (CC-cube Algorithms on Meshes
and Tori) for mapping a kind of algorithms that we call CC-cube algorithms onto
multicomputers with hypercube, mesh or torus interconnection topology is
proposed. This methodology is suitable when the initial problem, can be
expressed as a set of processes that communicate through a hypercube topology
(a CC-cube algorithm). There are many important algorithms that fit into the
CC-cube type. CALMANT is based on three different techniques: a) the standard
embedding to assign the processes of the algorithm to the nodes of the mesh
multicomputer; b) the communication pipelining technique to increase the level
of communication parallelism inherent in the CC-cube algorithms; and c) optimal
message-routing algorithms proposed in this work in order to avoid conflicts
and minimizing in this way the communication time. Although CALMANT is
proposed for multicomputers with different interconnection network topologies,
this paper only focuses on the particular case of meshes.
CALMANT: Un Método
Sistemático para la Ejecución de Algoritmos con Topología Hipercubo en Multicomputadores
Authors:Luis Díaz de Cerio, Antonio González, Miguel Valero-García
Reference: UPC-DAC-2000-13
Abstract: En este trabajo presentamos el método CALMANT (CC-cube Algorithms on Meshes
And Tori) como un método sistemático para la ejecución de un cierto tipo de
algoritmos, que denominamos algoritmos CC-cubo, sobre mallas y toros de varias
dimensiones. Es muy frecuente encontrar algoritmos CC-cubo en la literatura
(FFT, complete exchange, cálculo de valores y vectores propios, etc.) pero
la aplicación directa de estos algoritmos no permite explotar eficientemente el ancho de banda
que nos ofrecen las redes de interconexión en malla y toro. El método CALMANT
permite reorganizar los cálculos y las comunicaciones de los algoritmos CC-cubo
de manera que la eficiencia aumente notablemente. La importancia de este método
no sólo radica en el aumento de la eficiencia sino que además puede ser
aplicado de forma sistemática sobre diferentes tipos de arquitectura.