Some Research Reports


Embedding Hypercubes onto Rings and Toruses

Authors: Antonio González, Miguel Valero-García and Luis Díaz de Cerio

Reference: UPC-DAC-93-01

Abstract: Many parallel algorithms use hypercubes as the communication topology among processes. When such algorithms are executed on hypercube multicomputers the communication cost is kept minimum since processes can be allocated to processors in such a way that only communication between neighbor processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost per node increases with the total number of nodes. From the point of view of scalability, meshes and toruses are a more interesting class of interconnection topologies. This paper focuses on the execution of algorithms with hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses since an already known embedding, called standard embedding is optimal for meshes. In this paper we propose an embedding of hypercubes onto toruses of any given dimension. We will refer to this embedding as xor embedding. The paper presents a set of performance figures for both embeddings and shows that the xor embedding outperforms the standard embedding for any torus. In addition, we prove that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as FFT and some class of sorting algorithms.

Modelado de primitivas Blas y evaluación de la descomposición QR sobre un procesador del Convex C-3400

Authors:Luis Díaz de Cerio, Miguel Valero-García

Reference: UPC-CEPBA-93-07

Abstract: En este trabajo se propone un método de modelado de primitivas BLAS sobre la arquitectura vectorial de un procesador del CONVEX C-3400. El objetivo es modelar y evaluar el rendimiento de algoritmos expresados en términos de dichas primitivas. El estudio está dirigido al caso particular de algoritmos para la descomposión QR codificados en cada nivel diferente de primitivas BLAS. El método de modelado de las primitivas BLAS está basado en medidas experimentales y un ajuste por mínimos cuadrados. De esta manera evitamos excesivas dependencias con la arquitectura. El modelado de los algoritmos de la descomposición QR esta basado a su vez en los modelos de las primitivas BLAS previamente desarrollados. La evaluación del rendimiento de los diferentes algoritmos se lleva a cabo a través de resultados experimentales y modelados. La finalidad de esta evaluación es comparar los diferentes algoritmos entre si, dar muestra de la validez del método desarrollado, y determinar el algoritmo óptimo para la descomposición QR sobre un procesador del Convex C-3400.

A Study of the Communication Cost of the FFT on Torus Multicomputers

Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González

Reference: UPC-DAC-94-25

Abstract: In this paper, the computation of a one-dimensional FFT on a c-dimensional torus multicomputer is analyzed. Different approaches are proposed which differ in the way they use the interconnection network. The first approach is based on the multidimensional index mapping technique for the FFT computation. The second approach starts from a hypercube algorithm and then embeds the hypercube onto the torus. The third approach reduces the communication cost of the hypercube algorithm by pipelining the communication operations. A novel methodology to pipeline the communication operations on a torus is proposed. Analytical models are presented to compare the different approaches. This comparison study shows that the best approach depends on the dimensionality of the torus and the communication start-up and transfer times. The analytical models allow us to select the most efficient approach for the available machine.

Communication Pipelining in Hypercubes

Authors: Luis Díaz de Cerio, Antonio González and Miguel Valero-García

Reference: UPC-DAC-95-14

Abstract: Many parallel algorithms exhibit a hypercube communication topology. Such algorithms can easily be executed on a multicomputer with a hypercube interconnection topology. However, in most cases these parallel algorithms only make use of a small fraction of the interconnection bandwidth offered by the multicomputer. In particular, each processor of a hypercube multicomputer is connected to d different neighbors by d different links. Nevertheless, hypercube algorithms usually do not use more than one of these d links at the same time. This paper presents a technique called communication pipelining that enables a more efficient use of the interconnection network and, in consequence, a significant reduction in the execution time. This technique is based on a transformation of the original algorithm. The resulting equivalent code makes use of several links of each node simultaneously. Given a particular problem and a particular architecture, the degree of pipelining to be applied is a design parameter that must be decided when transforming the original algorithm. The paper presents analytical models that allow for an optimal choice of the degree of pipelining for each problem and a given architecture. To illustrate the performance of the communication pipelining technique, its application to the FFT computation is presented as an example. It is shown that an optimal choice of the degree of pipelining can achieve a reduction by a factor of d in the communication overhead of the algorithm.

Overlapping Communication and Computation in Hypercubes

Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González

Reference: UPC-DAC-96-02

Abstract: This paper presents a method to derive efficient algorithms for hypercubes. The method exploits two features of the underlying hardware: a) the parallelism provided by the multiple communication links of each node and b) the possibility of overlapping computations and communications which is a feature of machines supporting an asynchronous communication protocol. The method can be applied to a generic class of hypercube algorithms whose distinguishing features are quite frequent in common algorithms for hypercubes. Many examples of this class of algorithms are found in the literature for different problems. The paper shows the efficiency of the method using two of these problems as an example: FFT and Vector Add. The results show that the reduction in communication overhead is very significant in many cases. They also show that the algorithms produced by our method are always very close to the optimum in terms of execution time.

A Systematic Approach to Develop Efficient Complete Exchange Algorithms for Meshes and Tori

Authors: Luis Díaz de Cerio, Miguel Valero-García and Antonio González

Reference: UPC-DAC-97-29

Abstract: Many authors have considered the design of complete exchange algorithms for a variety of multicomputer models, including hypercubes, multidimensional meshes and tori with different port and message switching models. Frequently, algorithms for a given multicomputer architecture cannot be used (or are not efficient) for a different architecture. This paper presents a method which allows the systematic design of complete exchange algorithms for a wide range of multicomputer architectures, including the cases usually considered in the literature and some other architectures that may be interesting in the future. Performance figures obtained by analytical models show that algorithms obtained through the proposed method are efficient for almost all the multicomputer models under consideration and outperform the best known algorithms for a significant range of the problem and system parameters.

Divide-and-Conquer Algorithms on Two-Dimensional Meshes

Authors:Miguel Valero-García, Antonio González, Luis Díaz de Cerio, Dolors Royo

Reference: UPC-DAC-97-30

Abstract: The Reflecting and Growing mappings have been proposed to map parallel divide-and-conquer algorithms onto two-dimensional meshes. The performance properties of these mappings have been analysed under the assumption that the parallel algorithm is initiated always at the same fixed node of the mesh. In this scenario, the Reflecting mapping is optimal for meshes with wormhole routing and the Growing mapping is very close to the optimal for meshes with store-and-forward routing. In this paper we consider a more general scenario in which the parallel divide-and-conquer algorithm can be started from an arbitrary node of the mesh. It is shown that, in this new scenario, while the Reflecting mapping is still optimal for a wormhole mesh, the performance of the Growing mapping in store-and-forward meshes decreases very quickly when the mesh size increases. An alternative approach is proposed which, besides being simpler than both the Reflecting and Growing mappings, is also optimal for wormhole meshes and better than the Growing mapping for store-and-forward meshes.

CALMANT: Un Método Sistemático para la Ejecución de Algoritmos con Topología Hipercubo en Mallas y Toros

Authors:Luis Díaz de Cerio, Miguel Valero-García, Antonio González, Dolors Royo

Reference: UPC-DAC-97-36

Abstract: En este trabajo presentamos el método CALMANT (CC-cube Algorithms on Meshes And Tori) como un método sistemático para la ejecución de un cierto tipo de algoritmos que denominamos algoritmos CC-cubo sobre mallas y toros de varias dimensiones. Es muy frecuente encontrar algoritmos CC-cubo en la literatura (FFT, complete exchange, cálculo de valores y vectores propios, etc.) pero la aplicación directa de estos algoritmos no permite explotar eficientemente el ancho de banda que nos ofrecen las redes de interconexión en malla y toro. El método CALMANT permite reorganizar los cálculos y las comunicaciones de los algoritmos CC-cubo de manera que la eficiencia aumente notablemente. La importancia de este método no sólo radica en el aumento de la eficiencia sino que además puede ser aplicado de forma sistemática sobre diferentes tipos de arquitectura.

Hypercube Algorithms on Mesh Connected Multicomputers

Authors:Luis Díaz de Cerio, Miguel Valero-García, Antonio González

Reference: UPC-DAC-2000-1

Abstract: A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) for mapping a kind of algorithms that we call CC-cube algorithms onto multicomputers with hypercube, mesh or torus interconnection topology is proposed. This methodology is suitable when the initial problem, can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). There are many important algorithms that fit into the CC-cube type. CALMANT is based on three different techniques: a) the standard embedding to assign the processes of the algorithm to the nodes of the mesh multicomputer; b) the communication pipelining technique to increase the level of communication parallelism inherent in the CC-cube algorithms; and c) optimal message-routing algorithms proposed in this work in order to avoid conflicts and minimizing in this way the communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, this paper only focuses on the particular case of meshes.

CALMANT: Un Método Sistemático para la Ejecución de Algoritmos con Topología Hipercubo en Multicomputadores

Authors:Luis Díaz de Cerio, Antonio González, Miguel Valero-García

Reference: UPC-DAC-2000-13

Abstract: En este trabajo presentamos el método CALMANT (CC-cube Algorithms on Meshes And Tori) como un método sistemático para la ejecución de un cierto tipo de algoritmos, que denominamos algoritmos CC-cubo, sobre mallas y toros de varias dimensiones. Es muy frecuente encontrar algoritmos CC-cubo en la literatura (FFT, complete exchange, cálculo de valores y vectores propios, etc.) pero la aplicación directa de estos algoritmos no permite explotar eficientemente el ancho de banda que nos ofrecen las redes de interconexión en malla y toro. El método CALMANT permite reorganizar los cálculos y las comunicaciones de los algoritmos CC-cubo de manera que la eficiencia aumente notablemente. La importancia de este método no sólo radica en el aumento de la eficiencia sino que además puede ser aplicado de forma sistemática sobre diferentes tipos de arquitectura.