Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Matteo Turisini

Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality

Aug 24, 2025

Daniele De Sensi, Saverio Pasqualoni, Lorenzo Piarulli, Tommaso Bonato, Seydou Ba, Matteo Turisini, Jens Domke, Torsten Hoefler

Abstract:Communication locality plays a key role in the performance of collective operations on large HPC systems, especially on oversubscribed networks where groups of nodes are fully connected internally but sparsely linked through global connections. We present Bine (binomial negabinary) trees, a family of collective algorithms that improve communication locality. Bine trees maintain the generality of binomial trees and butterflies while cutting global-link traffic by up to 33%. We implement eight Bine-based collectives and evaluate them on four large-scale supercomputers with Dragonfly, Dragonfly+, oversubscribed fat-tree, and torus topologies, achieving up to 5x speedups and consistent reductions in global-link traffic across different vector sizes and node counts.

* Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '25) (2025)

Via

Access Paper or Ask Questions

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Aug 26, 2024

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi(+4 more)

Figure 1 for Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Figure 2 for Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Figure 3 for Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Figure 4 for Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Abstract:Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.

* Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '24) (2024)

Via

Access Paper or Ask Questions