Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Michael Houle

Accelerating AI Development with Cyber Arenas

Sep 10, 2025

William Cashman, Chasen Milner, Michael Houle, Michael Jones, Hayden Jananthan, Jeremy Kepner, Peter Michaleas, Alex Pentland

Abstract:AI development requires high fidelity testing environments to effectively transition from the laboratory to operations. The flexibility offered by cyber arenas presents a novel opportunity to test new artificial intelligence (AI) capabilities with users. Cyber arenas are designed to expose end-users to real-world situations and must rapidly incorporate evolving capabilities to meet their core objectives. To explore this concept the MIT/IEEE/Amazon Graph Challenge Anonymized Network Sensor was deployed in a cyber arena during a National Guard exercise.

* 2 pages, 1 figure, 7 references, accepted to IEEE HPEC 2025

Via

Access Paper or Ask Questions

Benchmarking network fabrics for data distributed training of deep neural networks

Aug 18, 2020

Siddharth Samsi, Andrew Prout, Michael Jones, Andrew Kirby, Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Vijay Gadepally, Michael Houle(+9 more)

Figure 1 for Benchmarking network fabrics for data distributed training of deep neural networks

Figure 2 for Benchmarking network fabrics for data distributed training of deep neural networks

Figure 3 for Benchmarking network fabrics for data distributed training of deep neural networks

Figure 4 for Benchmarking network fabrics for data distributed training of deep neural networks

Abstract:Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.

* Accepted for publication at IEEE HPEC 2020

Via

Access Paper or Ask Questions