Abstract:Tail latency dominates the performance of synchronous pretraining jobs when running at very large scales. We describe a three-pronged approach: (1) a new RDMA-based transport protocol, MRC, sprays across many paths and actively load-balances between them, eliminating the issue of flow collisions (2) the use of multi-plane Clos topologies to get the benefits of high switch radix and redundancy, allowing training clusters well over 100K GPUs to be built as two-tier topologies while increasing physical redundancy, and (3) the use of static source-routing using SRv6 to allow MRC the freedom to bypass failures by itself. We describe our experiences running MRC and static SRv6 routing in production in OpenAI and Microsoft's largest training clusters, where it has been used to train the latest frontier models. We demonstrate how MRC allows AI training jobs to ride out many network failures that previously would have interrupted training.
Abstract:Neural networks have become critical components of reactive systems in various domains within computer science. Despite their excellent performance, using neural networks entails numerous risks that stem from our lack of ability to understand and reason about their behavior. Due to these risks, various formal methods have been proposed for verifying neural networks; but unfortunately, these typically struggle with scalability barriers. Recent attempts have demonstrated that abstraction-refinement approaches could play a significant role in mitigating these limitations; but these approaches can often produce networks that are so abstract, that they become unsuitable for verification. To deal with this issue, we present CEGARETTE, a novel verification mechanism where both the system and the property are abstracted and refined simultaneously. We observe that this approach allows us to produce abstract networks which are both small and sufficiently accurate, allowing for quick verification times while avoiding a large number of refinement steps. For evaluation purposes, we implemented CEGARETTE as an extension to the recently proposed CEGAR-NN framework. Our results are very promising, and demonstrate a significant improvement in performance over multiple benchmarks.




Abstract:With the increasing integration of neural networks as components in mission-critical systems, there is an increasing need to ensure that they satisfy various safety and liveness requirements. In recent years, numerous sound and complete verification methods have been proposed towards that end, but these typically suffer from severe scalability limitations. Recent work has proposed enhancing such verification techniques with abstraction-refinement capabilities, which have been shown to boost scalability: instead of verifying a large and complex network, the verifier constructs and then verifies a much smaller network, whose correctness implies the correctness of the original network. A shortcoming of such a scheme is that if verifying the smaller network fails, the verifier needs to perform a refinement step that increases the size of the network being verified, and then start verifying the new network from scratch -- effectively ``wasting'' its earlier work on verifying the smaller network. In this paper, we present an enhancement to abstraction-based verification of neural networks, by using \emph{residual reasoning}: the process of utilizing information acquired when verifying an abstract network, in order to expedite the verification of a refined network. In essence, the method allows the verifier to store information about parts of the search space in which the refined network is guaranteed to behave correctly, and allows it to focus on areas where bugs might be discovered. We implemented our approach as an extension to the Marabou verifier, and obtained promising results.