Abstract:Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the "bandwidth paradox" (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for >50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.




Abstract:Previous work has shown that the addition of haptic feedback to the hands can improve awareness of tool-tissue interactions and enhance performance of teleoperated tasks in robot-assisted minimally invasive surgery. However, hand-based haptic feedback occludes direct interaction with the manipulanda of surgeon console in teleoperated surgical robots. We propose relocating haptic feedback to the wrist using a wearable haptic device so that haptic feedback mechanisms do not need to be integrated into the manipulanda. However, it is unknown if such feedback will be effective, given that it is not co-located with the finger movements used for manipulation. To test if relocated haptic feedback improves force application during teleoperated tasks using da Vinci Research Kit (dVRK) surgical robot, participants learned to palpate a phantom tissue to desired forces. A soft pneumatic wrist-worn haptic device with an anchoring system renders tool-tissue interaction forces to the wrist of the user. Participants performed the palpation task with and without wrist-worn haptic feedback and were evaluated for the accuracy of applied forces. Participants demonstrated statistically significant lower force error when wrist-worn haptic feedback was provided. Participants also performed the palpation task with longer movement times when provided wrist-worn haptic feedback, indicating that the haptic feedback may have caused participants to operate at a different point in the speed-accuracy tradeoff curve.