Abstract:

Purpose: Volumetric ultrafast ultrasound produces massive datasets with high frame rates, dense reconstruction grids, and large channel counts. The computational demands of beamforming limit research throughput and prevent real-time applications in emerging modalities such as elastography, functional neuroimaging, and microscopy.

Approach: We developed mach, an open-source, GPU-accelerated beamformer with a highly optimized delay-and-sum CUDA kernel and an accessible Python interface. mach uses a hybrid delay-computation strategy that substantially reduces memory overhead compared to fully precomputed approaches. The CUDA implementation optimizes memory layout for coalesced access and reuses delay computations across frames via shared memory. We benchmarked mach on the PyMUST rotating-disk dataset and validated its numerical accuracy against existing open-source beamformers.

Results: mach processes 1.1 trillion points per second on a consumer-grade GPU, more than 10$\times$ faster than existing open-source GPU beamformers. On the PyMUST rotating-disk benchmark, mach completes reconstruction in 0.23~ms, 6$\times$ shorter than the acoustic round-trip time to the imaging depth. Validation against other beamformers confirms numerical accuracy, with errors below $-60$~dB for power Doppler and $-120$~dB for B-mode.

Conclusions: With a throughput of 1.1 trillion points per second, mach enables real-time 3D ultrafast ultrasound reconstruction for the first time on consumer-grade hardware. By eliminating the beamforming bottleneck, mach enables real-time applications such as 3D functional neuroimaging, intraoperative guidance, and ultrasound localization microscopy. mach is freely available at https://github.com/Forest-Neurotech/mach