Beam prediction is critical for reducing beam-training overhead in millimeter-wave (mmWave) systems, especially in high-mobility vehicular scenarios. This paper presents a BEV-Fusion-based framework that unifies camera, LiDAR, radar, and GPS modalities in a shared bird's-eye-view (BEV) representation for spatially consistent multi-modal fusion. Unlike prior approaches that fuse globally pooled one-dimensional features, the proposed method performs fusion in BEV space to preserve cross-modal geometric structure and visual semantic density. A learned camera-to-BEV module based on cross-attention generates BEV-aligned visual features without relying on precise camera calibration, and a temporal transformer aggregates five-step sequential observations for motion-aware beam prediction. Experiments on the DeepSense 6G benchmark show that BEV-Fusion achieves approximately 87% distance-based accuracy (DBA) on scenarios 32, 33, and 34, outperforming the TransFuser baseline. These results indicate that BEV-space fusion provides an effective spatial abstraction for sensing-assisted beam prediction.
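The sketch below is a minimal, hypothetical illustration of the two components named in the abstract: a cross-attention camera-to-BEV lifting module and a temporal transformer over a five-step sequence. All module names, dimensions, and hyperparameters are assumptions, and the LiDAR, radar, and GPS branches and the actual BEV-space fusion are omitted; per-frame BEV features are pooled here only to keep the toy example compact.

```python
import torch
import torch.nn as nn

class CameraToBEV(nn.Module):
    """Hypothetical sketch: lift image features to a BEV grid via cross-attention.

    BEV grid cells act as learned queries; flattened image features are keys and
    values, so no explicit camera calibration is used in this toy version.
    """
    def __init__(self, dim=128, bev_h=32, bev_w=32, num_heads=4):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feats):            # img_feats: (B, N_pixels, dim)
        B = img_feats.size(0)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.attn(q, img_feats, img_feats)
        return bev                           # (B, bev_h * bev_w, dim)

class TemporalBeamPredictor(nn.Module):
    """Hypothetical sketch: aggregate 5 time steps of (fused) BEV features,
    then classify over a beam codebook."""
    def __init__(self, dim=128, num_beams=64, seq_len=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.pos = nn.Parameter(torch.randn(seq_len, dim))   # learned temporal positions
        self.head = nn.Linear(dim, num_beams)

    def forward(self, bev_seq):              # bev_seq: (B, T=5, dim)
        h = self.temporal(bev_seq + self.pos)
        return self.head(h[:, -1])           # beam logits from the latest time step

# Toy usage with random tensors (shapes only; not the paper's pipeline).
cam2bev = CameraToBEV()
predictor = TemporalBeamPredictor()
img_feats = torch.randn(2, 900, 128)          # e.g. a flattened 30x30 image feature map
bev = cam2bev(img_feats).mean(dim=1)          # pool per frame for this compact sketch
bev_seq = bev.unsqueeze(1).repeat(1, 5, 1)    # pretend 5 identical time steps
logits = predictor(bev_seq)                   # (2, 64) beam-index logits
```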