Using multimodal sensory data can enhance communications systems by reducing the overhead and latency of beam training. However, processing such data incurs high computational complexity, and continuous sensing consumes significant power and bandwidth. This gives rise to a tradeoff between the (multimodal) sensing data acquisition rate and communications performance. In this work, we develop a constrained multimodal sensing-aided communications framework in which dynamic sensing and beamforming are performed under a sensing budget. Specifically, we formulate an optimization problem that maximizes the average received signal-to-noise ratio (SNR) of the user equipment, subject to constraints on the average number of sensing actions and on the power budget. Using the Saleh-Valenzuela mmWave channel model, we construct the channel primarily from position information obtained via multimodal sensing; stricter sensing constraints reduce the availability of position data, degrading channel estimation and thus lowering performance. We apply Lyapunov optimization to the problem and derive a dynamic sensing and beamforming algorithm. Numerical evaluations on the DeepSense and Raymobtime datasets show that halving the number of sensing actions leads to at most a 7.7% loss in average SNR.
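As a rough illustration of the kind of formulation described above (the symbols $T$, $s_t$, $\mathbf{f}_t$, $\bar{S}$, $\bar{P}$ are illustrative placeholders, not notation taken from the paper), the problem can be sketched as a long-term average-SNR maximization under average sensing and power constraints:
\begin{align}
\max_{\{s_t,\,\mathbf{f}_t\}} \quad & \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\!\left[\mathrm{SNR}_t(s_t, \mathbf{f}_t)\right] \\
\text{s.t.} \quad & \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[s_t] \le \bar{S}, \qquad
\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[P_t] \le \bar{P},
\end{align}
where $s_t \in \{0,1\}$ indicates whether multimodal sensing is triggered in slot $t$, $\mathbf{f}_t$ is the beamforming vector, $P_t$ is the consumed power, and $\bar{S}$, $\bar{P}$ are the average sensing and power budgets. In a standard Lyapunov (drift-plus-penalty) treatment, each long-term constraint would be tracked by a virtual queue, and the per-slot sensing and beamforming decisions would trade off instantaneous SNR against queue backlogs; this is a generic sketch, not the paper's exact algorithm.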