Wi-Fi gesture recognition based on Channel State Information (CSI) is challenged by high-dimensional noise and resource constraints on edge devices. Prevailing end-to-end models tightly couple feature extraction with classification, overlooking the inherent time-frequency sparsity of CSI and leading to redundancy and poor generalization. To address this, this paper proposes a lightweight feature preprocessing module--the Variational Dual-path Attention Network (VDAN). It performs structured feature refinement through frequency-domain filtering and temporal detection. Variational inference is introduced to model the uncertainty in attention weights, thereby enhancing robustness to noise. The design principles of the module are explained from the perspectives of the information bottleneck and regularization. Experiments on a public dataset demonstrate that the learned attention weights align with the physical sparse characteristics of CSI, verifying its interpretability. This work provides an efficient and explainable front-end processing solution for resource-constrained wireless sensing systems.