Timely and robust influenza incidence forecasting is critical for public health decision-making. To address this need, we present MAESTRO, a Multi-modal Adaptive Ensemble for Spectro-Temporal Robust Optimization. MAESTRO achieves robustness by adaptively fusing multi-modal inputs (surveillance, web search trends, and meteorological data) within a comprehensive spectro-temporal architecture. The model first decomposes each time series into seasonal and trend components. These components are then processed by a hybrid feature enhancement pipeline that combines Transformer-based encoders, a Mamba state-space model for long-range dependencies, multi-scale temporal convolutions, and a frequency-domain analysis module. A cross-channel attention mechanism further integrates information across the data modalities. Finally, a temporal projection head performs sequence-to-sequence forecasting, with an optional estimator to quantify prediction uncertainty. Evaluated on over 11 years of Hong Kong influenza data (excluding the COVID-19 period), MAESTRO performs competitively, showing superior model fit and relative accuracy and achieving a state-of-the-art R² of 0.956. Extensive ablations confirm that both the multi-modal fusion and the spectro-temporal components contribute significantly. Our modular and reproducible pipeline is publicly available to facilitate deployment and extension to other regions and pathogens, and it offers a unified framework that demonstrates the synergy of spectro-temporal modeling and multi-modal data fusion for robust epidemiological forecasting.
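To make the first pipeline step and the reported fit metric concrete, the following is a minimal sketch of a seasonal-trend decomposition followed by an R² computation. The abstract does not state how MAESTRO performs the decomposition, so the centered moving average used here is an assumption, as are the function names `decompose` and `r_squared`, the window length of 25, and the synthetic weekly series. It is an illustration only, not the authors' implementation.

```python
import numpy as np


def decompose(series, window=25):
    """Split a 1-D series into (seasonal, trend) parts.

    Assumption: the trend is a centered moving average and the seasonal
    part is the residual. The window length is a hypothetical choice.
    """
    series = np.asarray(series, dtype=float)
    pad_left = (window - 1) // 2
    pad_right = window - 1 - pad_left
    # Repeat the edge values so the smoothed output keeps the input length.
    padded = np.concatenate([
        np.full(pad_left, series[0]),
        series,
        np.full(pad_right, series[-1]),
    ])
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")
    seasonal = series - trend
    return seasonal, trend


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic weekly series (hypothetical data): yearly cycle plus slow drift.
weeks = np.arange(520)
incidence = 10 + 0.01 * weeks + 3 * np.sin(2 * np.pi * weeks / 52)

seasonal, trend = decompose(incidence, window=25)
reconstructed = seasonal + trend
print(r_squared(incidence, reconstructed))
```

Because the seasonal part is defined as the residual of the trend, the two components always sum back to the original series. In a full model, each component would be passed to the downstream encoders separately instead of being recombined immediately.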