Abstract: Automatic Modulation Recognition (AMR) is a key enabling technology for cognitive radio and intelligent spectrum management in next-generation wireless systems. However, current deep learning-based AMR methods predominantly rely on static multi-scale fusion strategies, which cannot adapt to the highly dynamic temporal variations of modulation signals. To address this limitation, we propose MoEformer, an adaptive Multi-Scale Mixture-of-Experts Transformer network that directly processes I/Q signals to preserve their temporal and phase structures. Specifically, MoEformer constructs multi-scale expert views through temporal resampling, employs an input-dependent gating mechanism for dynamic expert fusion, and integrates Rotary Position Embeddings (RoPE) within Transformer encoders to capture both local and global temporal dependencies. Comprehensive evaluations on three widely adopted benchmarks (RadioML2016.10a, RadioML2016.10b, and RadioML2018.01A) demonstrate that MoEformer outperforms competitive baselines, achieving average recognition accuracies of 63.74%, 66.24%, and 64.22%, respectively. In addition, the proposed method strikes a favorable trade-off between recognition performance and model complexity.
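The three ingredients named in this abstract (multi-scale views via temporal resampling, input-dependent gating, and RoPE) can be illustrated with a minimal NumPy sketch. This is not the MoEformer implementation: the average-pooling resampler, the scale factors `(1, 2, 4)`, the mean/std gate summary, the gate matrix `w_gate`, and the standard-deviation stand-in for each expert are all assumptions made for illustration, and the function names are hypothetical. The `rope` function uses the common "rotate-half" formulation with base 10000.

```python
import numpy as np

def multi_scale_views(iq, factors=(1, 2, 4)):
    """One view per scale, built by average-pooling along time (assumed resampler)."""
    views = []
    for f in factors:
        t = (iq.shape[-1] // f) * f  # drop trailing samples that do not fill a window
        views.append(iq[:, :t].reshape(iq.shape[0], t // f, f).mean(axis=-1))
    return views

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def gate_weights(iq, w_gate):
    """Input-dependent gate: per-channel mean and std -> softmax over experts."""
    summary = np.concatenate([iq.mean(axis=-1), iq.std(axis=-1)])  # shape (4,)
    return softmax(w_gate @ summary)

def rope(x):
    """Rotary position embedding for x of shape (seq_len, dim), with dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1[i], x2[i]) pair by a position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def fuse(iq, w_gate, factors=(1, 2, 4)):
    """Gate-weighted sum of per-scale features."""
    views = multi_scale_views(iq, factors)
    weights = gate_weights(iq, w_gate)
    # Stand-in expert: per-channel std of each view. A real expert would be
    # a Transformer encoder operating on RoPE-encoded tokens.
    feats = np.stack([v.std(axis=-1) for v in views])  # (num_experts, 2)
    return weights @ feats  # (2,)
```

Here `iq` is assumed to be an array of shape `(2, T)` holding the I and Q channels, and `w_gate` an array of shape `(num_experts, 4)`. Because the gate weights depend on the input signal, different signals mix the scales differently, which is the contrast with a static fusion strategy that the abstract draws.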
Abstract: The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the potential of MLLMs in the EM domain requires overcoming three main challenges: (1) Data. High-quality datasets of paired EM signals and descriptive text annotations for MLLM pre-training are scarce; (2) Benchmark. No comprehensive benchmark exists to systematically evaluate and compare models on EM signal-to-text tasks; (3) Model. Models are fragile in low Signal-to-Noise Ratio (SNR) environments, where key signal features can be obscured, leading to significant performance degradation. To address these challenges, we make three contributions that establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, a comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN achieves state-of-the-art performance on EM-Bench and exhibits remarkable robustness in low-SNR settings.