Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing: visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge than on the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts that suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE, respectively, with 25.45% fewer activated parameters than dense models.
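The following is a minimal, hedged sketch of an MoE layer with three expert groups in the spirit of the abstract above. All names (AsyMoELayer, HyperbolicExpert), the top-1 router, and the Poincare-ball exponential map used as a stand-in for the hyperbolic inter-modality experts are illustrative assumptions, not the paper's actual formulation.

```python
# Sketch only: an MoE layer whose experts are split into three groups, loosely
# mirroring the AsyMoE description (intra-modality, hyperbolic inter-modality,
# evidence-priority). Routing, expert designs, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """Standard two-layer feed-forward expert."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class HyperbolicExpert(FFNExpert):
    """Maps features to the Poincare ball (exp map at the origin, curvature -1)
    before the FFN -- a simple stand-in for 'hyperbolic inter-modality experts'."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        x_hyp = torch.tanh(norm) * x / norm
        return self.net(x_hyp)


class AsyMoELayer(nn.Module):
    """Routes each token over the three expert groups with top-1 gating (for brevity)."""
    def __init__(self, dim: int = 512, hidden: int = 2048, per_group: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [FFNExpert(dim, hidden) for _ in range(per_group)]           # intra-modality
            + [HyperbolicExpert(dim, hidden) for _ in range(per_group)]  # inter-modality
            + [FFNExpert(dim, hidden) for _ in range(per_group)]         # evidence-priority
        )
        self.router = nn.Linear(dim, len(self.experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.router(x), dim=-1)   # (batch, seq, n_experts)
        top_gate, top_idx = gates.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = AsyMoELayer()
    tokens = torch.randn(2, 16, 512)   # fused visual + text tokens (assumed input)
    print(layer(tokens).shape)         # torch.Size([2, 16, 512])
```

In this sketch, only the group membership distinguishes the experts; the abstract's evidence-priority behavior (suppressing parametric biases) would require an additional mechanism not specified here.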
Abstract: In multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches struggle to adequately capture the intrinsic connections between modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions produced by a large language model. Comprehensive experiments indicate that A-MESS achieves state-of-the-art performance and provides substantial insight into multimodal representation and downstream tasks.
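Below is a minimal, hedged sketch of anchor-based fusion combined with a triplet contrastive objective, in the spirit of the A-MESS abstract. The anchor cross-attention design, the use of nn.TripletMarginLoss, and all names are illustrative assumptions; the actual A-ME module and SS pipeline may be defined differently.

```python
# Sketch only: learnable anchors attend over concatenated modality tokens, and a
# triplet loss pulls the fused embedding toward the embedding of the correct
# intent's LLM-generated label description. Names and design are assumptions.
import torch
import torch.nn as nn


class AnchorFusion(nn.Module):
    """Learnable anchor vectors query concatenated text/audio/video tokens via
    cross-attention; the attended anchors are pooled into one embedding."""
    def __init__(self, dim: int = 256, num_anchors: int = 8, heads: int = 4):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, audio, video):
        tokens = torch.cat([text, audio, video], dim=1)           # (B, T, dim)
        queries = self.anchors.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(queries, tokens, tokens)             # anchors as queries
        return fused.mean(dim=1)                                  # (B, dim)


# Triplet contrastive objective: pull the fused embedding toward the matching
# label-description embedding, push it away from a mismatched one.
triplet = nn.TripletMarginLoss(margin=0.3)

fusion = AnchorFusion()
text, audio, video = (torch.randn(4, 10, 256) for _ in range(3))
anchor_emb = fusion(text, audio, video)     # multimodal embedding (the "anchor" of the triplet)
pos_desc = torch.randn(4, 256)              # placeholder: embedding of the matching label description
neg_desc = torch.randn(4, 256)              # placeholder: embedding of a mismatched description
loss = triplet(anchor_emb, pos_desc, neg_desc)
loss.backward()
print(float(loss))
```

The random positive/negative description embeddings are placeholders; in practice they would come from encoding LLM-generated intent descriptions, as the abstract describes.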