fusion. To ensure stable training, we further introduce a \textit{Modality Weight Consistency Loss}, which regularizes the fused representation to remain close to each unimodal embedding in proportion to its assigned weight. Our method is model-agnostic and can be integrated into existing MLLMs such as BLIP-2 and LLaVA. Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance, especially under modality corruption or dropout. This work provides a general and effective mechanism for instance-aware, robustness-enhanced multimodal modeling.
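As a concrete illustration (the abstract does not give the exact formulation), one plausible form of the Modality Weight Consistency Loss, assuming per-modality weights $w_m$ produced by DMS, unimodal embeddings $\mathbf{z}_m$, and a fused representation $\mathbf{z}_{\mathrm{fused}}$, is:
\begin{equation}
\mathcal{L}_{\mathrm{MWC}} \;=\; \sum_{m \in \mathcal{M}} w_m \,\bigl\| \mathbf{z}_{\mathrm{fused}} - \mathbf{z}_m \bigr\|_2^2,
\end{equation}
where $\mathcal{M}$ is the set of input modalities. Under this sketch, the fused representation is pulled more strongly toward the embeddings of modalities with higher assigned weights, which matches the stated goal of weight-proportional regularization.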