Shentong Mo

Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

Aug 05, 2025

DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

Mar 15, 2025

The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

Dec 23, 2024

Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Dec 17, 2024

Continual Audio-Visual Sound Separation

Nov 05, 2024

Aligning Audio-Visual Joint Representations with an Agentic Workflow

Oct 31, 2024

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Oct 25, 2024

Rethinking Positive Pairs in Contrastive Learning

Oct 23, 2024

Multi-scale Multi-instance Visual Sound Localization and Segmentation

Aug 31, 2024

MultiMed: Massively Multimodal and Multitask Medical Understanding

Aug 22, 2024