Yuexian Zou

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Oct 28, 2022

Video Referring Expression Comprehension via Transformer with Content-aware Query

Oct 06, 2022

Correspondence Matters for Video Referring Expression Comprehension

Jul 21, 2022

LocVTP: Video-Text Pre-training for Temporal Localization

Jul 21, 2022

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Jul 20, 2022

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

Jun 05, 2022

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

May 03, 2022

End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

Apr 29, 2022

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Apr 15, 2022

RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

Apr 05, 2022