Yuexian Zou

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Sep 14, 2024

Image Conductor: Precision Control for Interactive Video Synthesis

Jun 21, 2024

On the Worst Prompt Performance of Large Language Models

Jun 08, 2024

Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

May 31, 2024

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Mar 22, 2024

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Mar 14, 2024

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Mar 10, 2024

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Mar 02, 2024

Retrieval is Accurate Generation

Feb 29, 2024

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Jan 30, 2024