Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hanwang Zhang

Tuning Multi-mode Token-level Prompt Alignment across Modalities

Sep 25, 2023

Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, Hanwang Zhang

Figure 1 for Tuning Multi-mode Token-level Prompt Alignment across Modalities

Figure 2 for Tuning Multi-mode Token-level Prompt Alignment across Modalities

Figure 3 for Tuning Multi-mode Token-level Prompt Alignment across Modalities

Figure 4 for Tuning Multi-mode Token-level Prompt Alignment across Modalities

Abstract:Prompt tuning pre-trained vision-language models have demonstrated significant potential in improving open-world visual concept understanding. However, prior works only primarily focus on single-mode (only one prompt for each modality) and holistic level (image or sentence) semantic alignment, which fails to capture the sample diversity, leading to sub-optimal prompt discovery. To address the limitation, we propose a multi-mode token-level tuning framework that leverages the optimal transportation to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompts discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Thus, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens have the ability to capture diverse visual concepts.

* In Proceedings of NeurIPS2023

Via

Access Paper or Ask Questions

Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation

Sep 22, 2023

Zhongqi Yue, Hanwang Zhang, Qianru Sun

Abstract:Domain Adaptation (DA) is always challenged by the spurious correlation between domain-invariant features (e.g., class identity) and domain-specific features (e.g., environment) that does not generalize to the target domain. Unfortunately, even enriched with additional unsupervised target domains, existing Unsupervised DA (UDA) methods still suffer from it. This is because the source domain supervision only considers the target domain samples as auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the target domain -- where the valuable de-correlation clues hide -- is disregarded. We propose to make the U in UDA matter by giving equal status to the two domains. Specifically, we learn an invariant classifier whose prediction is simultaneously consistent with the labels in the source domain and clusters in the target domain, hence the spurious correlation inconsistent in the target domain is removed. We dub our approach "Invariant CONsistency learning" (ICON). Extensive experiments show that ICON achieves the state-of-the-art performance on the classic UDA benchmarks: Office-Home and VisDA-2017, and outperforms all the conventional methods on the challenging WILDS 2.0 benchmark. Codes are in https://github.com/yue-zhongqi/ICON.

* Accepted by NeurIPS 2023

Via

Access Paper or Ask Questions

Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

Sep 17, 2023

Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Figure 1 for Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

Figure 2 for Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

Figure 3 for Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

Figure 4 for Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

Abstract:Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features on action recognition or temporal object co-occurrences on video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips, which is the first such attempt for a text-video retrieval task, to the best of our knowledge. We first hypothesise and verify the bias on how it would affect the model illustrated with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model overpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric which proves the bias is mitigated, as well as on the other conventional metrics.

* Accepted by the British Machine Vision Conference (BMVC) 2023. Project Page: https://buraksatar.github.io/FrameLengthBias

Via

Access Paper or Ask Questions

Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models

Aug 26, 2023

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua

Abstract:Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to the intricate temporal dynamics modeling, one of the crux of video synthesis. In this work, we investigate strengthening the awareness of video dynamics for DMs, for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed as Dysen) module, which includes (step-1) extracting from input text the key actions with proper time-order arrangement, (step-2) transforming the action schedules into the dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features, integrated into the backbone T2V DM for video generating. Experiments on popular T2V datasets suggest that our framework consistently outperforms prior arts with significant margins, especially in the scenario with complex actions. Project page at https://haofei.vip/Dysen-VDM

Via

Access Paper or Ask Questions

Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

Aug 18, 2023

Xuanyu Yi, Jiajun Deng, Qianru Sun, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang

Figure 1 for Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

Figure 2 for Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

Figure 3 for Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

Figure 4 for Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

Abstract:We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model. Surprisingly, such an ensemble, though seems trivial, has hardly been shown effective in recent 2D-3D models. We find out the crux is the less effective training for the ''joint hard samples'', which have high confidence prediction on different wrong labels, implying that the 2D and 3D models do not collaborate well. To this end, our proposed invariant training strategy, called InvJoint, does not only emphasize the training more on the hard samples, but also seeks the invariance between the conflicting 2D and 3D ambiguous predictions. InvJoint can learn more collaborative 2D and 3D representations for better ensemble. Extensive experiments on 3D shape classification with widely adopted ModelNet10/40, ScanObjectNN and Toys4K, and shape retrieval with ShapeNet-Core validate the superiority of our InvJoint.

Via

Access Paper or Ask Questions

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Aug 10, 2023

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang

Figure 1 for Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Figure 2 for Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Figure 3 for Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Figure 4 for Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Abstract:Multimodal Large Language Models (MLLMs) have recently sparked significant interest, which demonstrates emergent capabilities to serve as a general-purpose model for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which hinders the widespread availability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential context, covering a diverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG) trained on image-captioning alignment objective tends to attend to common foreground information for captioning but struggles to extract specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM. Further, we introduce an annotation-free cross-attention guided counterfactual image training strategy to methodically learn the proposed module by collaborating a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetor, a Transformer-based MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction tuned models on MME benchmark.

Via

Access Paper or Ask Questions

Random Boxes Are Open-world Object Detectors

Jul 17, 2023

Yanghao Wang, Zhongqi Yue, Xian-Sheng Hua, Hanwang Zhang

Figure 1 for Random Boxes Are Open-world Object Detectors

Figure 2 for Random Boxes Are Open-world Object Detectors

Figure 3 for Random Boxes Are Open-world Object Detectors

Figure 4 for Random Boxes Are Open-world Object Detectors

Abstract:We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD): they can not only maintain the accuracy of the known objects (w/ training labels), but also considerably improve the recall of unknown ones (w/o training labels). Specifically, we propose RandBox, a Fast R-CNN based architecture trained on random proposals at each training iteration, surpassing existing Faster R-CNN and Transformer based OWOD. Its effectiveness stems from the following two benefits introduced by randomness. First, as the randomization is independent of the distribution of the limited known objects, the random proposals become the instrumental variable that prevents the training from being confounded by the known objects. Second, the unbiased training encourages more proposal explorations by using our proposed matching score that does not penalize the random proposals whose prediction scores do not match the known objects. On two benchmarks: Pascal-VOC/MS-COCO and LVIS, RandBox significantly outperforms the previous state-of-the-art in all metrics. We also detail the ablations on randomization and loss designs. Codes are available at https://github.com/scuwyh2000/RandBox.

* ICCV 2023

Via

Access Paper or Ask Questions

DisCo: Disentangled Control for Referring Human Dance Generation in Real World

Jun 30, 2023

Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

Abstract:Generative AI has made significant strides in computer vision, particularly in image/video synthesis conditioned on text descriptions. Despite the advancements, it remains challenging especially in the generation of human-centric content such as dance synthesis. Existing dance synthesis methods struggle with the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting: Referring Human Dance Generation, which focuses on real-world dance scenarios with three important properties: (i) Faithfulness: the synthesis should retain the appearance of both human subject foreground and background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: it should allow for composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce a novel approach, DISCO, which includes a novel model architecture with disentangled control to improve the faithfulness and compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code, demo, video and visualization are available at: https://disco-dance.github.io/.

* Project Page: https://disco-dance.github.io/; Github Page: https://github.com/Wangt-CN/DisCo

Via

Access Paper or Ask Questions

Fast Diffusion Model

Jun 12, 2023

Zike Wu, Pan Zhou, Kenji Kawaguchi, Hanwang Zhang

Abstract:Despite their success in real data synthesis, diffusion models (DMs) often suffer from slow and costly training and sampling issues, limiting their broader applications. To mitigate this, we propose a Fast Diffusion Model (FDM) which improves the diffusion process of DMs from a stochastic optimization perspective to speed up both training and sampling. Specifically, we first find that the diffusion process of DMs accords with the stochastic optimization process of stochastic gradient descent (SGD) on a stochastic time-variant problem. Note that momentum SGD uses both the current gradient and an extra momentum, achieving more stable and faster convergence. We are inspired to introduce momentum into the diffusion process to accelerate both training and sampling. However, this comes with the challenge of deriving the noise perturbation kernel from the momentum-based diffusion process. To this end, we frame the momentum-based process as a Damped Oscillation system whose critically damped state -- the kernel solution -- avoids oscillation and thus has a faster convergence speed of the diffusion process. Empirical results show that our FDM can be applied to several popular DM frameworks, e.g. VP, VE, and EDM, and reduces their training cost by about 50% with comparable image synthesis performance on CIFAR-10, FFHQ, and AFHQv2 datasets. Moreover, FDM decreases their sampling steps by about $3\times$ to achieve similar performance under the same deterministic samplers. The code is available at https://github.com/sail-sg/FDM.

Via

Access Paper or Ask Questions

An Overview of Challenges in Egocentric Text-Video Retrieval

Jun 07, 2023

Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Figure 1 for An Overview of Challenges in Egocentric Text-Video Retrieval

Figure 2 for An Overview of Challenges in Egocentric Text-Video Retrieval

Figure 3 for An Overview of Challenges in Egocentric Text-Video Retrieval

Figure 4 for An Overview of Challenges in Egocentric Text-Video Retrieval

Abstract:Text-video retrieval contains various challenges, including biases coming from diverse sources. We highlight some of them supported by illustrations to open a discussion. Besides, we address one of the biases, frame length bias, with a simple method which brings a very incremental but promising increase. We conclude with future directions.

* 4 pages, CVPR 2023 Joint Ego4D&EPIC Workshop, Extended Abstract

Via

Access Paper or Ask Questions