Siliang Tang

ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval

Aug 19, 2023
Kaihang Pan, Juncheng Li, Hongye Song, Hao Fei, Wei Ji, Shuo Zhang, Jun Lin, Xiaozhong Liu, Siliang Tang

Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different retrieval tasks often entail distinct search intents. To address this challenge, we introduce ControlRetriever, a generic and efficient approach with a parameter-isolated architecture that controls dense retrieval models to directly perform varied retrieval tasks, harnessing the power of instructions that explicitly describe retrieval intents in natural language. Building on ControlNet, which has proven powerful in text-to-image generation, ControlRetriever endows different retrieval models with the new capacity of controllable retrieval, guided by task-specific instructions. Furthermore, we propose a novel LLM-guided Instruction Synthesizing and Iterative Training strategy, which iteratively tunes ControlRetriever on extensive, automatically generated retrieval data paired with diverse instructions, capitalizing on the advances of large language models. Extensive experiments on the BEIR benchmark show that, given only a natural language description of the retrieval intent for each task, ControlRetriever, as a unified multi-task retrieval system without task-specific tuning, significantly outperforms baseline methods built with task-specific retrievers and also achieves state-of-the-art zero-shot performance.
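
A minimal sketch of the ControlNet-style, parameter-isolated conditioning described above, assuming a frozen base retrieval encoder and pooled instruction embeddings; the module names, dimensions, and zero-initialized projection are illustrative assumptions, not the released ControlRetriever code.

```python
import torch
import torch.nn as nn

class InstructionControlledEncoder(nn.Module):
    """Parameter-isolated instruction branch around a frozen dense retriever."""

    def __init__(self, base_encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.base_encoder = base_encoder
        for p in self.base_encoder.parameters():   # the base retriever stays frozen
            p.requires_grad = False
        # Trainable control branch that reads the instruction embedding.
        self.control_branch = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Zero-initialized projection so training starts from the frozen
        # model's behavior (the ControlNet trick).
        self.zero_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, query_emb: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # query_emb, instr_emb: (batch, hidden_dim) pooled embeddings
        base = self.base_encoder(query_emb)
        control = self.zero_proj(self.control_branch(instr_emb))
        return base + control   # instruction-conditioned query representation
```

With the zero-initialized projection, the module initially reproduces the frozen retriever's embeddings, and the instruction branch only gradually learns to steer them toward the intent described in the instruction.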


Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Aug 15, 2023
Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang

The rising demand for lifelike avatars in the digital realm has increased the need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, which synthesizes human motion videos driven by poses and textual cues. Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion, producing frames successively while preserving contextual relevance. The key challenges are maintaining consistency of the human character and clothing across varying poses and upholding the background's continuity amidst diverse human movements. To ensure consistent human appearance across the entire video, we devise an intra-frame alignment module, which assimilates text-guided synthesized human character knowledge into the pretrained T2I diffusion model, synergizing insights from ChatGPT. For preserving background continuity, we put forth a background alignment pipeline that combines insights from Segment Anything and image inpainting techniques. Furthermore, we propose an inter-frame alignment module inspired by an autoregressive pipeline to improve temporal consistency between adjacent frames, where the preceding frame guides the synthesis of the current frame. Comparisons with state-of-the-art methods demonstrate that Dancing Avatar generates human videos of markedly superior quality, in terms of both human and background fidelity as well as temporal coherence.
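
The autoregressive frame loop described above can be summarized as follows; `generate_frame` stands in for a pose- and text-conditioned diffusion call (with the alignment modules applied inside), and its signature is an assumption for illustration only.

```python
from typing import Callable, List, Optional
import torch

def synthesize_video(
    prompt: str,
    poses: List[torch.Tensor],
    generate_frame: Callable[[str, torch.Tensor, Optional[torch.Tensor]], torch.Tensor],
) -> List[torch.Tensor]:
    """Generate frames one by one, letting each previous frame guide the next."""
    frames: List[torch.Tensor] = []
    prev: Optional[torch.Tensor] = None
    for pose in poses:
        # Inter-frame alignment: the preceding frame conditions the current
        # synthesis step; the first frame has no predecessor.
        frame = generate_frame(prompt, pose, prev)
        frames.append(frame)
        prev = frame
    return frames
```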

* 11 pages, 3 figures 

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Aug 10, 2023
Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang

Multimodal Large Language Models (MLLMs) have recently sparked significant interest, demonstrating emergent capabilities to serve as general-purpose models for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which hinders the widespread applicability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential context and cover a diverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG), trained on an image-captioning alignment objective, tends to attend to the common foreground information used for captioning but struggles to extract the specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM. Further, we introduce an annotation-free, cross-attention-guided counterfactual image training strategy that methodically learns the proposed module by coordinating a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetor, a Transformer-based MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4 without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction-tuned models on the MME benchmark.
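
A minimal sketch of the controllable knowledge re-injection idea, assuming it can be approximated by a lightweight cross-attention block whose queries are LLM-derived control vectors and whose keys/values are frozen image features; dimensions and names are illustrative, not the released Cheetor code.

```python
import torch
import torch.nn as nn

class ReinjectionModule(nn.Module):
    """Re-extract instruction-specific visual information and refine the visual prompts."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # start as a no-op so the base MLLM is preserved
        nn.init.zeros_(self.proj.bias)

    def forward(self, control_queries, image_feats, visual_prompts):
        # control_queries: (B, Q, dim) instruction-specific queries derived from the LLM
        # image_feats:     (B, N, dim) patch features from a frozen vision encoder
        # visual_prompts:  (B, Q, dim) original VPG output to be refined
        extracted, _ = self.cross_attn(control_queries, image_feats, image_feats)
        return visual_prompts + self.proj(extracted)   # re-injected into the LLM input
```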


Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion

Aug 08, 2023
Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian

Owing to the unrestricted nature of the content in their training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on the corresponding textual concepts, including specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a widely used method for content removal, frequently fails to conceal such content due to inherent limitations in its inference logic. In this work, we propose a novel strategy named Degeneration-Tuning (DT) to shield unwanted concepts from the SD weights. By utilizing a Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, we guide SD to generate meaningless content when such textual concepts are provided as input. Because this adaptation occurs at the level of the model's weights, the SD model after DT can be grafted onto other conditional diffusion frameworks, such as ControlNet, to shield unwanted concepts. In addition to qualitatively showcasing the effectiveness of our DT method in protecting various types of concepts, a quantitative comparison of SD before and after DT indicates that DT does not significantly impact the generative quality of other content. The FID and IS scores of the model on COCO-30K exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and 38.25, respectively, which clearly outperforms previous methods.
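
A sketch of the scrambled-grid operation at the heart of DT, under the assumption that it amounts to splitting an image into an S x S grid of patches and permuting them so the visual content tied to the unwanted concept becomes meaningless; parameter names are illustrative.

```python
import torch

def scramble_grid(image: torch.Tensor, grid: int = 4, generator=None) -> torch.Tensor:
    """Shuffle an image's grid cells. image: (C, H, W), with H and W divisible by `grid`."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = image.unfold(1, ph, ph).unfold(2, pw, pw)      # (C, grid, grid, ph, pw)
    patches = patches.reshape(c, grid * grid, ph, pw)
    perm = torch.randperm(grid * grid, generator=generator)  # random cell permutation
    patches = patches[:, perm]
    patches = patches.reshape(c, grid, grid, ph, pw)
    return patches.permute(0, 1, 3, 2, 4).reshape(c, h, w)   # reassemble the scrambled image
```

During tuning, the scrambled image would replace the original image paired with the concept's prompt, so the model re-learns to associate that prompt with degenerate content.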

* ACM MM 2023  

MARIO: Model Agnostic Recipe for Improving OOD Generalization of Graph Contrastive Learning

Aug 02, 2023
Yun Zhu, Haizhou Shi, Zhenshuo Zhang, Siliang Tang

In this work, we investigate the problem of out-of-distribution (OOD) generalization for unsupervised learning methods on graph data. This scenario is particularly challenging because graph neural networks (GNNs) have been shown to be sensitive to distributional shifts, even when labels are available. To address this challenge, we propose a Model-Agnostic Recipe for Improving the OOD generalizability of unsupervised graph contrastive learning methods, which we refer to as MARIO. MARIO introduces two principles aimed at developing distributional-shift-robust graph contrastive methods to overcome the limitations of existing frameworks: (i) the Information Bottleneck (IB) principle for achieving generalizable representations and (ii) the Invariant principle, which incorporates adversarial data augmentation to obtain invariant representations. To the best of our knowledge, this is the first work that investigates the OOD generalization problem of graph contrastive learning, with a specific focus on node-level tasks. Through extensive experiments, we demonstrate that our method achieves state-of-the-art performance on the OOD test set while maintaining comparable performance on the in-distribution test set compared to existing approaches. The source code for our method can be found at: https://github.com/ZhuYun97/MARIO
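
A hedged sketch of the two ingredients named above for a node-level contrastive objective: an InfoNCE loss and an FGSM-style adversarial feature perturbation used as an extra view; the GNN encoder signature and the single-step attack are assumptions, not the released MARIO recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive loss with positives on the diagonal of the similarity matrix."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                        # (N, N) node-to-node similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def adversarial_view(encoder, x, edge_index, z_anchor, eps: float = 1e-2):
    """One FGSM-style step on node features to create a harder augmented view."""
    x_adv = x.clone().requires_grad_(True)
    loss = info_nce(encoder(x_adv, edge_index), z_anchor)  # encoder: a GNN, e.g. (x, edge_index) -> z
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).detach()             # ascend the loss to stress invariance
```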

* 21 pages, 15 figures 

Global Structure Knowledge-Guided Relation Extraction Method for Visually-Rich Document

May 23, 2023
Xiangnan Chen, Juncheng Li, Duo Dong, Qian Xiao, Jun Lin, Xiaozhong Liu, Siliang Tang

Visual relation extraction (VRE) aims to extract relations between entities from visually-rich documents. Existing methods usually predict relations for each entity pair independently based on entity features, ignoring global structure information, i.e., dependencies between entity pairs. The absence of global structure information may make the model struggle to learn long-range relations and easily predict conflicting results. To alleviate such limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework, which captures dependencies between entity pairs in an iterative manner. Given a scanned image of a document, GOSE first generates preliminary relation predictions for entity pairs. It then mines global structure knowledge based on the prediction results of the previous iteration and further incorporates this global structure knowledge into the entity representations. This "generate-capture-incorporate" schema is performed multiple times so that entity representations and global structure knowledge can mutually reinforce each other. Extensive experiments show that GOSE not only outperforms previous methods in the standard fine-tuning setting but also shows promising superiority in cross-lingual learning, and even yields stronger data-efficient performance in the low-resource setting.
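
A compact sketch of the iterative "generate-capture-incorporate" loop, assuming pooled entity features as input; the pairwise scorer, the structure summary, and the GRU-based incorporation step are illustrative stand-ins for the components GOSE actually uses.

```python
import torch
import torch.nn as nn

class IterativeRelationExtractor(nn.Module):
    def __init__(self, dim: int = 768, num_relations: int = 2, iterations: int = 3):
        super().__init__()
        self.iterations = iterations
        self.classifier = nn.Bilinear(dim, dim, num_relations)  # pairwise relation scorer
        self.incorporate = nn.GRUCell(num_relations, dim)        # folds structure back into entities

    def forward(self, entities: torch.Tensor) -> torch.Tensor:
        # entities: (num_entities, dim) entity representations
        n, d = entities.shape
        for _ in range(self.iterations):
            heads = entities.unsqueeze(1).expand(n, n, d).reshape(n * n, d)
            tails = entities.unsqueeze(0).expand(n, n, d).reshape(n * n, d)
            logits = self.classifier(heads, tails).view(n, n, -1)  # "generate" pairwise predictions
            structure = logits.softmax(-1).mean(dim=1)             # "capture" a global summary per entity
            entities = self.incorporate(structure, entities)       # "incorporate" into representations
        return logits
```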

* Work in progress 

Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration

May 22, 2023
Qifan Yu, Juncheng Li, Wentao Ye, Siliang Tang, Yueting Zhuang

Recent text-to-image generation models have shown promising results in generating high-fidelity, photo-realistic images. In parallel, the problem of data scarcity has brought growing interest in employing AIGC technology for high-quality data expansion. However, this paradigm requires well-designed prompt engineering, and low-cost data expansion and labeling remain under-explored. Inspired by LLMs' powerful capability in task guidance, we propose a new paradigm of annotated data expansion named ChatGenImage. The core idea is to leverage the complementary strengths of diverse models to establish a highly effective and user-friendly pipeline for interactive data augmentation. In this work, we extensively study how LLMs communicate with AIGC models to achieve more controllable image generation, and make the first attempt to combine them for automatic data augmentation across a variety of downstream tasks. Finally, we present the results obtained from our ChatGenImage framework and demonstrate the powerful potential of our synthetic data for systematic vision adaptation. Our code is available at https://github.com/Yuqifan1117/Labal-Anything-Pipeline.
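
A sketch of the interactive LLM-AIGC collaboration loop described above; the three callables are placeholders for whichever LLM, text-to-image model, and filtering rule are plugged in, and their signatures are assumptions rather than the ChatGenImage API.

```python
from typing import Callable, List, Tuple

def expand_dataset(
    task_description: str,
    propose: Callable[[str, int], List[Tuple[str, str]]],  # LLM: task, n -> [(prompt, label), ...]
    generate: Callable[[str], object],                      # AIGC model: prompt -> image
    accept: Callable[[object, str], bool],                  # filter: (image, label) -> keep?
    rounds: int = 3,
    per_round: int = 8,
) -> List[Tuple[object, str]]:
    """Iteratively expand an annotated dataset by pairing LLM prompts with generated images."""
    dataset: List[Tuple[object, str]] = []
    for _ in range(rounds):
        for prompt, label in propose(task_description, per_round):
            image = generate(prompt)
            if accept(image, label):          # only keep samples that pass the label check
                dataset.append((image, label))
    return dataset
```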

* 11 pages, 6 figures, technical report 

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

May 21, 2023
Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang

We present InstructVid2Vid, an end-to-end diffusion-based method for editing videos with human language instructions. Our approach enables the editing of input videos based on natural language instructions without any per-example fine-tuning or inversion. The proposed InstructVid2Vid model combines a pretrained image generation model, Stable Diffusion, with a conditional 3D U-Net architecture to generate a time-dependent sequence of video frames. To obtain training data, we incorporate the knowledge and expertise of different models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize video-instruction triplets, which is a more cost-efficient alternative to collecting data in real-world scenarios. To improve the consistency between adjacent frames of the generated videos, we propose the Frame Difference Loss, which is incorporated during training. During inference, we extend classifier-free guidance to the text-video input to guide the generated results, making them more faithful to both the input video and the instruction. Experiments demonstrate that InstructVid2Vid is able to generate high-quality, temporally coherent videos and perform diverse edits, including attribute editing, background changes, and style transfer. These results highlight the versatility and effectiveness of our proposed method. Code is released at https://github.com/BrightQin/InstructVid2Vid.
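
The Frame Difference Loss is not spelled out in this listing; one plausible formulation, shown below as an assumption rather than the paper's exact definition, matches the temporal differences of the predicted frames to those of the target frames, which penalizes flicker without forcing adjacent frames to be identical.

```python
import torch
import torch.nn.functional as F

def frame_difference_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, time, channels, height, width) video tensors."""
    pred_diff = pred[:, 1:] - pred[:, :-1]       # temporal differences of generated frames
    target_diff = target[:, 1:] - target[:, :-1] # temporal differences of reference frames
    return F.l1_loss(pred_diff, target_diff)
```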

* 21 pages, 9 figures 