Jiaya Jia

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Nov 28, 2023
Yanwei Li, Chengyao Wang, Jiaya Jia

In this work, we present LLaMA-VID, a novel method that tackles the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates the visual cues in each frame. This dual-token strategy significantly reduces the overhead of long videos while preserving critical information. In general, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is shown to surpass previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID
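A minimal sketch of the dual-token idea, assuming per-frame patch features and an embedded user instruction; the attention and pooling choices below are illustrative, not the authors' exact design:

```python
import torch
import torch.nn.functional as F

def frame_to_two_tokens(frame_feats, text_query):
    """frame_feats: (N_patches, C) visual features of one frame.
       text_query:  (L, C) embedded user instruction.
       Returns one context token and one content token, each of shape (C,)."""
    # Context token: instruction-conditioned aggregation of the frame features.
    attn = F.softmax(text_query @ frame_feats.T / frame_feats.shape[-1] ** 0.5, dim=-1)  # (L, N)
    context_token = (attn @ frame_feats).mean(dim=0)   # (C,)
    # Content token: a coarse summary of the frame itself (mean pooling here).
    content_token = frame_feats.mean(dim=0)            # (C,)
    return context_token, content_token

# Two tokens per frame: an hour-long video at 1 fps yields roughly 7,200 LLM tokens.
frames = [torch.randn(256, 4096) for _ in range(8)]    # toy features for 8 frames
query = torch.randn(16, 4096)                          # toy instruction embedding
tokens = [t for f in frames for t in frame_to_two_tokens(f, query)]
print(len(tokens))  # 16
```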

* Code is available at https://github.com/dvlab-research/LLaMA-VID 

LLMGA: Multimodal Large Language Model based Generation Assistant

Nov 27, 2023
Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia

In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and visual question answering. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive results show that LLMGA has promising generative capabilities and can enable wider applications in an interactive manner.
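A hedged sketch of the control flow only: a stage-1 MLLM rewrites a terse user request into a detailed generation prompt, which then conditions Stable Diffusion in stage 2. The `refine_prompt` function is a hypothetical stand-in for the trained MLLM, and the diffusers call is ordinary library usage, not the authors' pipeline:

```python
from diffusers import StableDiffusionPipeline

def refine_prompt(user_request: str) -> str:
    # Hypothetical placeholder for the stage-1 MLLM that rewrites a short
    # request into a detailed, low-noise generation prompt.
    return (user_request + ", highly detailed, natural lighting, "
            "coherent composition, photorealistic textures")

# Stage 2: a Stable Diffusion model conditioned on the detailed prompt.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
detailed_prompt = refine_prompt("a cabin by a lake at dawn")
image = pipe(detailed_prompt).images[0]
image.save("llmga_sketch.png")
```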


Lightweight In-Context Tuning for Multimodal Unified Models

Oct 08, 2023
Yixin Chen, Shuai Zhang, Boran Han, Jiaya Jia

In-context learning (ICL) involves reasoning from given contextual examples. As more modalities come into play, this procedure becomes more challenging, since interleaved input modalities complicate the understanding process. This is exemplified by the observation that multimodal models often struggle to effectively extrapolate from contextual examples to perform ICL. To address these challenges, we introduce MultiModal In-conteXt Tuning (M^2IXT), a lightweight module to enhance the ICL capabilities of multimodal unified models. The proposed M^2IXT module uses an expandable context window to incorporate various labeled examples of multiple modalities (e.g., text, image, and coordinates). It can be prepended to various multimodal unified models (e.g., OFA, Unival, LLaVA) of different architectures and trained via a mixed-tasks strategy to enable rapid few-shot adaptation on multiple tasks and datasets. When tuned on as few as 50K multimodal examples, M^2IXT boosts few-shot ICL performance significantly (e.g., an 18% relative increase for OFA) and obtains state-of-the-art results across an array of tasks including visual question answering, image captioning, visual grounding, and visual entailment, while remaining considerably smaller in model parameters (e.g., ~20x smaller than Flamingo or MMICL), highlighting the flexibility and effectiveness of M^2IXT as a multimodal in-context learner.
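A minimal sketch of prepending an expandable window of labeled multimodal examples to a unified model's input sequence; the projection layers, feature dimensions, and two-token-per-shot layout are assumptions for illustration, not the M^2IXT implementation:

```python
import torch
import torch.nn as nn

class InContextPrefix(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.img_proj = nn.Linear(1024, dim)   # project visual features of each shot
        self.txt_proj = nn.Linear(512, dim)    # project text/label features of each shot

    def forward(self, shot_imgs, shot_txts, query_tokens):
        """shot_imgs: (K, 1024), shot_txts: (K, 512), query_tokens: (L, dim).
           Returns the prefixed sequence fed to the (frozen) unified model."""
        prefix = torch.cat([self.img_proj(shot_imgs), self.txt_proj(shot_txts)], dim=0)
        # The context window grows with the number of shots K.
        return torch.cat([prefix, query_tokens], dim=0)

prefix_module = InContextPrefix()
seq = prefix_module(torch.randn(3, 1024), torch.randn(3, 512), torch.randn(20, 768))
print(seq.shape)  # torch.Size([26, 768]) -> 3 shots * 2 tokens + 20 query tokens
```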

* Preprint 

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Sep 21, 2023
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 incurs 16x the self-attention computation of training on a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization layers. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. It extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
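The paper states the shift can be added with two lines of training code; the fuller sketch below, assuming a fused qkv tensor and PyTorch 2.x, illustrates the group-wise attention with half of the heads shifted by half a group, not the released implementation:

```python
import torch

def shifted_short_attention(qkv, num_groups):
    """qkv: (batch, seq_len, 3, num_heads, head_dim)."""
    b, n, three, h, d = qkv.shape
    group = n // num_groups
    # Shift half of the heads by half a group so information flows across group borders.
    qkv[:, :, :, h // 2:] = qkv[:, :, :, h // 2:].roll(-group // 2, dims=1)
    # Reshape so attention is computed only within each local group.
    qkv = qkv.reshape(b * num_groups, group, three, h, d)
    q, k, v = qkv.unbind(dim=2)                                  # each (b*g, group, h, d)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))             # each (b*g, h, group, d)
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    out = out.transpose(1, 2).reshape(b, n, h, d)
    # Undo the shift on the shifted heads.
    out[:, :, h // 2:] = out[:, :, h // 2:].roll(group // 2, dims=1)
    return out

x = torch.randn(2, 8192, 3, 32, 128)
print(shifted_short_attention(x, num_groups=4).shape)  # torch.Size([2, 8192, 32, 128])
```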

* Code, models, dataset, and demo are available at https://github.com/dvlab-research/LongLoRA 

Mask-Attention-Free Transformer for 3D Instance Segmentation

Sep 04, 2023
Xin Lai, Yuhui Yuan, Ruihang Chu, Yukang Chen, Han Hu, Jiaya Jia

Recently, transformer-based methods have dominated 3D instance segmentation, where mask attention is commonly involved. Specifically, object queries are guided by the initial instance masks in the first cross-attention, and then iteratively refine themselves in a similar manner. However, we observe that the mask-attention pipeline usually leads to slow convergence due to low-recall initial instance masks. Therefore, we abandon the mask attention design and resort to an auxiliary center regression task instead. Through center regression, we effectively overcome the low-recall issue and perform cross-attention by imposing a positional prior. To reach this goal, we develop a series of position-aware designs. First, we learn a spatial distribution of 3D locations as the initial position queries. They spread densely over the 3D space and thus can easily capture the objects in a scene with high recall. Moreover, we present relative position encoding for the cross-attention and iterative refinement for more accurate position queries. Experiments show that our approach converges 4x faster than existing work, sets a new state of the art on the ScanNetv2 3D instance segmentation benchmark, and also demonstrates superior performance across various datasets. Code and models are available at https://github.com/dvlab-research/Mask-Attention-Free-Transformer.
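A minimal sketch of cross-attention guided by learned 3D position queries with a distance-based positional prior; the paper's relative position encoding and query initialization are richer than this, so treat the bias term and shapes as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionGuidedCrossAttention(nn.Module):
    def __init__(self, num_queries=128, dim=256):
        super().__init__()
        self.query_pos = nn.Parameter(torch.rand(num_queries, 3))    # learned 3D locations
        self.query_feat = nn.Parameter(torch.randn(num_queries, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, point_feats, point_xyz):
        """point_feats: (N, dim) scene-point features, point_xyz: (N, 3) coordinates."""
        q = self.to_q(self.query_feat)                                # (Q, dim)
        k, v = self.to_kv(point_feats).chunk(2, dim=-1)               # (N, dim) each
        # Positional prior: queries attend more to points near their 3D location.
        dist = torch.cdist(self.query_pos, point_xyz)                 # (Q, N)
        logits = q @ k.T / q.shape[-1] ** 0.5 - dist
        return F.softmax(logits, dim=-1) @ v                          # (Q, dim)

layer = PositionGuidedCrossAttention()
out = layer(torch.randn(5000, 256), torch.rand(5000, 3))
print(out.shape)  # torch.Size([128, 256])
```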

* Accepted to ICCV 2023. Code and models are available at https://github.com/dvlab-research/Mask-Attention-Free-Transformer 

FocalFormer3D: Focusing on Hard Instance for 3D Object Detection

Aug 08, 2023
Yilun Chen, Zhiding Yu, Yukang Chen, Shiyi Lan, Animashree Anandkumar, Jiaya Jia, Jose Alvarez

False negatives (FN) in 3D object detection, e.g., missing predictions of pedestrians, vehicles, or other obstacles, can lead to potentially dangerous situations in autonomous driving. Despite being potentially fatal, this issue is understudied in many current 3D detection methods. In this work, we propose Hard Instance Probing (HIP), a general pipeline that identifies FN in a multi-stage manner and guides the models to focus on excavating difficult instances. For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects and improving prediction recall. FocalFormer3D features multi-stage query generation to discover hard objects and a box-level transformer decoder to efficiently distinguish objects from massive object candidates. Experimental results on the nuScenes and Waymo datasets validate the superior performance of FocalFormer3D. This advantage leads to strong performance on both detection and tracking, in both LiDAR and multi-modal settings. Notably, FocalFormer3D achieves 70.5 mAP and 73.9 NDS on the nuScenes detection benchmark and 72.1 AMOTA on the nuScenes tracking benchmark, both ranking 1st place on the nuScenes LiDAR leaderboard. Our code is available at https://github.com/NVlabs/FocalFormer3D.
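A hedged sketch of the multi-stage probing idea: each stage masks out bird's-eye-view locations already claimed by earlier predictions before picking new top-k candidates, so later stages concentrate on missed (hard) instances. Heatmap shapes and the masking rule are illustrative assumptions, not the FocalFormer3D implementation:

```python
import torch

def hard_instance_probing(heatmap, num_stages=3, k=50, radius=2):
    """heatmap: (H, W) class-agnostic objectness scores in bird's-eye view."""
    H, W = heatmap.shape
    claimed = torch.zeros(H, W, dtype=torch.bool)
    candidates = []
    for stage in range(num_stages):
        scores = heatmap.masked_fill(claimed, float('-inf'))
        flat_idx = scores.flatten().topk(k).indices
        ys, xs = flat_idx // W, flat_idx % W
        candidates.append(torch.stack([ys, xs], dim=-1))
        # Mark a small neighbourhood around each pick as claimed so the next
        # stage is forced toward harder, previously missed instances.
        for y, x in zip(ys.tolist(), xs.tolist()):
            claimed[max(0, y - radius):y + radius + 1,
                    max(0, x - radius):x + radius + 1] = True
    return torch.cat(candidates)        # (num_stages * k, 2) candidate query positions

queries = hard_instance_probing(torch.rand(180, 180))
print(queries.shape)  # torch.Size([150, 2])
```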

* Accepted by ICCV 2023 

LISA: Reasoning Segmentation via Large Language Model

Aug 03, 2023
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia

Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction to identify the target objects or categories before executing visual recognition tasks. Such systems lack the ability to actively reason and comprehend implicit user intentions. In this work, we propose a new segmentation task: reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving 1) complex reasoning, 2) world knowledge, 3) explanatory answers, and 4) multi-turn conversation. It also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement. Experiments show our method not only unlocks new reasoning segmentation capabilities but also proves effective in both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and demo are at https://github.com/dvlab-research/LISA.
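A minimal sketch of the embedding-as-mask idea: the LLM hidden state at the <SEG> token is projected and correlated with dense visual features to produce a mask. The feature dimensions and the simple dot-product decoder are stand-in assumptions; in practice LISA pairs the LLM with a promptable mask decoder:

```python
import torch
import torch.nn as nn

class EmbeddingAsMask(nn.Module):
    def __init__(self, llm_dim=4096, vis_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, vis_dim)   # map the <SEG> embedding into visual space

    def forward(self, hidden_states, seg_token_index, pixel_feats):
        """hidden_states: (L, llm_dim) LLM outputs, pixel_feats: (vis_dim, H, W)."""
        seg_embed = self.proj(hidden_states[seg_token_index])          # (vis_dim,)
        mask_logits = torch.einsum('c,chw->hw', seg_embed, pixel_feats)
        return mask_logits.sigmoid()                                    # (H, W) soft mask

model = EmbeddingAsMask()
mask = model(torch.randn(32, 4096), seg_token_index=17, pixel_feats=torch.randn(256, 64, 64))
print(mask.shape)  # torch.Size([64, 64])
```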

* Code, models, and demo are available at https://github.com/dvlab-research/LISA 

DiffComplete: Diffusion-based Generative 3D Shape Completion

Jun 28, 2023
Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, Jiaya Jia

We introduce a new diffusion-based approach for shape completion on 3D range scans. Compared with prior deterministic and probabilistic methods, we strike a balance between realism, multi-modality, and high fidelity. We propose DiffComplete, which casts shape completion as a generative task conditioned on the incomplete shape. Our key designs are two-fold. First, we devise a hierarchical feature aggregation mechanism to inject conditional features in a spatially consistent manner, so we can capture both local details and broader contexts of the conditional inputs to control the shape completion. Second, we propose an occupancy-aware fusion strategy in our model to enable the completion of multiple partial shapes and introduce higher flexibility on the input conditions. DiffComplete sets a new state-of-the-art performance (e.g., a 40% decrease in l_1 error) on two large-scale 3D shape completion benchmarks. Our completed shapes not only look more realistic than those of deterministic methods but also match the ground truths more closely than probabilistic alternatives. Further, DiffComplete generalizes strongly to objects of entirely unseen classes in both synthetic and real data, eliminating the need for model re-training in various applications.
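A hedged sketch of occupancy-aware fusion over several partial scans, where condition features are averaged only at voxels each scan actually observed; the shapes and the fusion rule are illustrative assumptions, not the DiffComplete architecture:

```python
import torch

def occupancy_aware_fusion(partial_feats, occupancies, eps=1e-6):
    """partial_feats: (K, C, D, H, W) features extracted from K partial inputs.
       occupancies:   (K, 1, D, H, W) binary masks of voxels each scan observed."""
    # Sum features only where geometry was observed, then normalize by how many
    # scans observed each voxel, so unobserved regions do not dilute the condition.
    weighted = (partial_feats * occupancies).sum(dim=0)
    counts = occupancies.sum(dim=0).clamp_min(eps)
    return weighted / counts                 # (C, D, H, W) fused conditioning features

feats = torch.randn(3, 16, 32, 32, 32)
occ = (torch.rand(3, 1, 32, 32, 32) > 0.5).float()
print(occupancy_aware_fusion(feats, occ).shape)  # torch.Size([16, 32, 32, 32])
```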

* Project Page: https://ruihangchu.com/diffcomplete.html 

Hierarchical Dense Correlation Distillation for Few-Shot Segmentation - Extended Abstract

Jun 27, 2023
Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chengyao Wang, Shu Liu, Jingyong Su, Jiaya Jia

Few-shot semantic segmentation (FSS) aims to form class-agnostic models that segment unseen classes with only a handful of annotations. Previous methods, limited to semantic features and prototype representations, suffer from coarse segmentation granularity and train-set overfitting. In this work, we design the Hierarchically Decoupled Matching Network (HDMNet), which mines pixel-level support correlation based on the transformer architecture. Self-attention modules are used to assist in establishing hierarchical dense features, as a means to accomplish the cascade matching between query and support features. Moreover, we propose a matching module to reduce train-set overfitting and introduce correlation distillation that leverages semantic correspondence from coarse resolution to boost fine-grained segmentation. Our method performs decently in experiments, achieving 50.0% mIoU on the COCO dataset in the one-shot setting and 56.0% in the five-shot setting. The code will be available on the project website. We hope our work can benefit broader industrial applications where novel classes with limited annotations need to be reliably identified.
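A minimal sketch of correlation distillation across resolutions: pixel-level query-support correlations are computed at a coarse and a fine level, and the fine one is trained to agree with the upsampled coarse one. The shapes and the KL-based loss are assumptions, not the HDMNet training recipe:

```python
import torch
import torch.nn.functional as F

def correlation(query, support):
    """query: (C, H, W), support: (C, H, W) -> (H*W, H*W) pixel-level correlation."""
    q = F.normalize(query.flatten(1).T, dim=-1)     # (HW, C)
    s = F.normalize(support.flatten(1).T, dim=-1)   # (HW, C)
    return q @ s.T

coarse = correlation(torch.randn(256, 15, 15), torch.randn(256, 15, 15))   # (225, 225)
fine = correlation(torch.randn(256, 30, 30), torch.randn(256, 30, 30))     # (900, 900)

# Upsample the coarse correlation to the fine resolution and use it as a soft target.
target = F.interpolate(coarse.reshape(1, 1, 225, 225), size=(900, 900), mode='bilinear')
loss = F.kl_div(F.log_softmax(fine, dim=-1),
                F.softmax(target.reshape(900, 900), dim=-1), reduction='batchmean')
print(loss.item())
```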

* Accepted to CVPR 2023 VISION Workshop, Oral. The extended abstract of Hierarchical Dense Correlation Distillation for Few-Shot Segmentation. arXiv admin note: substantial text overlap with arXiv:2303.14652 