Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chen Change Loy

Control Color: Multimodal Diffusion-based Interactive Image Colorization

Feb 16, 2024

Zhexin Liang, Zhaochen Li, Shangchen Zhou, Chongyi Li, Chen Change Loy

Figure 1 for Control Color: Multimodal Diffusion-based Interactive Image Colorization

Figure 2 for Control Color: Multimodal Diffusion-based Interactive Image Colorization

Figure 3 for Control Color: Multimodal Diffusion-based Interactive Image Colorization

Figure 4 for Control Color: Multimodal Diffusion-based Interactive Image Colorization

Abstract:Despite the existence of numerous colorization methods, several limitations still exist, such as lack of user interaction, inflexibility in local colorization, unnatural color rendering, insufficient color variation, and color overflow. To solve these issues, we introduce Control Color (CtrlColor), a multi-modal colorization method that leverages the pre-trained Stable Diffusion (SD) model, offering promising capabilities in highly controllable interactive image colorization. While several diffusion-based methods have been proposed, supporting colorization in multiple modalities remains non-trivial. In this study, we aim to tackle both unconditional and conditional image colorization (text prompts, strokes, exemplars) and address color overflow and incorrect color within a unified framework. Specifically, we present an effective way to encode user strokes to enable precise local color manipulation and employ a practical way to constrain the color distribution similar to exemplars. Apart from accepting text prompts as conditions, these designs add versatility to our approach. We also introduce a novel module based on self-attention and a content-guided deformable autoencoder to address the long-standing issues of color overflow and inaccurate coloring. Extensive comparisons show that our model outperforms state-of-the-art image colorization methods both qualitatively and quantitatively.

* Project Page: https://zhexinliang.github.io/Control_Color/; Demo Video: https://youtu.be/tSCwA-srl8Q

Via

Access Paper or Ask Questions

OMG-Seg: Is One Model Good Enough For All Segmentation?

Jan 18, 2024

Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy

Figure 1 for OMG-Seg: Is One Model Good Enough For All Segmentation?

Figure 2 for OMG-Seg: Is One Model Good Enough For All Segmentation?

Figure 3 for OMG-Seg: Is One Model Good Enough For All Segmentation?

Figure 4 for OMG-Seg: Is One Model Good Enough For All Segmentation?

Abstract:In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.

* Project Page: https://lxtgh.github.io/project/omg_seg/

Via

Access Paper or Ask Questions

Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Jan 18, 2024

Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu(+1 more)

Figure 1 for Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Figure 2 for Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Figure 3 for Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Figure 4 for Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Abstract:We introduce a new task -- language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.

* Project Page: https://jianzongwu.github.io/projects/rovi

Via

Access Paper or Ask Questions

Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

Jan 05, 2024

Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy

Abstract:The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into the CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naive baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes.

* Project page: https://www.mmlab-ntu.com/project/ovsam

Via

Access Paper or Ask Questions

Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

Dec 19, 2023

Yushi Lan, Feitong Tan, Di Qiu, Qiangeng Xu, Kyle Genova, Zeng Huang, Sean Fanello, Rohit Pandey, Thomas Funkhouser, Chen Change Loy(+1 more)

Figure 1 for Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

Figure 2 for Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

Figure 3 for Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

Figure 4 for Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

Abstract:We present a novel framework for generating photorealistic 3D human head and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian rather than directly storing color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with fine-grained editing over facial features and expressions. Extensive experiments demonstrate the effectiveness of our method.

* project webpage: https://nirvanalan.github.io/projects/gaussian3diff/

Via

Access Paper or Ask Questions

CLIM: Contrastive Language-Image Mosaic for Region Representation

Dec 19, 2023

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Wentao Liu, Chen Change Loy

Abstract:Detecting objects accurately from a large or open vocabulary necessitates the vision-language alignment on region representations. However, learning such a region-text alignment by obtaining high-quality box annotations with text labels or descriptions is expensive and infeasible. In contrast, collecting image-text pairs is simpler but lacks precise object location information to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which leverages large-scale image-text pairs effectively for aligning region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a `pseudo region'. The feature of each pseudo region is extracted and trained to be similar to the corresponding text embedding while dissimilar from others by a contrastive loss, enabling the model to learn the region-text alignment without costly box annotations. As a generally applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM.

Via

Access Paper or Ask Questions

EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM

Dec 11, 2023

Chong Zhou, Xiangtai Li, Chen Change Loy, Bo Dai

Abstract:This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. EdgeSAM achieves a 40-fold speed increase compared to the original SAM, and it also outperforms MobileSAM, being 14 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3 and 3.2 respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and models are available at https://github.com/chongzhou96/EdgeSAM.

* Project page: https://mmlab-ntu.github.io/project/edgesam/

Via

Access Paper or Ask Questions

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Dec 11, 2023

Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy

Abstract:Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

* Equal contributions from first two authors. Project page: https://shangchenzhou.com/projects/upscale-a-video/

Via

Access Paper or Ask Questions

Digital Life Project: Autonomous 3D Characters with Social Intelligence

Dec 07, 2023

Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin(+13 more)

Figure 1 for Digital Life Project: Autonomous 3D Characters with Social Intelligence

Figure 2 for Digital Life Project: Autonomous 3D Characters with Social Intelligence

Figure 3 for Digital Life Project: Autonomous 3D Characters with Social Intelligence

Figure 4 for Digital Life Project: Autonomous 3D Characters with Social Intelligence

Abstract:In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters, who are capable of engaging in social interactions and expressing with articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars, incorporates a reflection process based on psychology principles, and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching, a proven industry technique to ensure motion quality, with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively, they enable virtual characters to initiate and sustain dialogues autonomously, while evolving their socio-psychological states. Concurrently, these characters can perform contextually relevant bodily movements. Additionally, a motion captioning module further allows the virtual character to recognize and appropriately respond to human players' actions. Homepage: https://digital-life-project.com/

* Homepage: https://digital-life-project.com/

Via

Access Paper or Ask Questions

VideoBooth: Diffusion-based Video Generation with Image Prompts

Dec 01, 2023

Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu

Abstract:Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass.

* Project page: https://vchitect.github.io/VideoBooth-project/

Via

Access Paper or Ask Questions