Hanwang Zhang

Few-shot NeRF by Adaptive Rendering Loss Regularization

Oct 23, 2024

Qingshan Xu, Xuanyu Yi, Jianyao Xu, Wenbing Tao, Yew-Soon Ong, Hanwang Zhang

Abstract:Novel view synthesis with sparse inputs poses great challenges to Neural Radiance Field (NeRF). Recent works demonstrate that the frequency regularization of Positional Encoding (PE) can achieve promising results for few-shot NeRF. In this work, we reveal that there exists an inconsistency between the frequency regularization of PE and rendering loss. This prevents few-shot NeRF from synthesizing higher-quality novel views. To mitigate this inconsistency, we propose Adaptive Rendering loss regularization for few-shot NeRF, dubbed AR-NeRF. Specifically, we present a two-phase rendering supervision and an adaptive rendering loss weight learning strategy to align the frequency relationship between PE and 2D-pixel supervision. In this way, AR-NeRF can learn global structures better in the early training phase and adaptively learn local details throughout the training process. Extensive experiments show that our AR-NeRF achieves state-of-the-art performance on different datasets, including object-level and complex scenes.

* Accepted by ECCV2024

Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration

Sep 30, 2024

Kaihang Pan, Zhaoyu Fan, Juncheng Li, Qifan Yu, Hao Fei, Siliang Tang, Richang Hong, Hanwang Zhang, Qianru Sun

Abstract:The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic knowledge editing and external knowledge resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting. Both types of knowledge are conceptualized as vectorized key-value memories, with the corresponding editing processes resembling the assimilation and accommodation phases of human cognition, conducted at the same semantic levels. Within such a unified framework, we further promote knowledge collaboration by disentangling the knowledge representations into the semantic and truthfulness spaces. Extensive experiments validate the effectiveness of our method, which ensures that the post-edit MLLM simultaneously maintains excellent reliability, generality, and locality. The code for UniKE will be available at https://github.com/beepkh/UniKE.

* Accepted by NeurIPS 2024 (Spotlight)

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

Aug 09, 2024

Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, Hanwang Zhang

Abstract:As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.

* Accepted by ECCV2024 (20 pages)

Selective Vision-Language Subspace Projection for Few-shot CLIP

Jul 26, 2024

Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang

Abstract:Vision-language models such as CLIP are capable of mapping data from different modalities into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts. However, most existing methods overlook modality gaps in CLIP's encoded features, which manifest as text and image features lying far apart from each other, resulting in limited classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors utilize local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into their respective subspaces to achieve alignment. Moreover, our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks. Extensive experiments on 11 datasets have demonstrated SSP's superior text-image alignment capabilities, outperforming the state-of-the-art alignment methods. The code is available at https://github.com/zhuhsingyuu/SSP

* Accepted as an Oral Paper at ACM Multimedia 2024

Visual Prompt Selection for In-Context Learning Segmentation

Jul 14, 2024

Wei Suo, Lanqing Lai, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang

Abstract:As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. By comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on the above insights, we propose a new stepwise context search method. Different from previous works, we construct a small yet rich candidate pool and adaptively search the well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance.

* Accepted by ECCV2024

ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Jun 16, 2024

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao

Abstract:With the advance of diffusion models, today's video generation has achieved impressive quality. But generating temporally consistent long videos is still challenging. A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip. However, existing approaches all involve bidirectional computations, which restricts the receptive context of each autoregression step and leaves the model lacking long-term dependencies. Inspired by the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs, and use past frames as a prompt to generate future frames. For Causal Generation, we introduce causal temporal attention into VDM, which forces each generated frame to depend on its previous frames. For Frame as Prompt, we inject the conditional frames by concatenating them with noisy frames (frames to be generated) along the temporal axis. Consequently, we present Video Diffusion GPT (ViD-GPT). Based on the two key designs, in each autoregression step, it is able to acquire long-term context from prompting frames concatenated by all previously generated frames. Additionally, we bring the kv-cache mechanism to VDMs, which eliminates the redundant computation from overlapped frames, significantly boosting the inference speed. Extensive experiments demonstrate that our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation. Code will be available at https://github.com/Dawn-LX/Causal-VideoGen.

* Code will be available at https://github.com/Dawn-LX/Causal-VideoGen
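
The two designs named in the ViD-GPT abstract, causal temporal attention and Frame as Prompt, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `causal_temporal_mask` and `masked_attention_weights`, the string frame labels, and the uniform attention scores are all hypothetical, chosen only to show how a lower-triangular mask restricts each frame to itself and earlier frames once prompt frames are concatenated ahead of noisy frames.

```python
import math

def causal_temporal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def masked_attention_weights(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row_scores, row_mask)]
        total = sum(exps)  # never zero: the diagonal is always allowed
        weights.append([e / total for e in exps])
    return weights

# Frame as Prompt (hypothetical labels): conditional frames come first,
# followed by the noisy frames to be generated, along the temporal axis.
prompt_frames = ["p0", "p1"]
noisy_frames = ["n0", "n1", "n2"]
sequence = prompt_frames + noisy_frames

mask = causal_temporal_mask(len(sequence))
# Uniform raw scores, so the effect of the mask alone is visible.
scores = [[0.0] * len(sequence) for _ in sequence]
weights = masked_attention_weights(scores, mask)
```

Because no frame attends to later frames, keys and values computed for earlier frames do not change when new frames are appended, which is the property a kv-cache relies on.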

EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

Jun 13, 2024

Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang

Abstract:Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This interesting property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.

* https://tencentqqgylab.github.io/EMMA

MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Jun 10, 2024

Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang

Abstract:Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with only approximately 0.1× the model size.

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Jun 07, 2024

Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic. Existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim) equipped with SeTok demonstrates superior performance across various tasks, as evidenced by our experimental results. The project page is at https://chocowu.github.io/SeTok-web/.

* Technical Report. The project page: https://chocowu.github.io/SeTok-web/
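
To make the idea of a token count that varies with image complexity concrete, here is a minimal sketch. It does not reproduce SeTok's clustering algorithm, which the abstract does not specify; it swaps in a simple greedy, threshold-based grouping on cosine similarity. The names `cosine`, `group_features`, and the `threshold` parameter are hypothetical, and feature vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(cluster):
    """Element-wise mean of the vectors in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def group_features(features, threshold):
    """Greedily assign each feature to the most similar existing cluster,
    or open a new cluster when no centroid reaches the threshold.
    Returns one merged token (the centroid) per cluster."""
    clusters = []
    for feat in features:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(feat, centroid(cluster))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([feat])
        else:
            best.append(feat)
    return [centroid(c) for c in clusters]

# A visually uniform input collapses to few tokens; a varied one keeps more.
simple_tokens = group_features([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]], threshold=0.8)
complex_tokens = group_features(
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]], threshold=0.8
)
```

In this toy setup the number of output tokens is not fixed in advance; it falls out of how many distinct groups the features form under the chosen threshold.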

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

May 27, 2024

Kai Wang, Yukun Zhou, Mingjia Shi, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Hanwang Zhang, Yang You

Abstract:Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called SpeeD, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
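
The asymmetric sampling strategy described above can be sketched as a non-uniform distribution over time steps. This is an illustration under stated assumptions, not the paper's exact formula: the convergence area is assumed to be every step at or beyond a hypothetical `boundary` index, and `suppression` is a hypothetical factor by which those steps are down-weighted relative to the rest.

```python
import random

def asymmetric_timestep_probs(num_steps, boundary, suppression):
    """Sampling probabilities over time steps 0..num_steps-1.
    Steps at or beyond `boundary` (assumed convergence area) get
    1/suppression of the weight given to the remaining steps."""
    weights = [1.0 if t < boundary else 1.0 / suppression for t in range(num_steps)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_timesteps(probs, batch_size, rng):
    """Draw a batch of time-step indices according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)

probs = asymmetric_timestep_probs(num_steps=1000, boundary=700, suppression=5.0)
rng = random.Random(0)  # seeded only to make the sketch reproducible
batch = sample_timesteps(probs, batch_size=8, rng=rng)
```

Relative to uniform sampling (probability 1/1000 per step here), renormalizing raises the probability of steps outside the assumed convergence area and lowers it inside, which matches the direction of the change the abstract describes.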
