Abstract:Instance-incremental learning (IIL) focuses on learning continually with data of the same classes. Compared to class-incremental learning (CIL), IIL is seldom explored because it suffers less from catastrophic forgetting (CF). However, in real-world deployment scenarios where the class space is always predefined, continual and cost-effective model promotion, potentially without access to previous data, is a more essential demand than merely retaining knowledge. We therefore define a new and more practical IIL setting: promoting the model's performance, in addition to resisting CF, using only new observations. Two issues have to be tackled in this new setting: 1) the notorious catastrophic forgetting caused by the lack of access to old data, and 2) broadening the existing decision boundary to cover new observations under concept drift. To tackle these problems, our key insight is to moderately broaden the decision boundary toward failure cases while retaining the old boundary. Hence, we propose a novel decision-boundary-aware distillation method that consolidates knowledge into the teacher to ease the student's learning of new knowledge. We also establish benchmarks on the existing CIFAR-100 and ImageNet datasets. Notably, extensive experiments demonstrate that the teacher model can be a better incremental learner than the student model, overturning previous knowledge distillation-based methods that treat the student as the main role.
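
As a rough illustration of the idea (not the authors' implementation; the temperature, loss weight, and EMA momentum below are assumptions), a distillation step of this kind could combine a cross-entropy term that broadens the boundary toward new observations with a distillation term that retains the old boundary, while consolidating the newly learned knowledge back into the teacher:

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x_new, y_new, optimizer,
                      temperature=2.0, kd_weight=1.0, ema_momentum=0.999):
    student.train()
    teacher.eval()

    logits_s = student(x_new)
    with torch.no_grad():
        logits_t = teacher(x_new)

    # Cross-entropy broadens the decision boundary toward the new observations.
    ce = F.cross_entropy(logits_s, y_new)

    # KL distillation from the teacher retains the old decision boundary.
    kd = F.kl_div(
        F.log_softmax(logits_s / temperature, dim=-1),
        F.softmax(logits_t / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = ce + kd_weight * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Consolidate new knowledge into the teacher (here via an EMA update, an assumption),
    # so the teacher rather than the student becomes the accumulating learner.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_momentum).add_(p_s, alpha=1.0 - ema_momentum)
    return loss.item()
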
Abstract:Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet both RGB and 3D data are crucial for anomaly detection, and datasets are seldom completely clean in practical scenarios. To address these challenges, this paper is the first to delve into RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant framework, M3DM-NR, that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I introduces a Suspected References Selection module that selects a few normal samples from the training dataset as references, using the multi-modal features extracted by the Initial Feature Extraction, and a Suspected Anomaly Map Computation module that generates a suspected anomaly map to focus on abnormal regions as reference. Stage-II takes the suspected anomaly maps of the reference samples as reference and uses image, point cloud, and text information to denoise the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.
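
A minimal sketch (our own simplification, not the paper's module) of how suspected normal references could be selected from a noisy training set: samples whose pooled multi-modal features agree most with the rest of the set are kept as references.

import torch
import torch.nn.functional as F

def select_suspected_references(rgb_feats, pcd_feats, num_refs=10):
    """rgb_feats, pcd_feats: (N, D) pooled per-sample features from RGB and 3D encoders."""
    fused = F.normalize(torch.cat([rgb_feats, pcd_feats], dim=-1), dim=-1)  # (N, 2D)
    sim = fused @ fused.t()                     # pairwise cosine similarity
    sim.fill_diagonal_(0.0)
    agreement = sim.mean(dim=-1)                # how "typical" each sample looks
    ref_idx = agreement.topk(num_refs).indices  # most typical samples as references
    return ref_idx
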




Abstract:Multimodal large language models (MLLMs) provide a powerful mechanism for understanding visual information by building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions of images. Our analysis reveals that these hallucinations stem from the inherent summarization mechanism of large language models, which leads to excessive dependence on linguistic tokens while neglecting visual information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations in MLLMs through the integration of noise feature perturbations. The noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights between visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Furthermore, NoiseBoost is the first to enable semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% under human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at https://kaiwu5.github.io/noiseboost.
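
As a rough sketch of the core idea (the interface and noise scale below are our assumptions, not the released code), the perturbation amounts to injecting noise into the projected visual features before they reach the language model during training:

import torch

def perturb_visual_features(visual_feats, noise_std=0.1, training=True):
    """visual_feats: (B, N_vis_tokens, D) projected vision features."""
    if not training:
        return visual_feats
    noise = torch.randn_like(visual_feats) * noise_std  # Gaussian feature perturbation
    return visual_feats + noise

# Hypothetical placement inside an MLLM forward pass:
#   vis = projector(vision_encoder(images))
#   vis = perturb_visual_features(vis, noise_std=0.1, training=self.training)
#   logits = language_model(inputs_embeds=torch.cat([vis, text_embeds], dim=1))
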




Abstract:Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that learns appearance and motion separately, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of the human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos, providing a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Code and models will be made available.
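
A simplified sketch (our assumptions about shapes and encoders, not the released model) of a geometry-aware pose controller that fuses dense SMPL-X rendering maps with sparse skeleton maps into a single per-frame conditioning feature for the video diffusion backbone:

import torch
import torch.nn as nn

class GeometryAwarePoseController(nn.Module):
    def __init__(self, cond_dim=320):
        super().__init__()
        self.dense_encoder = nn.Sequential(   # encodes dense SMPL-X rendering maps
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, cond_dim, 3, stride=2, padding=1),
        )
        self.sparse_encoder = nn.Sequential(  # encodes sparse skeleton maps
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, cond_dim, 3, stride=2, padding=1),
        )

    def forward(self, smplx_maps, skeleton_maps):
        """Both inputs: (B*T, 3, H, W) per-frame condition images."""
        return self.dense_encoder(smplx_maps) + self.sparse_encoder(skeleton_maps)
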




Abstract:Image retrieval aims to identify visually similar images within a database using a given query image. Traditional methods typically employ both global and local features extracted from images for matching, and may also apply re-ranking techniques to enhance accuracy. However, these methods often fail to account for noise in query images, which can stem from natural or human-induced factors and negatively impacts retrieval performance. To mitigate this issue, we introduce a novel setting for low-quality image retrieval and propose an Adaptive Noise-Based Network (AdapNet) to learn robust abstract representations. Specifically, we devise a quality compensation block trained to compensate for various low-quality factors in input images. In addition, we introduce an innovative adaptive noise-based loss function that dynamically adjusts its focus on the gradient according to image quality, thereby strengthening the learning of unknown noisy samples during training and enhancing intra-class compactness. To assess performance, we construct two datasets with low-quality queries, built by applying various types of noise to clean query images from the standard Revisited Oxford and Revisited Paris datasets. Comprehensive experimental results illustrate that AdapNet surpasses state-of-the-art methods on the Noise Revisited Oxford and Noise Revisited Paris benchmarks, while maintaining competitive performance on high-quality datasets. The code and constructed datasets will be made available.
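
An illustrative sketch of a quality-adaptive loss of this flavor (our reading, not the authors' exact formulation; the prototype-based classification term, scale, and weighting rule are assumptions): per-sample losses are re-weighted by an estimated image-quality score so that degraded, noisy samples receive more gradient.

import torch
import torch.nn.functional as F

def adaptive_noise_loss(embeddings, labels, class_centers, quality_scores, alpha=1.0):
    """
    embeddings:     (B, D) L2-normalized image descriptors
    class_centers:  (C, D) L2-normalized class prototypes
    quality_scores: (B,) in [0, 1]; lower means a more degraded input
    """
    logits = embeddings @ class_centers.t()                                # cosine similarities
    per_sample = F.cross_entropy(logits * 30.0, labels, reduction="none")  # scaled CE per sample
    weights = 1.0 + alpha * (1.0 - quality_scores)                         # emphasize low quality
    return (weights * per_sample).mean()
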




Abstract:Text-to-motion synthesis is a crucial task in computer vision. Existing methods are limited in their universality, as they are tailored to single-person or two-person scenarios and cannot be applied to generate motions for more individuals. To achieve number-free motion synthesis, this paper reconsiders motion generation and proposes to unify single- and multi-person motion via a conditional motion distribution. Furthermore, a generation module and an interaction module are designed for our FreeMotion framework to decouple the process of conditional motion generation and ultimately support number-free motion synthesis. In addition, based on our framework, existing single-person motion spatial control methods can be seamlessly integrated, achieving precise control of multi-person motion. Extensive experiments demonstrate the superior performance of our method and its capability to infer single- and multi-human motions simultaneously.
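
A schematic sketch (interfaces and dimensions are our assumptions) of the interaction step that makes the framework agnostic to the number of persons: each person's motion features attend to the features of all the others, so the same module applies whether one or many individuals are present.

import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, motions):
        """motions: (P, T, D) motion features for P persons."""
        P, T, D = motions.shape
        if P == 1:
            return motions  # single person: no interaction needed
        out = []
        for p in range(P):
            q = motions[p:p + 1]                                                   # (1, T, D)
            others = torch.cat([motions[:p], motions[p + 1:]]).reshape(1, (P - 1) * T, D)
            ctx, _ = self.attn(q, others, others)                                  # attend to others
            out.append(q + ctx)
        return torch.cat(out, dim=0)
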




Abstract:Stylized Text-to-Image Generation (STIG) aims to generate images based on text prompts and style reference images. In this paper, we propose a novel framework, dubbed StyleMaster, for this task by leveraging pretrained Stable Diffusion (SD), which addresses problems of previous approaches such as insufficient style and inconsistent semantics. The enhancement lies in two novel modules, namely the multi-source style embedder and the dynamic attention adapter. To provide SD with better style embeddings, the multi-source style embedder considers both global- and local-level visual information along with textual information, which provides complementary style-related and semantic-related knowledge. Additionally, aiming for a better balance between adapter capacity and semantic control, the proposed dynamic attention adapter is applied to the diffusion UNet, with adaptation weights calculated dynamically from the style embeddings. Two objective functions are introduced to optimize the model together with the denoising loss, further enhancing semantic and style consistency. Extensive experiments demonstrate the superiority of StyleMaster over existing methods, rendering images with variable target styles while successfully maintaining the semantic information from the text prompts.
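
A minimal sketch (shapes, gating rule, and placement are our assumptions, not the paper's code) of a dynamic attention adapter: an extra style cross-attention branch is added to the frozen UNet attention output, and its per-sample adaptation weight is predicted on the fly from the style embedding.

import torch
import torch.nn as nn

class DynamicAttentionAdapter(nn.Module):
    def __init__(self, dim=320, style_dim=768, heads=8):
        super().__init__()
        self.style_attn = nn.MultiheadAttention(dim, heads, kdim=style_dim, vdim=style_dim,
                                                batch_first=True)
        self.gate = nn.Sequential(nn.Linear(style_dim, dim), nn.Sigmoid())

    def forward(self, hidden_states, frozen_attn_out, style_embeds):
        """hidden_states: (B, L, dim) UNet tokens; style_embeds: (B, S, style_dim)."""
        style_out, _ = self.style_attn(hidden_states, style_embeds, style_embeds)
        weight = self.gate(style_embeds.mean(dim=1)).unsqueeze(1)  # dynamic per-sample gate
        return frozen_attn_out + weight * style_out
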
Abstract:Transformers have revolutionized point cloud learning, but their quadratic complexity hinders extension to long sequences and burdens limited computational resources. The recent advent of RWKV, a fresh breed of deep sequence models, has shown immense potential for sequence modeling in NLP tasks. In this paper, we present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field, with the necessary modifications for point cloud learning tasks. Specifically, taking embedded point patches as input, we first explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states and a dynamic attention recurrence mechanism. To extract local geometric features simultaneously, we design a parallel branch that efficiently encodes the point cloud in a fixed-radius near-neighbor graph with a graph stabilizer. Furthermore, we design PointRWKV as a multi-scale framework for hierarchical feature learning of 3D point clouds, facilitating various downstream tasks. Extensive experiments on different point cloud learning tasks show that our proposed PointRWKV outperforms Transformer- and Mamba-based counterparts, while saving about 46% of FLOPs, demonstrating its potential as an option for constructing foundational 3D models.
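
An illustrative sketch (our own simplification; a plain row normalization stands in for the paper's graph stabilizer) of how local geometric features can be aggregated over a fixed-radius near-neighbor graph built from point-patch centers:

import torch

def radius_graph_pool(xyz, feats, radius=0.2):
    """
    xyz:   (N, 3) patch-center coordinates
    feats: (N, D) per-patch features
    Returns (N, D) features averaged over neighbors within the given radius.
    """
    dist = torch.cdist(xyz, xyz)               # (N, N) pairwise distances
    adj = (dist < radius).float()              # fixed-radius adjacency (includes self)
    adj = adj / adj.sum(dim=-1, keepdim=True)  # row-normalize neighbor weights
    return adj @ feats
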




Abstract:Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing with pre-trained diffusion models. We observe that both face reenactment and swapping essentially involve combinations of target structure, ID, and attributes. We aim to sufficiently decouple the control of these factors to achieve both tasks in one model. Specifically, our method contains: 1) a Spatial Condition Generator that provides precise landmarks and background; 2) a Plug-and-play Identity Encoder that transfers face embeddings to the text space via a transformer decoder; and 3) an Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned face reenactment/swapping models. Additionally, Face-Adapter seamlessly integrates with various Stable Diffusion models.
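
A schematic sketch (dimensions, query count, and layer sizes are assumptions) of a plug-and-play identity encoder of this kind: face recognition embeddings are mapped into the text token space of the diffusion model by a small transformer decoder with learnable queries.

import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    def __init__(self, face_dim=512, text_dim=768, num_queries=4, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, text_dim))  # learnable queries
        self.proj = nn.Linear(face_dim, text_dim)
        decoder_layer = nn.TransformerDecoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=layers)

    def forward(self, face_embedding):
        """face_embedding: (B, face_dim) -> (B, num_queries, text_dim) pseudo text tokens."""
        memory = self.proj(face_embedding).unsqueeze(1)  # (B, 1, text_dim)
        queries = self.queries.unsqueeze(0).expand(face_embedding.shape[0], -1, -1)
        return self.decoder(tgt=queries, memory=memory)
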




Abstract:In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning. However, their large model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the research state of efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.