Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Weizhong Zhang

Revisiting Energy-Based Model for Out-of-Distribution Detection

Dec 04, 2024

Yifan Wu, Xichen Ye, Songmin Dai, Dengye Pan, Xiaoqiang Li, Weizhong Zhang, Yifan Chen

Figure 1 for Revisiting Energy-Based Model for Out-of-Distribution Detection

Figure 2 for Revisiting Energy-Based Model for Out-of-Distribution Detection

Figure 3 for Revisiting Energy-Based Model for Out-of-Distribution Detection

Figure 4 for Revisiting Energy-Based Model for Out-of-Distribution Detection

Abstract:Out-of-distribution (OOD) detection is an essential approach to robustifying deep learning models, enabling them to identify inputs that fall outside of their trained distribution. Existing OOD detection methods usually depend on crafted data, such as specific outlier datasets or elaborate data augmentations. While this is reasonable, the frequent mismatch between crafted data and OOD data limits model robustness and generalizability. In response to this issue, we introduce Outlier Exposure by Simple Transformations (OEST), a framework that enhances OOD detection by leveraging "peripheral-distribution" (PD) data. Specifically, PD data are samples generated through simple data transformations, thus providing an efficient alternative to manually curated outliers. We adopt energy-based models (EBMs) to study PD data. We recognize the "energy barrier" in OOD detection, which characterizes the energy difference between in-distribution (ID) and OOD samples and eases detection. PD data are introduced to establish the energy barrier during training. Furthermore, this energy barrier concept motivates a theoretically grounded energy-barrier loss to replace the classical energy-bounded loss, leading to an improved paradigm, OEST*, which achieves a more effective and theoretically sound separation between ID and OOD samples. We perform empirical validation of our proposal, and extensive experiments across various benchmarks demonstrate that OEST* achieves better or similar accuracy compared with state-of-the-art methods.

* This work has been submitted to the IEEE for possible publication

Via

Access Paper or Ask Questions

Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Dec 03, 2024

Xichen Ye, Yifan Wu, Yiwen Xu, Xiaoqiang Li, Weizhong Zhang, Yifan Chen

Figure 1 for Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Figure 2 for Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Figure 3 for Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Figure 4 for Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Abstract:Deep supervised learning has achieved remarkable success across a wide range of tasks, yet it remains susceptible to overfitting when confronted with noisy labels. To address this issue, noise-robust loss functions offer an effective solution for enhancing learning in the presence of label noise. In this work, we systematically investigate the limitation of the recently proposed Active Passive Loss (APL), which employs Mean Absolute Error (MAE) as its passive loss function. Despite the robustness brought by MAE, one of its key drawbacks is that it pays equal attention to clean and noisy samples; this feature slows down convergence and potentially makes training difficult, particularly in large-scale datasets. To overcome these challenges, we introduce a novel loss function class, termed Normalized Negative Loss Functions (NNLFs), which serve as passive loss functions within the APL framework. NNLFs effectively address the limitations of MAE by concentrating more on memorized clean samples. By replacing MAE in APL with our proposed NNLFs, we enhance APL and present a new framework called Active Negative Loss (ANL). Moreover, in non-symmetric noise scenarios, we propose an entropy-based regularization technique to mitigate the vulnerability to the label imbalance. Extensive experiments demonstrate that the new loss functions adopted by our ANL framework can achieve better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks. The source code is available at: https://github.com/Virusdoll/Active-Negative-Loss.

* This work has been submitted to the IEEE for possible publication

Via

Access Paper or Ask Questions

MagicFace: Training-free Universal-Style Human Image Customized Synthesis

Aug 15, 2024

Yibin Wang, Weizhong Zhang, Cheng Jin

Abstract:Existing human image personalized generation methods often require tedious training: either fine-tuning with a few images or retraining on large-scale datasets. In such cases, these methods are prone to overfitting and encounter difficulties when personalizing individuals of diverse styles. Moreover, these training-based approaches also struggle with multi-concept human image customizing. To this end, we propose MagicFace, the first method for universal-style human image personalized synthesis that enables single/multi-concept customization for humans of any style in a training-free manner. MagicFace introduces a coarse-to-fine generation pipeline, involving two sequential stages: semantic scene construction and concept feature injection. This is achieved by our Reference-aware Self-Attention (RSA) and Region-grouped Blend Attention (RBA) mechanisms. Specifically, in the first stage, RSA enables the latent image to query features from reference concepts simultaneously, extracting the coarse-grained overall semantic understanding to facilitate the initial semantic layout establishment. In the second stage, we employ an attention-based semantic segmentation method to pinpoint the generated regions of all concepts in the latent image at each step. Following this, RBA divides the pixels of the latent image into semantic groups, with each group querying fine-grained features from its reference concept, which ensures precise attribute alignment and feature injection. Throughout the two-stage process, a weight mask strategy is employed to ensure the model focuses more on the reference concepts. Extensive experiments demonstrate our superiority in both human-centric subject-to-image synthesis and multi-concept human image customization. Our approach also can be applied to texture transformation, further enhancing its versatility and applicability.

* project page: https://codegoat24.github.io/MagicFace

Via

Access Paper or Ask Questions

TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Jul 21, 2024

Jipeng Zhang, Yaxuan Qin, Renjie Pi, Weizhong Zhang, Rui Pan, Tong Zhang

Figure 1 for TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Figure 2 for TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Figure 3 for TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Figure 4 for TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Abstract:Instruction tuning has achieved unprecedented success in NLP, turning large language models into versatile chatbots. However, the increasing variety and volume of instruction datasets demand significant computational resources. To address this, it is essential to extract a small and highly informative subset (i.e., Coreset) that achieves comparable performance to the full dataset. Achieving this goal poses non-trivial challenges: 1) data selection requires accurate data representations that reflect the training samples' quality, 2) considering the diverse nature of instruction datasets, and 3) ensuring the efficiency of the coreset selection algorithm for large models. To address these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS). Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection. Experimental results show that our algorithm, selecting only 5% of the data, surpasses other unsupervised methods and achieves performance close to that of the full dataset.

* Preprint. Our code and models are available at: https://github.com/2003pro/TAGCOS

Via

Access Paper or Ask Questions

Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Jun 15, 2024

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Figure 1 for Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Figure 2 for Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Figure 3 for Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Figure 4 for Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Abstract:Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at post-training phase} and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. perplexity. Codes will be released.

* 17 pages

Via

Access Paper or Ask Questions

High Fidelity Scene Text Synthesis

May 23, 2024

Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin

Figure 1 for High Fidelity Scene Text Synthesis

Figure 2 for High Fidelity Scene Text Synthesis

Figure 3 for High Fidelity Scene Text Synthesis

Figure 4 for High Fidelity Scene Text Synthesis

Abstract:Scene text synthesis involves rendering specified texts onto arbitrary images. Current methods typically formulate this task in an end-to-end manner but lack effective character-level guidance during training. Besides, their text encoders, pre-trained on a single font type, struggle to adapt to the diverse font styles encountered in practical applications. Consequently, these methods suffer from character distortion, repetition, and absence, particularly in polystylistic scenarios. To this end, this paper proposes DreamText for high-fidelity scene text synthesis. Our key idea is to reconstruct the diffusion training process, introducing more refined guidance tailored to this task, to expose and rectify the model's attention at the character level and strengthen its learning of text regions. This transformation poses a hybrid optimization challenge, involving both discrete and continuous variables. To effectively tackle this challenge, we employ a heuristic alternate optimization strategy. Meanwhile, we jointly train the text encoder and generator to comprehensively learn and utilize the diverse font present in the training dataset. This joint training is seamlessly integrated into the alternate optimization process, fostering a synergistic relationship between learning character embedding and re-estimating character attention. Specifically, in each step, we first encode potential character-generated position information from cross-attention maps into latent character masks. These masks are then utilized to update the representation of specific characters in the current step, which, in turn, enables the generator to correct the character's attention in the subsequent steps. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art.

Via

Access Paper or Ask Questions

Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

May 09, 2024

Yuan Gao, Weizhong Zhang, Wenhan Luo, Lin Ma, Jin-Gang Yu, Gui-Song Xia, Jiayi Ma

Figure 1 for Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

Figure 2 for Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

Figure 3 for Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

Figure 4 for Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

Abstract:We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based relying on loss weights/gradients manipulation, our method is architecture-based with a flexible asymmetric structure for the primary and auxiliary tasks, which produces different networks for training and inference. Specifically, starting from two single task networks/branches (each representing a task), we propose a novel method with evolving networks where only primary-to-auxiliary links exist as the cross-task connections after convergence. These connections can be removed during the primary task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization converging to an architecture with only the single-side primary-to-auxiliary connections. Moreover, our method can be incorporated with optimization-based auxiliary learning approaches. Extensive experiments with six tasks on NYU v2, CityScapes, and Taskonomy datasets using VGG, ResNet, and ViT backbones validate the promising performance. The codes are available at https://github.com/ethanygao/Aux-NAS.

* International Conference on Learning Representations (ICLR), 2024
* Accepted to ICLR 2024

Via

Access Paper or Ask Questions

ELiTe: Efficient Image-to-LiDAR Knowledge Transfer for Semantic Segmentation

May 07, 2024

Zhibo Zhang, Ximing Yang, Weizhong Zhang, Cheng Jin

Abstract:Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the \textit{weak teacher challenge} arises due to repetitive and non-diverse car camera images and sparse, inaccurate ground truth labels. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive knowledge from the Vision Foundation Model (VFM), extensively trained on diverse open-world images. This enables effective knowledge transfer to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and expedite large-scale model training with minimal costs. Additionally, we introduce the Segment Anything Model based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.

* 9 pages, 6 figures, ICME 2024 oral

Via

Access Paper or Ask Questions

Robust Fine-tuning for Pre-trained 3D Point Cloud Models

Apr 25, 2024

Zhibo Zhang, Ximing Yang, Weizhong Zhang, Cheng Jin

Abstract:This paper presents a robust fine-tuning method designed for pre-trained 3D point cloud models, to enhance feature robustness in downstream fine-tuned models. We highlight the limitations of current fine-tuning methods and the challenges of learning robust models. The proposed method, named Weight-Space Ensembles for Fine-Tuning then Linear Probing (WiSE-FT-LP), integrates the original pre-training and fine-tuning models through weight space integration followed by Linear Probing. This approach significantly enhances the performance of downstream fine-tuned models under distribution shifts, improving feature robustness while maintaining high performance on the target distribution. We apply this robust fine-tuning method to mainstream 3D point cloud pre-trained models and evaluate the quality of model parameters and the degradation of downstream task performance. Experimental results demonstrate the effectiveness of WiSE-FT-LP in enhancing model robustness, effectively balancing downstream task performance and model feature robustness without altering the model structures.

* 9 pages, 5 figures

Via

Access Paper or Ask Questions

PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Mar 08, 2024

Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin

Figure 1 for PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Figure 2 for PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Figure 3 for PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Figure 4 for PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Abstract:Image composition involves seamlessly integrating given objects into a specific visual context. The current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion in synthesis and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only slows down inference but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related words to desired regions, addressing the unwanted artifacts shown in the prior method thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.

Via

Access Paper or Ask Questions