Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chang Xu

Towards Memorization-Free Diffusion Models

Apr 01, 2024

Chen Chen, Daochang Liu, Chang Xu

Abstract:Pretrained diffusion models and their outputs are widely accessible due to their exceptional capacity for synthesizing high-quality images and their open-source nature. The users, however, may face litigation risks owing to the models' tendency to memorize and regurgitate training data during inference. To address this, we introduce Anti-Memorization Guidance (AMG), a novel framework employing three targeted guidance strategies for the main causes of memorization: image and caption duplication, and highly specific user prompts. Consequently, AMG ensures memorization-free outputs while maintaining high image quality and text alignment, leveraging the synergy of its guidance methods, each indispensable in its own right. AMG also features an innovative automatic detection system for potential memorization during each step of inference process, allows selective application of guidance strategies, minimally interfering with the original sampling process to preserve output utility. We applied AMG to pretrained Denoising Diffusion Probabilistic Models (DDPM) and Stable Diffusion across various generation tasks. The results demonstrate that AMG is the first approach to successfully eradicates all instances of memorization with no or marginal impacts on image quality and text-alignment, as evidenced by FID and CLIP scores.

* CVPR2024

Via

Access Paper or Ask Questions

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Mar 20, 2024

Li Mi, Chang Xu, Javiera Castillo-Navarro, Syrielle Montariol, Wen Yang, Antoine Bosselut, Devis Tuia

Figure 1 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Figure 2 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Figure 3 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Figure 4 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Abstract:Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task requires accommodating diverse ground images captured by users with varying orientations and reduced field of views (FoVs). However, existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations. Such models heavily depend on the North-aligned spatial correspondence and predefined FoVs in the training data, compromising their robustness across different settings. To tackle this challenge, we propose ConGeo, a single- and cross-modal Contrastive method for Geo-localization: it enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location. As a generic learning objective for cross-view geo-localization, when integrated into state-of-the-art pipelines, ConGeo significantly boosts the performance of three base models on four geo-localization benchmarks for diverse ground view variations and outperforms competing methods that train separate models for each ground view variation.

* Project page at https://chasel-tsui.github.io/ConGeo/

Via

Access Paper or Ask Questions

Learning Cross-view Visual Geo-localization without Ground Truth

Mar 19, 2024

Haoyuan Li, Chang Xu, Wen Yang, Huai Yu, Gui-Song Xia

Abstract:Cross-View Geo-Localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image. Current state-of-the-art methods predominantly rely on training models with labeled paired images, incurring substantial annotation costs and training burdens. In this study, we investigate the adaptation of frozen models for CVGL without requiring ground truth pair labels. We observe that training on unlabeled cross-view images presents significant challenges, including the need to establish relationships within unlabeled data and reconcile view discrepancies between uncertain queries and references. To address these challenges, we propose a self-supervised learning framework to train a learnable adapter for a frozen Foundation Model (FM). This adapter is designed to map feature distributions from diverse views into a uniform space using unlabeled data exclusively. To establish relationships within unlabeled data, we introduce an Expectation-Maximization-based Pseudo-labeling module, which iteratively estimates associations between cross-view features and optimizes the adapter. To maintain the robustness of the FM's representation, we incorporate an information consistency module with a reconstruction loss, ensuring that adapted features retain strong discriminative ability across views. Experimental results demonstrate that our proposed method achieves significant improvements over vanilla FMs and competitive accuracy compared to supervised methods, while necessitating fewer training parameters and relying solely on unlabeled data. Evaluation of our adaptation for task-specific models further highlights its broad applicability.

Via

Access Paper or Ask Questions

Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Mar 18, 2024

Siyu Xu, Yunke Wang, Daochang Liu, Chang Xu

Figure 1 for Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Figure 2 for Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Figure 3 for Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Figure 4 for Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Abstract:Recent advancements in generative AI have suggested that by taking visual prompt, GPT-4V can demonstrate significant proficiency in image recognition task. Despite its impressive capabilities, the financial cost associated with GPT-4V's inference presents a substantial barrier for its wide use. To address this challenge, our work introduces Collage Prompting, a budget-friendly prompting approach that concatenates multiple images into a single visual input. With collage prompt, GPT-4V is able to perform image recognition on several images simultaneously. Based on the observation that the accuracy of GPT-4V's image recognition varies significantly with the order of images within the collage prompt, our method further learns to optimize the arrangement of images for maximum recognition accuracy. A graph predictor is trained to indicate the accuracy of each collage prompt, then we propose an optimization method to navigate the search space of possible image arrangements. Experiment results across various datasets demonstrate the cost-efficiency score of collage prompt is much larger than standard prompt. Additionally, collage prompt with learned arrangement achieves clearly better accuracy than collage prompt with random arrangement in GPT-4V's visual recognition.

Via

Access Paper or Ask Questions

Understanding Robustness of Visual State Space Models for Image Classification

Mar 16, 2024

Chengbin Du, Yanxi Li, Chang Xu

Abstract:Visual State Space Model (VMamba) has recently emerged as a promising architecture, exhibiting remarkable performance in various computer vision tasks. However, its robustness has not yet been thoroughly studied. In this paper, we delve into the robustness of this architecture through comprehensive investigations from multiple perspectives. Firstly, we investigate its robustness to adversarial attacks, employing both whole-image and patch-specific adversarial attacks. Results demonstrate superior adversarial robustness compared to Transformer architectures while revealing scalability weaknesses. Secondly, the general robustness of VMamba is assessed against diverse scenarios, including natural adversarial examples, out-of-distribution data, and common corruptions. VMamba exhibits exceptional generalizability with out-of-distribution data but shows scalability weaknesses against natural adversarial examples and common corruptions. Additionally, we explore VMamba's gradients and back-propagation during white-box attacks, uncovering unique vulnerabilities and defensive capabilities of its novel components. Lastly, the sensitivity of VMamba to image structure variations is examined, highlighting vulnerabilities associated with the distribution of disturbance areas and spatial information, with increased susceptibility closer to the image center. Through these comprehensive studies, we contribute to a deeper understanding of VMamba's robustness, providing valuable insights for refining and advancing the capabilities of deep neural networks in computer vision applications.

* 27 pages

Via

Access Paper or Ask Questions

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Mar 15, 2024

Xiaohuan Pei, Tao Huang, Chang Xu

Figure 1 for EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Figure 2 for EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Figure 3 for EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Figure 4 for EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Abstract:Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution while Transformers offer global reach but escalate computational demands $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates a atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevate the model performance. Experimental results show that, EfficientVMamba scales down the computational complexity while yields competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with $1.3$G FLOPs improves Vim-Ti with $1.5$G FLOPs by a large margin of $5.6\%$ accuracy on ImageNet. Code is available at: \url{https://github.com/TerryPei/EfficientVMamba}.

Via

Access Paper or Ask Questions

LocalMamba: Visual State Space Model with Windowed Selective Scan

Mar 14, 2024

Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu

Figure 1 for LocalMamba: Visual State Space Model with Windowed Selective Scan

Figure 2 for LocalMamba: Visual State Space Model with Windowed Selective Scan

Figure 3 for LocalMamba: Visual State Space Model with Windowed Selective Scan

Figure 4 for LocalMamba: Visual State Space Model with Windowed Selective Scan

Abstract:Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.

Via

Access Paper or Ask Questions

Active Generation for Image Classification

Mar 11, 2024

Tao Huang, Jiaqi Liu, Shan You, Chang Xu

Figure 1 for Active Generation for Image Classification

Figure 2 for Active Generation for Image Classification

Figure 3 for Active Generation for Image Classification

Figure 4 for Active Generation for Image Classification

Abstract:Recently, the growing capabilities of deep generative models have underscored their potential in enhancing image classification accuracy. However, existing methods often demand the generation of a disproportionately large number of images compared to the original dataset, while having only marginal improvements in accuracy. This computationally expensive and time-consuming process hampers the practicality of such approaches. In this paper, we propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model. With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation. It aims to create images akin to the challenging or misclassified samples encountered by the current model and incorporates these generated images into the training set to augment model performance. ActGen introduces an attentive image guidance technique, using real images as guides during the denoising process of a diffusion model. The model's attention on class prompt is leveraged to ensure the preservation of similar foreground object while diversifying the background. Furthermore, we introduce a gradient-based generation guidance method, which employs two losses to generate more challenging samples and prevent the generated images from being too similar to previously generated ones. Experimental results on the CIFAR and ImageNet datasets demonstrate that our method achieves better performance with a significantly reduced number of generated images.

Via

Access Paper or Ask Questions

MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Mar 09, 2024

Xinyao Fan, Yueying Wu, Chang Xu, Yuhao Huang, Weiqing Liu, Jiang Bian

Figure 1 for MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Figure 2 for MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Figure 3 for MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Figure 4 for MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

Abstract:Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.

* International Conference on Learning Representations (ICLR) 2024

Via

Access Paper or Ask Questions

Data-efficient Large Vision Models through Sequential Autoregression

Feb 07, 2024

Jianyuan Guo, Zhiwei Hao, Chengcheng Wang, Yehui Tang, Han Wu, Han Hu, Kai Han, Chang Xu

Figure 1 for Data-efficient Large Vision Models through Sequential Autoregression

Figure 2 for Data-efficient Large Vision Models through Sequential Autoregression

Figure 3 for Data-efficient Large Vision Models through Sequential Autoregression

Figure 4 for Data-efficient Large Vision Models through Sequential Autoregression

Abstract:Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to out-of-domain tasks. However, current endeavors are hamstrung by an over-reliance on colossal models, exemplified by models with upwards of 3B parameters, and the necessity for an extensive corpus of visual data, often comprising a staggering 400B tokens. In this paper, we delve into the development of an efficient, autoregression-based vision model, innovatively architected to operate on a limited dataset. We meticulously demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding during the testing phase. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint, and a marked decrease in training data requirements, thereby paving the way for more sustainable and accessible advancements in the field of generalist vision models. The code is available at https://github.com/ggjy/DeLVM.

* 15 pages

Via

Access Paper or Ask Questions