Yang Liu

Generative Active Learning for Long-tailed Instance Segmentation

Jun 04, 2024

Muzhi Zhu, Chengxiang Fan, Hao Chen, Yang Liu, Weian Mao, Xiaogang Xu, Chunhua Shen

Abstract: Recently, large-scale language-image generative models have gained widespread attention, and many works have utilized generated data from these models to further enhance the performance of perception tasks. However, not all generated data can positively impact downstream models, and these methods do not thoroughly explore how to better select and utilize generated data. On the other hand, there is still a lack of research oriented towards active learning on generated data. In this paper, we explore how to perform active learning specifically for generated data in the long-tailed instance segmentation task. Subsequently, we propose BSGAL, a new algorithm that estimates the contribution of generated data online, based on a gradient cache. BSGAL can handle unlimited generated data and complex downstream segmentation tasks effectively. Experiments show that BSGAL outperforms the baseline approach and effectively improves the performance of long-tailed segmentation. Our code can be found at https://github.com/aim-uofa/DiverGen.

* Accepted by ICML 2024
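
The abstract describes estimating the contribution of generated data from a gradient cache but gives no formula. The sketch below shows one common first-order way such an estimate could look, and it is not taken from the paper: a step along the negative gradient of a generated batch is predicted to lower the real-data loss by roughly the learning rate times the dot product of the two gradients. The names `GradientCache`, `estimated_contribution`, `accept`, the momentum value, and the toy gradients are all hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


class GradientCache:
    """Hypothetical cache: a running average of gradients seen on real data."""

    def __init__(self, dim, momentum=0.9):
        self.momentum = momentum
        self.avg = [0.0] * dim

    def update(self, grad):
        # Exponential moving average of real-data gradients.
        m = self.momentum
        self.avg = [m * c + (1.0 - m) * g for c, g in zip(self.avg, grad)]


def estimated_contribution(gen_grad, cache, lr=0.1):
    """First-order estimate of the real-data loss decrease after one
    gradient step on the generated batch: lr * <g_gen, g_real>."""
    return lr * dot(gen_grad, cache.avg)


def accept(gen_grad, cache, lr=0.1):
    """Keep a generated batch only if it is predicted to reduce the loss."""
    return estimated_contribution(gen_grad, cache, lr) > 0.0


cache = GradientCache(dim=2)
cache.update([1.0, 0.0])             # one real-data gradient (toy values)
print(accept([2.0, 5.0], cache))     # aligned with the cached gradient
print(accept([-1.0, 3.0], cache))    # opposed to the cached gradient
```

Because the estimate only needs the cached average and the gradient of the current batch, a check of this kind can run online during training, which is consistent with the abstract's claim that the amount of generated data need not be bounded.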

MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization

Jun 04, 2024

Yu Zhang, Qi Zhang, Zixuan Gong, Yiwei Shi, Yepeng Liu, Duoqian Miao, Yang Liu, Ke Liu, Kun Yi, Wei Fan (+2 more)

Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP's ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform's sensitivity to both high- and low-frequency variations, which complements the spatial domain's sensitivity, limited to low-frequency variations only. By incorporating frequency transforms and token-level alignment, we expand CLIP's single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. Additionally, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains. This allows us to merge tokens into multi-granularity tokens with a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design.

* ICML 2024

DA-HFNet: Progressive Fine-Grained Forgery Image Detection and Localization Based on Dual Attention

Jun 04, 2024

Yang Liu, Xiaofei Li, Jun Zhang, Shengze Hu, Jun Lei

Abstract: The increasing difficulty in accurately detecting forged images generated by AIGC (Artificial Intelligence Generative Content) poses many risks, necessitating the development of effective methods to identify and further locate forged areas. In this paper, to facilitate research efforts, we construct a DA-HFNet forged image dataset guided by text- or image-assisted GAN and diffusion models. Our goal is to utilize a hierarchical progressive network to capture forged artifacts at different scales for detection and localization. Specifically, it relies on a dual-attention mechanism to adaptively fuse multi-modal image features in depth, followed by a multi-branch interaction network that lets image features at different scales interact thoroughly and improves detector performance by leveraging dependencies between layers. Additionally, we extract more sensitive noise fingerprints to obtain more prominent forged artifact features in the forged areas. Extensive experiments validate the effectiveness of our approach, demonstrating significant performance improvements compared to state-of-the-art methods for forged image detection and localization. The code and dataset will be released in the future.

Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection

Jun 02, 2024

Chentao Cao, Zhun Zhong, Zhanke Zhou, Yang Liu, Tongliang Liu, Bo Han

Abstract: Detecting out-of-distribution (OOD) samples is essential when deploying machine learning models in open-world scenarios. Zero-shot OOD detection, requiring no training on in-distribution (ID) data, has become possible with the advent of vision-language models like CLIP. Existing methods build a text-based classifier with only closed-set labels. However, this largely restricts the inherent capability of CLIP to recognize samples from a large and open label space. In this paper, we propose to tackle this constraint by leveraging the expert knowledge and reasoning capability of large language models (LLMs) to Envision potential Outlier Exposure, termed EOE, without access to any actual OOD data. Owing to better adaptation to open-world scenarios, EOE can be generalized to different tasks, including far, near, and fine-grained OOD detection. Technically, we design (1) LLM prompts based on visual similarity to generate potential outlier class labels specialized for OOD detection, as well as (2) a new score function based on potential outlier penalty to distinguish hard OOD samples effectively. Empirically, EOE achieves state-of-the-art performance across different OOD tasks and can be effectively scaled to the ImageNet-1K dataset. The code is publicly available at: https://github.com/tmlr-group/EOE.

* ICML 2024
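
The abstract names a score function based on a potential outlier penalty without stating its form. The following is a minimal sketch of one plausible shape for such a score, under assumptions that are not confirmed by the abstract: image-text similarities for the ID labels and for the LLM-proposed outlier labels share a single softmax, and the score is the largest ID probability minus a weighted largest outlier probability. The function names, `beta`, `temperature`, and the similarity values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def outlier_penalized_score(id_sims, outlier_sims, beta=0.25, temperature=0.01):
    """Higher score means the image looks more in-distribution.

    id_sims:      similarities between the image and each ID class label
    outlier_sims: similarities between the image and each envisioned
                  outlier label (must be non-empty)
    """
    logits = [s / temperature for s in list(id_sims) + list(outlier_sims)]
    probs = softmax(logits)
    k = len(id_sims)
    # Reward confidence in an ID class, penalize confidence in an outlier class.
    return max(probs[:k]) - beta * max(probs[k:])


# Toy similarities: the first image matches an ID label best,
# the second matches an envisioned outlier label best.
score_id_like = outlier_penalized_score([0.30, 0.20], [0.18, 0.15])
score_ood_like = outlier_penalized_score([0.20, 0.19], [0.31, 0.15])
print(score_id_like > score_ood_like)
```

A threshold on this score would then separate ID from OOD inputs; the penalty term matters most for hard OOD samples whose best ID similarity is not much lower than that of genuine ID images.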

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Jun 01, 2024

Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM samples 333x faster than real time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

May 31, 2024

Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin

Abstract: Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs; among these, the success of the Greedy Coordinate Gradient (GCG) attack has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialisation. Then, we combine these improved techniques to develop an efficient jailbreak method, dubbed $\mathcal{I}$-GCG. In our experiments, we evaluate on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.

Text Modality Oriented Image Feature Extraction for Detecting Diffusion-based DeepFake

May 28, 2024

Di Yang, Yihao Huang, Qing Guo, Felix Juefei-Xu, Xiaojun Jia, Run Wang, Geguang Pu, Yang Liu

Abstract: The widespread use of diffusion methods enables the creation of highly realistic images on demand, thereby posing significant risks to the integrity and safety of online information and highlighting the necessity of DeepFake detection. Our analysis of features extracted by traditional image encoders reveals that both low-level and high-level features offer distinct advantages in identifying DeepFake images produced by various diffusion methods. Inspired by this finding, we aim to develop an effective representation that captures both low-level and high-level features to detect diffusion-based DeepFakes. To address the problem, we propose a text modality-oriented feature extraction method, termed TOFE. Specifically, for a given target image, the representation we discovered is a corresponding text embedding that can guide the generation of the target image with a specific text-to-image model. Experiments conducted across ten diffusion types demonstrate the efficacy of our proposed method.

C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models

May 28, 2024

Jiahuan Cao, Yongxin Shi, Dezhi Peng, Yang Liu, Lianwen Jin

Abstract: Classical Chinese Understanding (CCU) holds significant value in preserving and exploring the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks, including classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also yield several findings. Specifically, existing LLMs struggle with CCU tasks and are still inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at https://github.com/SCUT-DLVCLab/C3bench.

* 4 figures and 5 tables

OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

May 27, 2024

Guan Wang, Zhimin Li, Qingchao Chen, Yang Liu

Abstract: Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover, to address another challenge of DSGG, capturing temporal dependencies, we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. The code and models are available at https://github.com/guanw-pku/OED.

* Accepted by CVPR'24

Retro-prob: Retrosynthetic Planning Based on a Probabilistic Model

May 25, 2024

Chengyang Tian, Yangpeng Zhang, Yang Liu

Abstract: Retrosynthesis is a fundamental but challenging task in organic chemistry, with broad applications in fields such as drug design and synthesis. Given a target molecule, the goal of retrosynthesis is to find a series of reactions that can be assembled into a synthetic route starting from purchasable molecules and ending at the target molecule. The uncertainty of reactions used in retrosynthetic planning, which is caused by hallucinations of backward models, has recently been noticed. In this paper, we propose a succinct probabilistic model to describe such uncertainty. Based on the model, we propose a new retrosynthesis planning algorithm called retro-prob to maximize the successful synthesis probability of target molecules, which achieves high efficiency by utilizing the chain rule of derivatives. Experiments on the PaRoutes benchmark show that retro-prob outperforms previous algorithms, retro* and retro-fallback, both in speed and in the quality of synthesis plans.
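
The abstract does not spell out its probabilistic model, so the sketch below only illustrates what a "successful synthesis probability" could mean on a small search tree. It is not the retro-prob algorithm and does not use the chain rule of derivatives. It assumes, hypothetically, that each reaction has an independent feasibility probability, that a reaction succeeds when it is feasible and all its reactants can be made, and that a molecule can be made if it is purchasable or at least one of its reactions succeeds. The function name, data layout, and numbers are invented for illustration.

```python
from math import prod


def molecule_success(mol, purchasable, reactions):
    """Probability that `mol` can be synthesized, assuming an acyclic
    search tree and independent reaction outcomes.

    purchasable: set of molecule names that can be bought
    reactions:   dict mapping a molecule to a list of
                 (feasibility_probability, [reactant names]) options
    """
    if mol in purchasable:
        return 1.0
    p_all_fail = 1.0
    for feasibility, reactants in reactions.get(mol, []):
        # AND node: the reaction works and every reactant is obtainable.
        p_reaction = feasibility * prod(
            molecule_success(r, purchasable, reactions) for r in reactants
        )
        # OR node: the molecule fails only if every option fails.
        p_all_fail *= 1.0 - p_reaction
    return 1.0 - p_all_fail


purchasable = {"A", "B", "D"}
reactions = {
    "T": [(0.8, ["A", "B"]), (0.5, ["C"])],  # two alternative routes to T
    "C": [(0.9, ["D"])],
}
print(molecule_success("T", purchasable, reactions))  # approximately 0.89
```

In this toy tree the second route contributes 0.5 × 0.9 = 0.45, so the target fails only with probability 0.2 × 0.55 = 0.11. A planner that maximizes this quantity would prefer expanding nodes whose improvement raises the target's probability the most.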
