Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fei Yang

Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Oct 29, 2024

Kai Wang, Fei Yang, Bogdan Raducanu, Joost van de Weijer

Figure 1 for Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Figure 2 for Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Figure 3 for Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Figure 4 for Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Abstract:With the advent of large pre-trained vision-language models such as CLIP, prompt learning methods aim to enhance the transferability of the CLIP model. They learn the prompt given few samples from the downstream task given the specific class names as prior knowledge, which we term as semantic-aware classification. However, in many realistic scenarios, we only have access to few samples and knowledge of the class names (e.g., when considering instances of classes). This challenging scenario represents the semantic-agnostic discriminative case. Text-to-Image (T2I) personalization methods aim to adapt T2I models to unseen concepts by learning new tokens and endowing these tokens with the capability of generating the learned concepts. These methods do not require knowledge of class names as a semantic-aware prior. Therefore, in this paper, we first explore Textual Inversion and reveal that the new concept tokens possess both generation and classification capabilities by regarding each category as a single concept. However, learning classifiers from single-concept textual inversion is limited since the learned tokens are suboptimal for the discriminative tasks. To mitigate this issue, we propose Multi-Class textual inversion, which includes a discriminative regularization term for the token updating process. Using this technique, our method MC-TI achieves stronger Semantic-Agnostic Classification while preserving the generation capability of these modifier tokens given only few samples per category. In the experiments, we extensively evaluate MC-TI on 12 datasets covering various scenarios, which demonstrates that MC-TI achieves superior results in terms of both classification and generation outcomes.

* Accepted in WACV 2025. Code link: https://github.com/wangkai930418/mc_ti

Via

Access Paper or Ask Questions

Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

Oct 24, 2024

Luping Wang, Sheng Chen, Linnan Jiang, Shu Pan, Runze Cai, Sen Yang, Fei Yang

Figure 1 for Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

Figure 2 for Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

Figure 3 for Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

Figure 4 for Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

Abstract:The large models, as predicted by scaling raw forecasts, have made groundbreaking progress in many fields, particularly in natural language generation tasks, where they have approached or even surpassed human levels. However, the unprecedented scale of their parameters brings significant computational and storage costs. These large models require substantial computational resources and GPU memory to operate. When adapting large models to specific downstream tasks, their massive parameter scale poses a significant challenge in fine-tuning on hardware platforms with limited computational power and GPU memory. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) offers a practical solution by efficiently adjusting the parameters of large pre-trained models to suit various downstream tasks. Specifically, PEFT adjusts the parameters of pre-trained large models to adapt to specific tasks or domains, minimizing the introduction of additional parameters and the computational resources required. This review mainly introduces the preliminary knowledge of PEFT, the core ideas and principles of various PEFT algorithms, the applications of PEFT, and potential future research directions. By reading this review, we believe that interested parties can quickly grasp the PEFT methodology, thereby accelerating its development and innovation.

Via

Access Paper or Ask Questions

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Oct 10, 2024

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao(+3 more)

Abstract:As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the splitted videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at https://koala36m.github.io/.

* Project page: https://koala36m.github.io/

Via

Access Paper or Ask Questions

GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

Aug 14, 2024

Lei Kang, Fei Yang, Kai Wang, Mohamed Ali Souibgui, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas

Figure 1 for GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

Figure 2 for GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

Figure 3 for GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

Figure 4 for GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

Abstract:Fonts are integral to creative endeavors, design processes, and artistic productions. The appropriate selection of a font can significantly enhance artwork and endow advertisements with a higher level of expressivity. Despite the availability of numerous diverse font designs online, traditional retrieval-based methods for font selection are increasingly being supplanted by generation-based approaches. These newer methods offer enhanced flexibility, catering to specific user preferences and capturing unique stylistic impressions. However, current impression font techniques based on Generative Adversarial Networks (GANs) necessitate the utilization of multiple auxiliary losses to provide guidance during generation. Furthermore, these methods commonly employ weighted summation for the fusion of impression-related keywords. This leads to generic vectors with the addition of more impression keywords, ultimately lacking in detail generation capacity. In this paper, we introduce a diffusion-based method, termed \ourmethod, to generate fonts that vividly embody specific impressions, utilizing an input consisting of a single letter and a set of descriptive impression keywords. The core innovation of \ourmethod lies in the development of dual cross-attention modules, which process the characteristics of the letters and impression keywords independently but synergistically, ensuring effective integration of both types of information. Our experimental results, conducted on the MyFonts dataset, affirm that this method is capable of producing realistic, vibrant, and high-fidelity fonts that are closely aligned with user specifications. This confirms the potential of our approach to revolutionize font generation by accommodating a broad spectrum of user-driven design requirements. Our code is publicly available at \url{https://github.com/leitro/GRIF-DM}.

* Accepted to ECAI2024

Via

Access Paper or Ask Questions

Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts

Aug 02, 2024

Fei Yang

Figure 1 for Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts

Figure 2 for Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts

Figure 3 for Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts

Figure 4 for Automatic Extraction of Relationships among Motivations, Emotions and Actions from Natural Language Texts

Abstract:We propose a new graph-based framework to reveal relationships among motivations, emotions and actions explicitly given natural language texts. A directed acyclic graph is designed to describe human's nature. Nurture beliefs are incorporated to connect outside events and the human's nature graph. No annotation resources are required due to the power of large language models. Amazon Fine Foods Reviews dataset is used as corpus and food-related motivations are focused. Totally 92,990 relationship graphs are generated, of which 63% make logical sense. We make further analysis to investigate error types for optimization direction in future research.

Via

Access Paper or Ask Questions

Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication

Jul 30, 2024

Jingran Xu, Huizhi Wang, Yong Zeng, Xiaoli Xu, Qingqing Wu, Fei Yang, Yan Chen, Abbas Jamalipour

Figure 1 for Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication

Figure 2 for Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication

Figure 3 for Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication

Figure 4 for Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication

Abstract:Integrated super-resolution sensing and communication (ISSAC) has emerged as a promising technology to achieve extremely high precision sensing for those key parameters, such as the angles of the sensing targets. In this paper, we propose an efficient channel estimation scheme enabled by ISSAC for millimeter wave (mmWave) and TeraHertz (THz) systems with a hybrid analog/digital beamforming architecture, where both the pilot overhead and the cost of radio frequency (RF) chains are significantly reduced. The key idea is to exploit the fact that subspace-based super-resolution algorithms such as multiple signal classification (MUSIC) can estimate channel parameters accurately without requiring dedicate a priori known pilots. In particular, the proposed method consists of two stages. First, the angles of the multi-path channel components are estimated in a pilot-free manner during the transmission of data symbols. Second, the multi-path channel coefficients are estimated with very few pilots. Compared to conventional channel estimation schemes that rely solely on channel training, our approach requires the estimation of much fewer parameters in the second stage. Furthermore, with channel multi-path angles obtained, the beamforming gain can be achieved when pilots are sent to estimate the channel path gains. To comprehensively investigate the performance of the proposed scheme, we consider both the basic line-of-sight (LoS) channels and more general multi-path channels. We compare the performance of the minimum mean square error (MMSE) of channel estimation and the resulting beamforming gains of our proposed scheme with the traditional scheme that rely exclusively on channel training. It is demonstrated that our proposed method significantly outperforms the benchmarking scheme. Simulation results are presented to validate our theoretical findings.

* 13 pages, 8 figures

Via

Access Paper or Ask Questions

ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

Jul 03, 2024

Yanfeng Jiang, Ning Sun, Xueshuo Xie, Fei Yang, Tao Li

Figure 1 for ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

Figure 2 for ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

Figure 3 for ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

Figure 4 for ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

Abstract:Vision Transformers (ViTs) have exhibited exceptional performance across diverse computer vision tasks, while their substantial parameter size incurs significantly increased memory and computational demands, impeding effective inference on resource-constrained devices. Quantization has emerged as a promising solution to mitigate these challenges, yet existing methods still suffer from significant accuracy loss at low-bit. We attribute this issue to the distinctive distributions of post-LayerNorm and post-GELU activations within ViTs, rendering conventional hardware-friendly quantizers ineffective, particularly in low-bit scenarios. To address this issue, we propose a novel framework called Activation-Distribution-Friendly post-training Quantization for Vision Transformers, ADFQ-ViT. Concretely, we introduce the Per-Patch Outlier-aware Quantizer to tackle irregular outliers in post-LayerNorm activations. This quantizer refines the granularity of the uniform quantizer to a per-patch level while retaining a minimal subset of values exceeding a threshold at full-precision. To handle the non-uniform distributions of post-GELU activations between positive and negative regions, we design the Shift-Log2 Quantizer, which shifts all elements to the positive region and then applies log2 quantization. Moreover, we present the Attention-score enhanced Module-wise Optimization which adjusts the parameters of each quantizer by reconstructing errors to further mitigate quantization error. Extensive experiments demonstrate ADFQ-ViT provides significant improvements over various baselines in image classification, object detection, and instance segmentation tasks at 4-bit. Specifically, when quantizing the ViT-B model to 4-bit, we achieve a 10.23% improvement in Top-1 accuracy on the ImageNet dataset.

* 28 pages,9 figures

Via

Access Paper or Ask Questions

LocInv: Localization-aware Inversion for Text-Guided Image Editing

May 02, 2024

Chuanming Tang, Kai Wang, Fei Yang, Joost van de Weijer

Figure 1 for LocInv: Localization-aware Inversion for Text-Guided Image Editing

Figure 2 for LocInv: Localization-aware Inversion for Text-Guided Image Editing

Figure 3 for LocInv: Localization-aware Inversion for Text-Guided Image Editing

Figure 4 for LocInv: Localization-aware Inversion for Text-Guided Image Editing

Abstract:Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively.The code will be released at https://github.com/wangkai930418/DPL

* Accepted by CVPR 2024 Workshop AI4CC

Via

Access Paper or Ask Questions

Machine Unlearning for Document Classification

Apr 29, 2024

Lei Kang, Mohamed Ali Souibgui, Fei Yang, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

Figure 1 for Machine Unlearning for Document Classification

Figure 2 for Machine Unlearning for Document Classification

Figure 3 for Machine Unlearning for Document Classification

Figure 4 for Machine Unlearning for Document Classification

Abstract:Document understanding models have recently demonstrated remarkable performance by leveraging extensive collections of user documents. However, since documents often contain large amounts of personal data, their usage can pose a threat to user privacy and weaken the bonds of trust between humans and AI services. In response to these concerns, legislation advocating ``the right to be forgotten" has recently been proposed, allowing users to request the removal of private information from computer systems and neural network models. A novel approach, known as machine unlearning, has emerged to make AI models forget about a particular class of data. In our research, we explore machine unlearning for document classification problems, representing, to the best of our knowledge, the first investigation into this area. Specifically, we consider a realistic scenario where a remote server houses a well-trained model and possesses only a small portion of training data. This setup is designed for efficient forgetting manipulation. This work represents a pioneering step towards the development of machine unlearning methods aimed at addressing privacy concerns in document analysis applications. Our code is publicly available at \url{https://github.com/leitro/MachineUnlearning-DocClassification}.

* Accepted to ICDAR2024

Via

Access Paper or Ask Questions

Parallel Proportional Fusion of Spiking Quantum Neural Network for Optimizing Image Classification

Apr 01, 2024

Zuyu Xu, Kang Shen, Pengnian Cai, Tao Yang, Yuanming Hu, Shixian Chen, Yunlai Zhu, Zuheng Wu, Yuehua Dai, Jun Wang(+1 more)

Abstract:The recent emergence of the hybrid quantum-classical neural network (HQCNN) architecture has garnered considerable attention due to the potential advantages associated with integrating quantum principles to enhance various facets of machine learning algorithms and computations. However, the current investigated serial structure of HQCNN, wherein information sequentially passes from one network to another, often imposes limitations on the trainability and expressivity of the network. In this study, we introduce a novel architecture termed Parallel Proportional Fusion of Quantum and Spiking Neural Networks (PPF-QSNN). The dataset information is simultaneously fed into both the spiking neural network and the variational quantum circuits, with the outputs amalgamated in proportion to their individual contributions. We systematically assess the impact of diverse PPF-QSNN parameters on network performance for image classification, aiming to identify the optimal configuration. Numerical results on the MNIST dataset unequivocally illustrate that our proposed PPF-QSNN outperforms both the existing spiking neural network and the serial quantum neural network across metrics such as accuracy, loss, and robustness. This study introduces a novel and effective amalgamation approach for HQCNN, thereby laying the groundwork for the advancement and application of quantum advantage in artificial intelligent computations.

Via

Access Paper or Ask Questions