Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuhang Li

GiVE: Guiding Visual Encoder to Perceive Overlooked Information

Oct 26, 2024

Junjie Li, Jianghong Ma, Xiaofeng Zhang, Yuhang Li, Jianyang Shi

Abstract:Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.

Via

Access Paper or Ask Questions

TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Oct 24, 2024

Yuhang Li, Priyadarshini Panda

Figure 1 for TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Figure 2 for TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Figure 3 for TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Figure 4 for TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

Abstract:Large language models (LLMs) have revolutionized natural language processing, albeit at the cost of immense memory and computation requirements. Post-training quantization (PTQ) is becoming the de facto method to reduce the memory footprint and improve the inference throughput of LLMs. In this work, we aim to push the upper limit of LLM PTQ by optimizing the weight rounding parameters with the block reconstruction technique, a predominant method in previous vision models. We propose TesseraQ, a new state-of-the-art PTQ technique, to quantize the weights of LLMs to ultra-low bits. To effectively optimize the rounding in LLMs and stabilize the reconstruction process, we introduce progressive adaptive rounding. This approach iteratively transits the soft rounding variables to hard variables during the reconstruction process. Additionally, we optimize the dequantization scale parameters to fully leverage the block reconstruction technique. We demonstrate that TesseraQ can be seamlessly integrated with existing scaling or clipping-based PTQ algorithms such as AWQ and OmniQuant, significantly enhancing their performance and establishing a new state-of-the-art. For instance, when compared to AWQ, TesseraQ improves the wikitext2 perplexity from 14.65 to 6.82 and average downstream accuracy from 50.52 to 59.27 with 2-bit weight-only quantization of LLaMA-2-7B. Across a range of quantization schemes, including W2A16, W3A16, W3A3, and W4A4, TesseraQ consistently exhibits superior performance.

Via

Access Paper or Ask Questions

Optical Generative Models

Oct 23, 2024

Shiqi Chen, Yuhang Li, Hanlong Chen, Aydogan Ozcan

Abstract:Generative models cover various application areas, including image, video and music synthesis, natural language processing, and molecular design, among many others. As digital generative models become larger, scalable inference in a fast and energy-efficient manner becomes a challenge. Here, we present optical generative models inspired by diffusion models, where a shallow and fast digital encoder first maps random noise into phase patterns that serve as optical generative seeds for a desired data distribution; a jointly-trained free-space-based reconfigurable decoder all-optically processes these generative seeds to create novel images (never seen before) following the target data distribution. Except for the illumination power and the random seed generation through a shallow encoder, these optical generative models do not consume computing power during the synthesis of novel images. We report the optical generation of monochrome and multi-color novel images of handwritten digits, fashion products, butterflies, and human faces, following the data distributions of MNIST, Fashion MNIST, Butterflies-100, and Celeb-A datasets, respectively, achieving an overall performance comparable to digital neural network-based generative models. To experimentally demonstrate optical generative models, we used visible light to generate, in a snapshot, novel images of handwritten digits and fashion products. These optical generative models might pave the way for energy-efficient, scalable and rapid inference tasks, further exploiting the potentials of optics and photonics for artificial intelligence-generated content.

* 24 Pages, 9 Figures

Via

Access Paper or Ask Questions

Lying mirror

Oct 20, 2024

Yuhang Li, Shiqi Chen, Bijie Bai, Aydogan Ozcan

Abstract:We introduce an all-optical system, termed the "lying mirror", to hide input information by transforming it into misleading, ordinary-looking patterns that effectively camouflage the underlying image data and deceive the observers. This misleading transformation is achieved through passive light-matter interactions of the incident light with an optimized structured diffractive surface, enabling the optical concealment of any form of secret input data without any digital computing. These lying mirror designs were shown to camouflage different types of input image data, exhibiting robustness against a range of adversarial manipulations, including random image noise as well as unknown, random rotations, shifts, and scaling of the object features. The feasibility of the lying mirror concept was also validated experimentally using a structured micro-mirror array along with multi-wavelength illumination at 480, 550 and 600 nm, covering the blue, green and red image channels. This framework showcases the power of structured diffractive surfaces for visual information processing and might find various applications in defense, security and entertainment.

* 21 Pages, 8 Figures

Via

Access Paper or Ask Questions

Spiking Transformer with Spatial-Temporal Attention

Sep 29, 2024

Donghyun Lee, Yuhang Li, Youngeun Kim, Shiting Xiao, Priyadarshini Panda

Figure 1 for Spiking Transformer with Spatial-Temporal Attention

Figure 2 for Spiking Transformer with Spatial-Temporal Attention

Figure 3 for Spiking Transformer with Spatial-Temporal Attention

Figure 4 for Spiking Transformer with Spatial-Temporal Attention

Abstract:Spiking Neural Networks (SNNs) present a compelling and energy-efficient alternative to traditional Artificial Neural Networks (ANNs) due to their sparse binary activation. Leveraging the success of the transformer architecture, the spiking transformer architecture is explored to scale up dataset size and performance. However, existing works only consider the spatial self-attention in spiking transformer, neglecting the inherent temporal context across the timesteps. In this work, we introduce Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture designed to integrate spatial and temporal information in self-attention with negligible additional computational load. The STAtten divides the temporal or token index and calculates the self-attention in a cross-manner to effectively incorporate spatial-temporal information. We first verify our spatial-temporal attention mechanism's ability to capture long-term temporal dependencies using sequential datasets. Moreover, we validate our approach through extensive experiments on varied datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101. Notably, our cross-attention mechanism achieves an accuracy of 78.39 % on the ImageNet dataset.

Via

Access Paper or Ask Questions

ReSpike: Residual Frames-based Hybrid Spiking Neural Networks for Efficient Action Recognition

Sep 03, 2024

Shiting Xiao, Yuhang Li, Youngeun Kim, Donghyun Lee, Priyadarshini Panda

Abstract:Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks due to the challenges in processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that synergizes the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing film clips into spatial and temporal components, i.e., RGB image Key Frames and event-like Residual Frames, ReSpike leverages ANN for learning spatial information and SNN for learning temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., >30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves comparable performance with prior ANN approaches while bringing better accuracy-energy tradeoff.

Via

Access Paper or Ask Questions

When In-memory Computing Meets Spiking Neural Networks -- A Perspective on Device-Circuit-System-and-Algorithm Co-design

Aug 22, 2024

Abhishek Moitra, Abhiroop Bhattacharjee, Yuhang Li, Youngeun Kim, Priyadarshini Panda

Figure 1 for When In-memory Computing Meets Spiking Neural Networks -- A Perspective on Device-Circuit-System-and-Algorithm Co-design

Figure 2 for When In-memory Computing Meets Spiking Neural Networks -- A Perspective on Device-Circuit-System-and-Algorithm Co-design

Figure 3 for When In-memory Computing Meets Spiking Neural Networks -- A Perspective on Device-Circuit-System-and-Algorithm Co-design

Figure 4 for When In-memory Computing Meets Spiking Neural Networks -- A Perspective on Device-Circuit-System-and-Algorithm Co-design

Abstract:This review explores the intersection of bio-plausible artificial intelligence in the form of Spiking Neural Networks (SNNs) with the analog In-Memory Computing (IMC) domain, highlighting their collective potential for low-power edge computing environments. Through detailed investigation at the device, circuit, and system levels, we highlight the pivotal synergies between SNNs and IMC architectures. Additionally, we emphasize the critical need for comprehensive system-level analyses, considering the inter-dependencies between algorithms, devices, circuit & system parameters, crucial for optimal performance. An in-depth analysis leads to identification of key system-level bottlenecks arising from device limitations which can be addressed using SNN-specific algorithm-hardware co-design techniques. This review underscores the imperative for holistic device to system design space co-exploration, highlighting the critical aspects of hardware and algorithm research endeavors for low-power neuromorphic solutions.

* 19 Pages, 13 Figures

Via

Access Paper or Ask Questions

Unidirectional imaging with partially coherent light

Aug 10, 2024

Guangdong Ma, Che-Yung Shen, Jingxi Li, Luzhe Huang, Cagatay Isil, Fazil Onuralp Ardic, Xilin Yang, Yuhang Li, Yuntian Wang, Md Sadman Sakib Rahman(+1 more)

Figure 1 for Unidirectional imaging with partially coherent light

Figure 2 for Unidirectional imaging with partially coherent light

Figure 3 for Unidirectional imaging with partially coherent light

Figure 4 for Unidirectional imaging with partially coherent light

Abstract:Unidirectional imagers form images of input objects only in one direction, e.g., from field-of-view (FOV) A to FOV B, while blocking the image formation in the reverse direction, from FOV B to FOV A. Here, we report unidirectional imaging under spatially partially coherent light and demonstrate high-quality imaging only in the forward direction (A->B) with high power efficiency while distorting the image formation in the backward direction (B->A) along with low power efficiency. Our reciprocal design features a set of spatially engineered linear diffractive layers that are statistically optimized for partially coherent illumination with a given phase correlation length. Our analyses reveal that when illuminated by a partially coherent beam with a correlation length of ~1.5 w or larger, where w is the wavelength of light, diffractive unidirectional imagers achieve robust performance, exhibiting asymmetric imaging performance between the forward and backward directions - as desired. A partially coherent unidirectional imager designed with a smaller correlation length of less than 1.5 w still supports unidirectional image transmission, but with a reduced figure of merit. These partially coherent diffractive unidirectional imagers are compact (axially spanning less than 75 w), polarization-independent, and compatible with various types of illumination sources, making them well-suited for applications in asymmetric visual information processing and communication.

* 25 Pages, 8 Figures

Via

Access Paper or Ask Questions

A Simple Background Augmentation Method for Object Detection with Diffusion Model

Aug 01, 2024

Yuhang Li, Xin Dong, Chen Chen, Weiming Zhuang, Lingjuan Lyu

Figure 1 for A Simple Background Augmentation Method for Object Detection with Diffusion Model

Figure 2 for A Simple Background Augmentation Method for Object Detection with Diffusion Model

Figure 3 for A Simple Background Augmentation Method for Object Detection with Diffusion Model

Figure 4 for A Simple Background Augmentation Method for Object Detection with Diffusion Model

Abstract:In computer vision, it is well-known that a lack of data diversity will impair model performance. In this study, we address the challenges of enhancing the dataset diversity problem in order to benefit various downstream tasks such as object detection and instance segmentation. We propose a simple yet effective data augmentation approach by leveraging advancements in generative models, specifically text-to-image synthesis technologies like Stable Diffusion. Our method focuses on generating variations of labeled real images, utilizing generative object and background augmentation via inpainting to augment existing training data without the need for additional annotations. We find that background augmentation, in particular, significantly improves the models' robustness and generalization capabilities. We also investigate how to adjust the prompt and mask to ensure the generated content comply with the existing annotations. The efficacy of our augmentation techniques is validated through comprehensive evaluations of the COCO dataset and several other key object detection benchmarks, demonstrating notable enhancements in model performance across diverse scenarios. This approach offers a promising solution to the challenges of dataset enhancement, contributing to the development of more accurate and robust computer vision models.

Via

Access Paper or Ask Questions

Temporal Feature Matters: A Framework for Diffusion Model Quantization

Jul 28, 2024

Yushi Huang, Ruihao Gong, Xianglong Liu, Jing Liu, Yuhang Li, Jiwen Lu, Dacheng Tao

Abstract:The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typically, $t$ from the finite set $\{1, \ldots, T\}$ is encoded into a hypersensitive temporal feature by several modules, entirely independent of the sampling data. However, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory. To address these challenges, we introduce a novel quantization framework: 1)~TIB-based Maintenance: Based on our innovative Temporal Information Block~(TIB) definition, Temporal Information-aware Reconstruction~(TIAR) and Finite Set Calibration~(FSC) are developed to efficiently align full precision temporal features. 2)~Cache-based Maintenance: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3)~Disturbance-aware Selection: Employ temporal feature errors to guide a fine-grained selection for superior maintenance. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets and diffusion models confirms our superior results. Notably, our approach closely matches the performance of the full-precision model under 4-bit quantization. Furthermore, the quantized SD-XL model achieves hardware acceleration of 2.20$\times$ on CPU and 5.76$\times$ on GPU demonstrating its efficiency.

* arXiv admin note: substantial text overlap with arXiv:2311.16503

Via

Access Paper or Ask Questions