Diana Marculescu

SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations

Sep 04, 2023
Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman, Diana Marculescu

Despite significant progress in semi-supervised learning for image object detection, several key issues are yet to be addressed for video object detection: (1) Achieving good performance for supervised video object detection greatly depends on the availability of annotated frames. (2) Despite having large inter-frame correlations in a video, collecting annotations for a large number of frames per video is expensive, time-consuming, and often redundant. (3) Existing semi-supervised techniques on static images can hardly exploit the temporal motion dynamics inherently present in videos. In this paper, we introduce SSVOD, an end-to-end semi-supervised video object detection framework that exploits the motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations. To selectively assemble robust pseudo-labels across groups of frames, we introduce flow-warped predictions from nearby frames for temporal-consistency estimation. In particular, we introduce cross-IoU and cross-divergence based selection methods over a set of estimated predictions to include robust pseudo-labels for bounding boxes and class labels, respectively. To strike a balance between confirmation bias and uncertainty noise in pseudo-labels, we propose a confidence-threshold-based combination of hard and soft pseudo-labels. Our method achieves significant performance improvements over existing methods on the ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets. Code and pre-trained models will be released.
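
As an illustration of the cross-IoU selection idea, the snippet below is a minimal sketch (box format, thresholds, and helper names are assumptions, not the SSVOD code): a pseudo-label box in frame t is kept only if it overlaps strongly with a flow-warped prediction from a nearby frame.

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between boxes in (x1, y1, x2, y2) format: (N, 4) x (M, 4) -> (N, M)."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.maximum(a[:, None, :2], b[None, :, :2])   # top-left of intersection
    rb = torch.minimum(a[:, None, 2:], b[None, :, 2:])   # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-6)

def select_pseudo_boxes(pred_t, warped_preds, iou_thr=0.7):
    """Keep boxes in `pred_t` whose best match across flow-warped predictions
    from neighboring frames exceeds `iou_thr` (a hypothetical threshold)."""
    keep = torch.zeros(pred_t.shape[0], dtype=torch.bool)
    for warped in warped_preds:               # one tensor of warped boxes per neighbor frame
        if warped.numel() == 0:
            continue
        keep |= box_iou(pred_t, warped).max(dim=1).values > iou_thr
    return pred_t[keep]

# Toy usage with made-up boxes: only the first box has a consistent warped match.
pred_t = torch.tensor([[10., 10., 50., 50.], [100., 100., 140., 160.]])
warped = [torch.tensor([[12., 11., 52., 49.]])]   # warped prediction from frame t-1
print(select_pseudo_boxes(pred_t, warped, iou_thr=0.7))
```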

Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers

Aug 21, 2023
Natalia Frumkin, Dibakar Gope, Diana Marculescu

Quantization scale and bit-width are the most important parameters when considering how to quantize a neural network. Prior work focuses on optimizing quantization scales in a global manner through gradient methods (gradient descent and Hessian analysis). Yet, when applying perturbations to quantization scales, we observe a very jagged, highly non-smooth test loss landscape. In fact, small perturbations in quantization scale can greatly affect accuracy, yielding a 0.5-0.8% accuracy boost in 4-bit quantized vision transformers (ViTs). In this regime, gradient methods break down, since they cannot reliably reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to effectively traverse the non-smooth landscape. Additionally, we propose using an InfoNCE loss, which not only helps combat overfitting on the small calibration dataset (1,000 images) but also makes traversing such a highly non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully quantized ViT-Base by 10.30%, 0.78%, and 0.15% for 3-bit, 4-bit, and 8-bit weight quantization levels. Extensive experiments on a variety of CNN and ViT architectures further demonstrate its robustness in extreme quantization scenarios. Our code is available at https://github.com/enyac-group/evol-q
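
To make the evolutionary-search idea concrete, here is a minimal, hypothetical sketch (not the released Evol-Q implementation): a toy linear layer stands in for a ViT block, an MSE against the full-precision output stands in for the calibration objective, and a small (1+lambda)-style loop mutates a single quantization scale.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                      # full-precision weights of a toy layer
x_cal = torch.randn(128, 64)                 # stand-in calibration batch
y_fp = x_cal @ W.T                           # full-precision reference output

def quantize(w, scale, bits=4):
    qmax = 2 ** (bits - 1) - 1
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def calib_loss(scale):
    y_q = x_cal @ quantize(W, scale).T
    return torch.mean((y_q - y_fp) ** 2).item()

# (1 + lambda) evolutionary search: mutate the best scale with small Gaussian noise.
best_scale = W.abs().max().item() / 7        # simple min-max initialization
best_loss = calib_loss(best_scale)
for _ in range(50):                          # generations
    for _ in range(8):                       # lambda = 8 offspring per generation
        cand = best_scale * (1 + 0.05 * torch.randn(()).item())
        loss = calib_loss(cand)
        if loss < best_loss:
            best_scale, best_loss = cand, loss
print(f"best scale {best_scale:.4f}, calibration loss {best_loss:.4f}")
```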

* arXiv admin note: text overlap with arXiv:2211.09643 

MobileTL: On-device Transfer Learning with Inverted Residual Blocks

Dec 05, 2022
Hung-Yueh Chiang, Natalia Frumkin, Feng Liang, Diana Marculescu

Transfer learning on edge devices is challenging due to limited on-device resources. Existing work addresses this issue by training only a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacked layers, e.g., convolution, normalization, and activation layers. Though efficient for inference, IRBs require additional activation maps to be stored in memory when training the convolution weights and normalization scales. As a result, their high memory cost prohibits training IRBs on resource-limited edge devices, making them unsuitable for transfer learning. To address this issue, we present MobileTL, a memory- and computationally efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass. In addition, MobileTL approximates the backward computation of the activation layer (e.g., Hard-Swish and ReLU6) as a signed function, which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to the output) rather than propagating the gradient through the whole network, reducing the computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks, while incurring only a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work in transfer learning for edge devices.
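
The activation-memory trick can be illustrated with a small, hypothetical PyTorch sketch assuming a ReLU6 activation (for ReLU6 the binary mask equals the true derivative; for Hard-Swish it would only be an approximation). This is not the actual MobileTL implementation, just the storage idea: keep a boolean mask for the backward pass instead of the full input activation.

```python
import torch

class MaskedReLU6(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        mask = (x > 0) & (x < 6)             # only 1 bit of information per element
        ctx.save_for_backward(mask)          # the full activation map is not stored
        return x.clamp(0, 6)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask               # gradient flows only through the linear region

x = torch.randn(4, 8, requires_grad=True)
y = MaskedReLU6.apply(x).sum()
y.backward()
print(x.grad)
```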

CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers

Nov 17, 2022
Natalia Frumkin, Dibakar Gope, Diana Marculescu

When considering post-training quantization, prior work has typically focused on developing a mixed-precision scheme or learning the best way to partition a network for quantization. In our work, CPT-V, we look at a general way to improve the accuracy of networks that have already been quantized, simply by perturbing the quantization scales. Borrowing the idea of contrastive loss from self-supervised learning, we find a robust way to jointly minimize a loss function using just 1,000 calibration images. To determine the best-performing quantization scale, CPT-V contrasts the features of quantized and full-precision models in a self-supervised fashion. Unlike traditional reconstruction-based loss functions, the contrastive loss not only rewards similarity between the quantized and full-precision outputs but also helps distinguish the quantized output from other outputs within a given batch. In addition, in contrast to prior work, CPT-V proposes a block-wise evolutionary search to minimize a global contrastive loss objective, allowing for accuracy improvement of existing vision transformer (ViT) quantization schemes. For example, CPT-V improves the top-1 accuracy of a fully quantized ViT-Base by 10.30%, 0.78%, and 0.15% for 3-bit, 4-bit, and 8-bit weight quantization levels. Extensive experiments on a variety of other ViT architectures further demonstrate its robustness in extreme quantization scenarios. Our code is available at <link>.
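
A hypothetical sketch of such a batch-wise contrastive objective (an InfoNCE-style loss, not the paper's code) could look as follows: each quantized feature is pulled toward the full-precision feature of the same image and pushed away from the other images in the calibration batch.

```python
import torch
import torch.nn.functional as F

def contrastive_quant_loss(feat_q, feat_fp, temperature=0.1):
    """feat_q, feat_fp: (B, D) features from the quantized / full-precision models."""
    q = F.normalize(feat_q, dim=1)
    p = F.normalize(feat_fp, dim=1)
    logits = q @ p.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(q.shape[0])        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random stand-in features.
feat_fp = torch.randn(16, 256)
feat_q = feat_fp + 0.1 * torch.randn(16, 256)  # quantized features = slightly perturbed
print(contrastive_quant_loss(feat_q, feat_fp).item())
```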

AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

Oct 11, 2022
Tanvir Mahmud, Diana Marculescu

An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory signals in a video segment. Precise localization of AVEs is challenging since it demands effective multi-modal feature correspondence to ground both short- and long-range temporal interactions. Existing approaches struggle to capture the different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to effectively operate on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP, pre-trained on audio-image pairs, into the AVE localization task on video frames through contrastive fine-tuning, effective mean video feature extraction, and multi-scale training phases. (2) We propose a multi-domain attention mechanism that operates on both temporal and feature domains over varying timescales to fuse local and global feature variations. (3) We introduce a temporal refining scheme with event-guided attention, followed by a simple yet effective post-processing step to handle significant variations of the background over diverse events. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement, demonstrating its advantage over existing approaches.
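
As a simplified illustration of attention over multiple temporal windows, the sketch below applies self-attention within windows of several sizes and averages the results; the window sizes, fusion by averaging, and toy dimensions are assumptions, not the exact AVE-CLIP architecture.

```python
import torch
import torch.nn as nn

class MultiWindowTemporalAttention(nn.Module):
    def __init__(self, dim=256, heads=4, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in window_sizes]
        )

    def forward(self, x):                     # x: (B, T, D), T divisible by each window size
        B, T, D = x.shape
        outs = []
        for w, attn in zip(self.window_sizes, self.attn):
            xw = x.reshape(B * T // w, w, D)  # split the sequence into windows of length w
            out, _ = attn(xw, xw, xw)         # attend only within each window
            outs.append(out.reshape(B, T, D))
        return torch.stack(outs).mean(dim=0)  # fuse the temporal scales

feats = torch.randn(2, 16, 256)               # 16 frames of fused audio-visual features
print(MultiWindowTemporalAttention()(feats).shape)
```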

* Accepted in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023 (10 Pages, 5 Figures) 

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Oct 09, 2022
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate that mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of 2017's supervised specialist models without dataset-specific adaptations.
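
A conceptual sketch of mask prompt tuning under simple assumptions (one learnable prompt per patch position, operating on already-computed patch embeddings; not the OVSeg implementation): embeddings of patches in the blank region of a masked image are swapped for learnable prompt tokens before entering the frozen image encoder, so no CLIP weights need to change.

```python
import torch
import torch.nn as nn

class MaskPromptTuning(nn.Module):
    def __init__(self, num_patches=196, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(num_patches, dim))  # learnable prompt tokens

    def forward(self, patch_embeds, patch_mask):
        """patch_embeds: (B, N, D) patch embeddings of a masked image.
        patch_mask: (B, N) boolean, True where the patch lies inside the proposal mask."""
        keep = patch_mask.unsqueeze(-1).float()
        # Keep real content inside the mask, substitute prompts in the blank area.
        return keep * patch_embeds + (1 - keep) * self.prompts.unsqueeze(0)

embeds = torch.randn(2, 196, 768)
mask = torch.rand(2, 196) > 0.5
print(MaskPromptTuning()(embeds, mask).shape)
```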

* Project page: https://jeff-liangf.github.io/projects/ovseg 

QUIDAM: A Framework for Quantization-Aware DNN Accelerator and Model Co-Exploration

Jun 30, 2022
Ahmet Inci, Siri Garudanagiri Virupaksha, Aman Jain, Ting-Wu Chin, Venkata Vivek Thallam, Ruizhou Ding, Diana Marculescu

As the machine learning and systems communities strive for higher energy efficiency through custom deep neural network (DNN) accelerators, varied precision or quantization levels, and model compression techniques, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements into the accelerator design space while providing accurate and fast power, performance, and area models. In this work, we present QUIDAM, a highly parameterized quantization-aware DNN accelerator and model co-exploration framework. Our framework can facilitate future research on design space exploration of DNN accelerators across various design choices, such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, total number of processing elements, and DNN configurations. Our results show that different bit precisions and processing element types lead to significant differences in performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy vary by more than 5x and 35x, respectively. With the proposed framework, we show that lightweight processing elements achieve on-par accuracy and up to 5.7x improvement in performance per area and energy when compared to the best INT16-based implementation. Finally, due to the efficiency of the pre-characterized power, performance, and area models, QUIDAM can speed up the design exploration process by 3-4 orders of magnitude, as it removes the need for expensive synthesis and characterization of each design.
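
To give a flavor of the kind of sweep such a co-exploration framework enables, here is a toy enumeration with a made-up analytical cost model: the design knobs mirror those listed above, but the formulas and numbers are hypothetical placeholders, not QUIDAM's pre-characterized power, performance, and area models.

```python
from itertools import product

bit_widths = [4, 8, 16]
num_pes = [64, 256, 1024]
scratchpad_kb = [8, 32]

def proxy_models(bits, pes, spad):
    # Purely illustrative cost model: narrower PEs give more ops per unit area.
    perf = pes * (16 / bits)
    area = pes * (bits / 16) * (1 + 0.05 * spad)
    return perf, area

designs = []
for bits, pes, spad in product(bit_widths, num_pes, scratchpad_kb):
    perf, area = proxy_models(bits, pes, spad)
    designs.append((perf / area, bits, pes, spad))

# Rank design points by the performance-per-area proxy and show the top 3.
for ppa, bits, pes, spad in sorted(designs, reverse=True)[:3]:
    print(f"perf/area={ppa:6.2f}  bits={bits:2d}  PEs={pes:4d}  scratchpad={spad}KB")
```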

* 25 pages, 12 figures. arXiv admin note: substantial text overlap with arXiv:2205.13045, arXiv:2205.08648 

Efficient Deep Learning Using Non-Volatile Memory Technology

Jun 27, 2022
Ahmet Inci, Mehmet Meric Isgenc, Diana Marculescu

Embedded machine learning (ML) systems have now become the dominant platform for deploying ML serving tasks and are projected to become of equal importance for training ML models. With this comes the challenge of overall efficient deployment, in particular low power and high throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as STT-MRAM and SOT-MRAM have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While prior work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2x and 2.4x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
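
The core iso-capacity comparison reduces to computing an energy-delay product (EDP) per technology; the toy helper below only illustrates that bookkeeping, with arbitrary placeholder values normalized to SRAM rather than DeepNVM++'s circuit-level models or the paper's reported results.

```python
def edp(energy, delay):
    """Energy-delay product for one cache configuration (units cancel in ratios)."""
    return energy * delay

# Hypothetical per-access energy and latency, normalized to SRAM (illustrative only).
techs = {
    "SRAM":     (1.00, 1.00),
    "STT-MRAM": (0.70, 0.60),
    "SOT-MRAM": (0.60, 0.55),
}

base = edp(*techs["SRAM"])
for name, (e, d) in techs.items():
    print(f"{name:9s} EDP reduction vs SRAM: {base / edp(e, d):4.1f}x")
```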

* This article will appear as a book chapter in the book titled "Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing". arXiv admin note: substantial text overlap with arXiv:2012.04559 

Play It Cool: Dynamic Shifting Prevents Thermal Throttling

Jun 22, 2022
Yang Zhou, Feng Liang, Ting-wu Chin, Diana Marculescu

Machine learning (ML) has entered the mobile era, where an enormous number of ML models are deployed on edge devices. However, continuously running common ML models on edge devices may generate excessive heat from the computation, forcing the device to "slow down" to prevent overheating, a phenomenon called thermal throttling. This paper studies the impact of thermal throttling on mobile phones: when it occurs, the CPU clock frequency is reduced, and the model inference latency may increase dramatically. This unpleasant, inconsistent behavior has a substantial negative effect on user experience, but it has been overlooked for a long time. To counter thermal throttling, we propose to utilize dynamic networks with shared weights and seamlessly shift between large and small ML models according to the device's thermal profile, i.e., shifting to a small model when the system is about to throttle. With the proposed dynamic shifting, the application runs consistently without experiencing CPU clock frequency degradation or latency increases. In addition, we study the resulting accuracy when dynamic shifting is deployed and show that our approach provides a reasonable trade-off between model latency and model accuracy.
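
A minimal sketch of the control loop implied above: the `read_cpu_temp()` hook, the guard temperature, and the two stand-in linear models are all hypothetical placeholders (in the described approach the large and small networks share weights, which is not modeled here).

```python
import random
import torch
import torch.nn as nn

big_model = nn.Linear(128, 10)      # stand-in for the large network
small_model = nn.Linear(128, 10)    # stand-in for the small (weight-shared) network

def read_cpu_temp() -> float:
    """Hypothetical sensor hook; replace with a platform-specific thermal API."""
    return random.uniform(30.0, 80.0)

THROTTLE_GUARD_C = 65.0             # shift before the OS thermal governor kicks in

def infer(x: torch.Tensor) -> torch.Tensor:
    model = small_model if read_cpu_temp() >= THROTTLE_GUARD_C else big_model
    with torch.no_grad():
        return model(x)

print(infer(torch.randn(1, 128)).shape)
```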

* ICML DyNN Workshop 2022 Spotlight 

SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners

May 28, 2022
Feng Liang, Yangguang Li, Diana Marculescu

Self-supervised Masked Autoencoders (MAE) are emerging as a new pre-training paradigm in computer vision. MAE learns semantics implicitly by reconstructing local patches, requiring thousands of pre-training epochs to achieve favorable performance. This paper incorporates explicit supervision, i.e., golden labels, into the MAE framework. The proposed Supervised MAE (SupMAE) exploits only a visible subset of image patches for classification, unlike standard supervised pre-training, where all image patches are used. SupMAE is efficient and achieves performance comparable to MAE using only 30% of the compute when evaluated on ImageNet with the ViT-B/16 model. Detailed ablation studies are conducted to verify the proposed components.
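
A compact sketch of the core idea under simplifying assumptions (a tiny transformer stands in for ViT, random per-image masking, mean pooling; not the official SupMAE code): keep a visible subset of patch tokens, encode them, and attach a classification head so supervision comes from golden labels while most patches are dropped as in MAE.

```python
import torch
import torch.nn as nn

class TinySupMAE(nn.Module):
    def __init__(self, num_patches=196, dim=192, num_classes=1000, keep_ratio=0.25):
        super().__init__()
        self.keep = int(num_patches * keep_ratio)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):                      # (B, N, D) embedded patches
        B, N, D = patch_tokens.shape
        idx = torch.rand(B, N).argsort(dim=1)[:, :self.keep]        # random visible subset
        visible = torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        feats = self.encoder(visible).mean(dim=1)                   # pool visible tokens only
        return self.head(feats)                                     # logits for CE loss on labels

tokens = torch.randn(2, 196, 192)
print(TinySupMAE()(tokens).shape)                         # -> torch.Size([2, 1000])
```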

* Technical report. Codes are available 