Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vishal Chudasama

Open-Set Object Detection By Aligning Known Class Representations

Dec 30, 2024

Hiran Sarkar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik, Vineeth N Balasubramanian

Figure 1 for Open-Set Object Detection By Aligning Known Class Representations

Figure 2 for Open-Set Object Detection By Aligning Known Class Representations

Figure 3 for Open-Set Object Detection By Aligning Known Class Representations

Figure 4 for Open-Set Object Detection By Aligning Known Class Representations

Abstract:Open-Set Object Detection (OSOD) has emerged as a contemporary research direction to address the detection of unknown objects. Recently, few works have achieved remarkable performance in the OSOD task by employing contrastive clustering to separate unknown classes. In contrast, we propose a new semantic clustering-based approach to facilitate a meaningful alignment of clusters in semantic space and introduce a class decorrelation module to enhance inter-cluster separation. Our approach further incorporates an object focus module to predict objectness scores, which enhances the detection of unknown objects. Further, we employ i) an evaluation technique that penalizes low-confidence outputs to mitigate the risk of misclassification of the unknown objects and ii) a new metric called HMP that combines known and unknown precision using harmonic mean. Our extensive experiments demonstrate that the proposed model achieves significant improvement on the MS-COCO & PASCAL VOC dataset for the OSOD task.

* Accepted to WACV'24

Via

Access Paper or Ask Questions

Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Dec 29, 2024

Ayush Ghadiya, Purbayan Kar, Vishal Chudasama, Pankaj Wasnik

Figure 1 for Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Figure 2 for Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Figure 3 for Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Figure 4 for Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Abstract:Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction to identify anomaly events like violence and nudity in videos using only video-level labels. However, this task has substantial challenges, including addressing imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism known as the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.

* Accepted to CVPR'24 MULA Workshop

Via

Access Paper or Ask Questions

Beyond Few-shot Object Detection: A Detailed Survey

Aug 26, 2024

Vishal Chudasama, Hiran Sarkar, Pankaj Wasnik, Vineeth N Balasubramanian, Jayateja Kalla

Abstract:Object detection is a critical field in computer vision focusing on accurately identifying and locating specific objects in images or videos. Traditional methods for object detection rely on large labeled training datasets for each object category, which can be time-consuming and expensive to collect and annotate. To address this issue, researchers have introduced few-shot object detection (FSOD) approaches that merge few-shot learning and object detection principles. These approaches allow models to quickly adapt to new object categories with only a few annotated samples. While traditional FSOD methods have been studied before, this survey paper comprehensively reviews FSOD research with a specific focus on covering different FSOD settings such as standard FSOD, generalized FSOD, incremental FSOD, open-set FSOD, and domain adaptive FSOD. These approaches play a vital role in reducing the reliance on extensive labeled datasets, particularly as the need for efficient machine learning models continues to rise. This survey paper aims to provide a comprehensive understanding of the above-mentioned few-shot settings and explore the methodologies for each FSOD task. It thoroughly compares state-of-the-art methods across different FSOD settings, analyzing them in detail based on their evaluation protocols. Additionally, it offers insights into their applications, challenges, and potential future directions in the evolving field of object detection with limited data.

* 43 pages, 8 figures

Via

Access Paper or Ask Questions

Fiducial Focus Augmentation for Facial Landmark Detection

Feb 23, 2024

Purbayan Kar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik, Vineeth Balasubramanian

Figure 1 for Fiducial Focus Augmentation for Facial Landmark Detection

Figure 2 for Fiducial Focus Augmentation for Facial Landmark Detection

Figure 3 for Fiducial Focus Augmentation for Facial Landmark Detection

Figure 4 for Fiducial Focus Augmentation for Facial Landmark Detection

Abstract:Deep learning methods have led to significant improvements in the performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continue to remain a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.

* Accepted to BMVC'23

Via

Access Paper or Ask Questions

Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection

Jun 04, 2023

Purbayan Kar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik

Figure 1 for Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection

Figure 2 for Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection

Figure 3 for Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection

Figure 4 for Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection

Abstract:Semi-supervised object detection (SSOD) has made significant progress with the development of pseudo-label-based end-to-end methods. However, many of these methods face challenges due to class imbalance, which hinders the effectiveness of the pseudo-label generator. Furthermore, in the literature, it has been observed that low-quality pseudo-labels severely limit the performance of SSOD. In this paper, we examine the root causes of low-quality pseudo-labels and present novel learning mechanisms to improve the label generation quality. To cope with high false-negative and low precision rates, we introduce an adaptive thresholding mechanism that helps the proposed network to filter out optimal bounding boxes. We further introduce a Jitter-Bagging module to provide accurate information on localization to help refine the bounding boxes. Additionally, two new losses are introduced using the background and foreground scores predicted by the teacher and student networks to improvise the pseudo-label recall rate. Furthermore, our method applies strict supervision to the teacher network by feeding strong & weak augmented data to generate robust pseudo-labels so that it can detect small and complex objects. Finally, the extensive experiments show that the proposed network outperforms state-of-the-art methods on MS-COCO and Pascal VOC datasets and allows the baseline network to achieve 100% supervised performance with much less (i.e., 20%) labeled data.

* Accepted at the Efficient Deep Learning for Computer Vision Workshop, CVPR 2023

Via

Access Paper or Ask Questions

M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Jun 05, 2022

Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe

Figure 1 for M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Figure 2 for M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Figure 3 for M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Figure 4 for M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Abstract:Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modality. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, the existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.

* Accepted for publication in the 5th Multimodal Learning and Applications (MULA) Workshop at CVPR 2022

Via

Access Paper or Ask Questions

Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

May 17, 2021

Andrey Ignatov, Cheng-Ming Chiang, Hsien-Kai Kuo, Anastasia Sycheva, Radu Timofte, Min-Hung Chen, Man-Yu Lee, Yu-Syuan Xu, Yu Tseng, Shusong Xu(+31 more)

Figure 1 for Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Figure 2 for Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Figure 3 for Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Figure 4 for Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Abstract:As the quality of mobile cameras starts to play a crucial role in modern smartphones, more and more attention is now being paid to ISP algorithms used to improve various perceptual aspects of mobile photos. In this Mobile AI challenge, the target was to develop an end-to-end deep learning-based image signal processing (ISP) pipeline that can replace classical hand-crafted ISPs and achieve nearly real-time performance on smartphone NPUs. For this, the participants were provided with a novel learned ISP dataset consisting of RAW-RGB image pairs captured with the Sony IMX586 Quad Bayer mobile sensor and a professional 102-megapixel medium format camera. The runtime of all models was evaluated on the MediaTek Dimensity 1000+ platform with a dedicated AI processing unit capable of accelerating both floating-point and quantized neural networks. The proposed solutions are fully compatible with the above NPU and are capable of processing Full HD photos under 60-100 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.

* Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/

Via

Access Paper or Ask Questions