Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chetan Arora

Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Nov 20, 2024

Vaishnavi Khindkar, Vineeth Balasubramanian, Chetan Arora, Anbumani Subramanian, C. V. Jawahar

Figure 1 for Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Figure 2 for Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Figure 3 for Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Figure 4 for Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Abstract:With the increased importance of autonomous navigation systems has come an increasing need to protect the safety of Vulnerable Road Users (VRUs) such as pedestrians. Predicting pedestrian intent is one such challenging task, where prior work predicts the binary cross/no-cross intention with a fusion of visual and motion features. However, there has been no effort so far to hedge such predictions with human-understandable reasons. We address this issue by introducing a novel problem setting of exploring the intuitive reasoning behind a pedestrian's intent. In particular, we show that predicting the 'WHY' can be very useful in understanding the 'WHAT'. To this end, we propose a novel, reason-enriched PIE++ dataset consisting of multi-label textual explanations/reasons for pedestrian intent. We also introduce a novel multi-task learning framework called MINDREAD, which leverages a cross-modal representation learning framework for predicting pedestrian intent as well as the reason behind the intent. Our comprehensive experiments show significant improvement of 5.6% and 7% in accuracy and F1-score for the task of intent prediction on the PIE++ dataset using MINDREAD. We also achieved a 4.4% improvement in accuracy on a commonly used JAAD dataset. Extensive evaluation using quantitative/qualitative metrics and user studies shows the effectiveness of our approach.

Via

Access Paper or Ask Questions

Towards Global Localization using Multi-Modal Object-Instance Re-Identification

Sep 18, 2024

Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna

Abstract:Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.

* 8 pages, 5 figures, 3 tables. Submitted to ICRA 2025

Via

Access Paper or Ask Questions

D-MASTER: Mask Annealed Transformer for Unsupervised Domain Adaptation in Breast Cancer Detection from Mammograms

Jul 09, 2024

Tajamul Ashraf, Krithika Rangarajan, Mohit Gambhir, Richa Gabha, Chetan Arora

Figure 1 for D-MASTER: Mask Annealed Transformer for Unsupervised Domain Adaptation in Breast Cancer Detection from Mammograms

Figure 2 for D-MASTER: Mask Annealed Transformer for Unsupervised Domain Adaptation in Breast Cancer Detection from Mammograms

Figure 3 for D-MASTER: Mask Annealed Transformer for Unsupervised Domain Adaptation in Breast Cancer Detection from Mammograms

Figure 4 for D-MASTER: Mask Annealed Transformer for Unsupervised Domain Adaptation in Breast Cancer Detection from Mammograms

Abstract:We focus on the problem of Unsupervised Domain Adaptation (\uda) for breast cancer detection from mammograms (BCDM) problem. Recent advancements have shown that masked image modeling serves as a robust pretext task for UDA. However, when applied to cross-domain BCDM, these techniques struggle with breast abnormalities such as masses, asymmetries, and micro-calcifications, in part due to the typically much smaller size of region of interest in comparison to natural images. This often results in more false positives per image (FPI) and significant noise in pseudo-labels typically used to bootstrap such techniques. Recognizing these challenges, we introduce a transformer-based Domain-invariant Mask Annealed Student Teacher autoencoder (D-MASTER) framework. D-MASTER adaptively masks and reconstructs multi-scale feature maps, enhancing the model's ability to capture reliable target domain features. D-MASTER also includes adaptive confidence refinement to filter pseudo-labels, ensuring only high-quality detections are considered. We also provide a bounding box annotated subset of 1000 mammograms from the RSNA Breast Screening Dataset (referred to as RSNA-BSD1K) to support further research in BCDM. We evaluate D-MASTER on multiple BCDM datasets acquired from diverse domains. Experimental results show a significant improvement of 9% and 13% in sensitivity at 0.3 FPI over state-of-the-art UDA techniques on publicly available benchmark INBreast and DDSM datasets respectively. We also report an improvement of 11% and 17% on In-house and RSNA-BSD1K datasets respectively. The source code, pre-trained D-MASTER model, along with RSNA-BSD1K dataset annotations is available at https://dmaster-iitd.github.io/webpage.

Via

Access Paper or Ask Questions

The Third Monocular Depth Estimation Challenge

Apr 27, 2024

Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao(+31 more)

Figure 1 for The Third Monocular Depth Estimation Challenge

Figure 2 for The Third Monocular Depth Estimation Challenge

Figure 3 for The Third Monocular Depth Estimation Challenge

Figure 4 for The Third Monocular Depth Estimation Challenge

Abstract:This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set: 10 among them submitted a report describing their approach, highlighting a diffused use of foundational models such as Depth Anything at the core of their method. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.

* To appear in CVPRW2024

Via

Access Paper or Ask Questions

Model Generation from Requirements with LLMs: an Exploratory Study

Apr 09, 2024

Alessio Ferrari, Sallam Abualhaija, Chetan Arora

Abstract:Complementing natural language (NL) requirements with graphical models can improve stakeholders' communication and provide directions for system design. However, creating models from requirements involves manual effort. The advent of generative large language models (LLMs), ChatGPT being a notable example, offers promising avenues for automated assistance in model generation. This paper investigates the capability of ChatGPT to generate a specific type of model, i.e., UML sequence diagrams, from NL requirements. We conduct a qualitative study in which we examine the sequence diagrams generated by ChatGPT for 28 requirements documents of various types and from different domains. Observations from the analysis of the generated diagrams have systematically been captured through evaluation logs, and categorized through thematic analysis. Our results indicate that, although the models generally conform to the standard and exhibit a reasonable level of understandability, their completeness and correctness with respect to the specified requirements often present challenges. This issue is particularly pronounced in the presence of requirements smells, such as ambiguity and inconsistency. The insights derived from this study can influence the practical utilization of LLMs in the RE process, and open the door to novel RE-specific prompting strategies targeting effective model generation.

Via

Access Paper or Ask Questions

ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Apr 01, 2024

Suraj Patni, Aradhye Agarwal, Chetan Arora

Figure 1 for ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Figure 2 for ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Figure 3 for ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Figure 4 for ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Abstract:In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures greater relevant information for SIDE than the usual route of generating pseudo image captions, followed by CLIP based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of 0.059(14% improvement) compared to 0.069 by the current SOTA (VPD). And on KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io

* IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Via

Access Paper or Ask Questions

FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders

Mar 29, 2024

Soumen Basu, Mayuna Gupta, Chetan Madan, Pankaj Gupta, Chetan Arora

Abstract:In recent years, automated Gallbladder Cancer (GBC) detection has gained the attention of researchers. Current state-of-the-art (SOTA) methodologies relying on ultrasound sonography (US) images exhibit limited generalization, emphasizing the need for transformative approaches. We observe that individual US frames may lack sufficient information to capture disease manifestation. This study advocates for a paradigm shift towards video-based GBC detection, leveraging the inherent advantages of spatiotemporal representations. Employing the Masked Autoencoder (MAE) for representation learning, we address shortcomings in conventional image-based methods. We propose a novel design called FocusMAE to systematically bias the selection of masking tokens from high-information regions, fostering a more refined representation of malignancy. Additionally, we contribute the most extensive US video dataset for GBC detection. We also note that, this is the first study on US video-based GBC detection. We validate the proposed methods on the curated dataset, and report a new state-of-the-art (SOTA) accuracy of 96.4% for the GBC detection problem, against an accuracy of 84% by current Image-based SOTA - GBCNet, and RadFormer, and 94.7% by Video-based SOTA - AdaMAE. We further demonstrate the generality of the proposed FocusMAE on a public CT-based Covid detection dataset, reporting an improvement in accuracy by 3.3% over current baselines. The source code and pretrained models are available at: https://gbc-iitd.github.io/focusmae

* To Appear at CVPR 2024

Via

Access Paper or Ask Questions

A Study of Fairness Concerns in AI-based Mobile App Reviews

Jan 16, 2024

Ali Rezaei Nasab, Maedeh Dashti, Mojtaba Shahin, Mansooreh Zahedi, Hourieh Khalajzadeh, Chetan Arora, Peng Liang

Figure 1 for A Study of Fairness Concerns in AI-based Mobile App Reviews

Figure 2 for A Study of Fairness Concerns in AI-based Mobile App Reviews

Figure 3 for A Study of Fairness Concerns in AI-based Mobile App Reviews

Figure 4 for A Study of Fairness Concerns in AI-based Mobile App Reviews

Abstract:With the growing application of AI-based systems in our lives and society, there is a rising need to ensure that AI-based systems are developed and used in a responsible way. Fairness is one of the socio-technical concerns that must be addressed in AI-based systems for this purpose. Unfair AI-based systems, particularly, unfair AI-based mobile apps, can pose difficulties for a significant proportion of the global populace. This paper aims to deeply analyze fairness concerns in AI-based app reviews. We first manually constructed a ground-truth dataset including a statistical sample of fairness and non-fairness reviews. Leveraging the ground-truth dataset, we then developed and evaluated a set of machine learning and deep learning classifiers that distinguish fairness reviews from non-fairness reviews. Our experiments show that our best-performing classifier can detect fairness reviews with a precision of 94%. We then applied the best-performing classifier on approximately 9.5M reviews collected from 108 AI-based apps and identified around 92K fairness reviews. While the fairness reviews appear in 23 app categories, we found that the 'communication' and 'social' app categories have the highest percentage of fairness reviews. Next, applying the K-means clustering technique to the 92K fairness reviews, followed by manual analysis, led to the identification of six distinct types of fairness concerns (e.g., 'receiving different quality of features and services in different platforms and devices' and 'lack of transparency and fairness in dealing with user-generated content'). Finally, the manual analysis of 2,248 app owners' responses to the fairness reviews identified six root causes (e.g., 'copyright issues', 'external factors', 'development cost') that app owners report to justify fairness concerns.

* 15 pages, 4 images, 2 tables, Manuscript submitted to a Journal (2024)

Via

Access Paper or Ask Questions

Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

Nov 08, 2023

Akshit Jindal, Vikram Goyal, Saket Anand, Chetan Arora

Figure 1 for Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

Figure 2 for Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

Figure 3 for Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

Figure 4 for Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

Abstract:Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves(AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.

* 10 pages, 5 figures, paper accepted to WACV 2024

Via

Access Paper or Ask Questions

United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Nov 06, 2023

Siddhant Bansal, Chetan Arora, C. V. Jawahar

Figure 1 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Figure 2 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Figure 3 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Figure 4 for United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Abstract:Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph that represents all the videos of a task as a graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art.

* 13 pages, 6 figures, Accepted in Winter Conference on Applications of Computer Vision (WACV), 2024

Via

Access Paper or Ask Questions