Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Venu Govindaraju

SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Jan 12, 2025

Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju

Figure 1 for SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Figure 2 for SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Figure 3 for SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Figure 4 for SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Abstract:Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.

* Paper accepted at WACV 2025 in round 1

Via

Access Paper or Ask Questions

RealCQA-V2 : Visual Premise Proving

Oct 29, 2024

Saleem Ahmed, Rangaraj Setlur, Venu Govindaraju

Abstract:We introduce Visual Premise Proving (VPP), a novel task tailored to refine the process of chart question answering by deconstructing it into a series of logical premises. Each of these premises represents an essential step in comprehending a chart's content and deriving logical conclusions, thereby providing a granular look at a model's reasoning abilities. This approach represents a departure from conventional accuracy-based evaluation methods, emphasizing the model's ability to sequentially validate each premise and ideally mimic human analytical processes. A model adept at reasoning is expected to demonstrate proficiency in both data retrieval and the structural understanding of charts, suggesting a synergy between these competencies. However, in our zero-shot study using the sophisticated MATCHA model on a scientific chart question answering dataset, an intriguing pattern emerged. The model showcased superior performance in chart reasoning (27\%) over chart structure (19\%) and data retrieval (14\%). This performance gap suggests that models might more readily generalize reasoning capabilities across datasets, benefiting from consistent mathematical and linguistic semantics, even when challenged by changes in the visual domain that complicate structure comprehension and data retrieval. Furthermore, the efficacy of using accuracy of binary QA for evaluating chart reasoning comes into question if models can deduce correct answers without parsing chart data or structure. VPP highlights the importance of integrating reasoning with visual comprehension to enhance model performance in chart analysis, pushing for a balanced approach in evaluating visual data interpretation capabilities.

* Under Review : Code and Data will be made public soon

Via

Access Paper or Ask Questions

Ig3D: Integrating 3D Face Representations in Facial Expression Inference

Aug 29, 2024

Lu Dong, Xiao Wang, Srirangaraj Setlur, Venu Govindaraju, Ifeoma Nwogu

Abstract:Reconstructing 3D faces with facial geometry from single images has allowed for major advances in animation, generative models, and virtual reality. However, this ability to represent faces with their 3D features is not as fully explored by the facial expression inference (FEI) community. This study therefore aims to investigate the impacts of integrating such 3D representations into the FEI task, specifically for facial expression classification and face-based valence-arousal (VA) estimation. To accomplish this, we first assess the performance of two 3D face representations (both based on the 3D morphable model, FLAME) for the FEI tasks. We further explore two fusion architectures, intermediate fusion and late fusion, for integrating the 3D face representations with existing 2D inference frameworks. To evaluate our proposed architecture, we extract the corresponding 3D representations and perform extensive tests on the AffectNet and RAF-DB datasets. Our experimental results demonstrate that our proposed method outperforms the state-of-the-art AffectNet VA estimation and RAF-DB classification tasks. Moreover, our method can act as a complement to other existing methods to boost performance in many emotion inference tasks.

* Accepted by ECCVW 2024

Via

Access Paper or Ask Questions

Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos

Aug 20, 2024

Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju

Abstract:A "match cut" is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create "audio match cuts" within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition. Project page and examples are available at: https://denfed.github.io/audiomatchcut/

* Accepted to ICASSP 2024

Via

Access Paper or Ask Questions

Fine-Grained Engine Fault Sound Event Detection Using Multimodal Signals

Mar 16, 2024

Dennis Fedorishin, Livio Forte III, Philip Schneider, Srirangaraj Setlur, Venu Govindaraju

Abstract:Sound event detection (SED) is an active area of audio research that aims to detect the temporal occurrence of sounds. In this paper, we apply SED to engine fault detection by introducing a multimodal SED framework that detects fine-grained engine faults of automobile engines using audio and accelerometer-recorded vibration. We first introduce the problem of engine fault SED on a dataset collected from a large variety of vehicles with expertly-labeled engine fault sound events. Next, we propose a SED model to temporally detect ten fine-grained engine faults that occur within vehicle engines and further explore a pretraining strategy using a large-scale weakly-labeled engine fault dataset. Through multiple evaluations, we show our proposed framework is able to effectively detect engine fault sound events. Finally, we investigate the interaction and characteristics of each modality and show that fusing features from audio and vibration improves overall engine fault SED capabilities.

* Accepted to ICASSP 2024

Via

Access Paper or Ask Questions

SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

Aug 03, 2023

Saleem Ahmed, Pengyu Yan, David Doermann, Srirangaraj Setlur, Venu Govindaraju

Figure 1 for SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

Figure 2 for SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

Figure 3 for SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

Figure 4 for SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

Abstract:We introduce a novel bottom-up approach for the extraction of chart data. Our model utilizes images of charts as inputs and learns to detect keypoints (KP), which are used to reconstruct the components within the plot area. Our novelty lies in detecting a fusion of continuous and discrete KP as predicted heatmaps. A combination of sparse and dense per-pixel objectives coupled with a uni-modal self-attention-based feature-fusion layer is applied to learn KP embeddings. Further leveraging deep metric learning for unsupervised clustering, allows us to segment the chart plot area into various objects. By further matching the chart components to the legend, we are able to obtain the data series names. A post-processing threshold is applied to the KP embeddings to refine the object reconstructions and improve accuracy. Our extensive experiments include an evaluation of different modules for KP estimation and the combination of deep layer aggregation and corner pooling approaches. The results of our experiments provide extensive evaluation for the task of real-world chart data extraction.

* Accepted ORAL at ICDAR 23

Via

Access Paper or Ask Questions

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

Aug 03, 2023

Saleem Ahmed, Bhavin Jawade, Shubham Pandey, Srirangaraj Setlur, Venu Govindaraju

Figure 1 for RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

Figure 2 for RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

Figure 3 for RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

Figure 4 for RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

Abstract:We present a comprehensive study of chart visual question-answering(QA) task, to address the challenges faced in comprehending and extracting data from chart visualizations within documents. Despite efforts to tackle this problem using synthetic charts, solutions are limited by the shortage of annotated real-world data. To fill this gap, we introduce a benchmark and dataset for chart visual QA on real-world charts, offering a systematic analysis of the task and a novel taxonomy for template-based chart question creation. Our contribution includes the introduction of a new answer type, 'list', with both ranked and unranked variations. Our study is conducted on a real-world chart dataset from scientific literature, showcasing higher visual complexity compared to other works. Our focus is on template-based QA and how it can serve as a standard for evaluating the first-order logic capabilities of models. The results of our experiments, conducted on a real-world out-of-distribution dataset, provide a robust evaluation of large-scale pre-trained models and advance the field of chart visual QA and formal logic verification for neural networks in general.

* This a pre-print version. Accepted at ICDAR '23

Via

Access Paper or Ask Questions

CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

Jul 16, 2023

Bhavin Jawade, Deen Dayal Mohan, Dennis Fedorishin, Srirangaraj Setlur, Venu Govindaraju

Figure 1 for CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

Figure 2 for CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

Figure 3 for CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

Figure 4 for CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion

Abstract:Face recognition from image sets acquired under unregulated and uncontrolled settings, such as at large distances, low resolutions, varying viewpoints, illumination, pose, and atmospheric conditions, is challenging. Face feature aggregation, which involves aggregating a set of N feature representations present in a template into a single global representation, plays a pivotal role in such recognition systems. Existing works in traditional face feature aggregation either utilize metadata or high-dimensional intermediate feature representations to estimate feature quality for aggregation. However, generating high-quality metadata or style information is not feasible for extremely low-resolution faces captured in long-range and high altitude settings. To overcome these limitations, we propose a feature distribution conditioning approach called CoNAN for template aggregation. Specifically, our method aims to learn a context vector conditioned over the distribution information of the incoming feature set, which is utilized to weigh the features based on their estimated informativeness. The proposed method produces state-of-the-art results on long-range unconstrained face recognition datasets such as BTS, and DroneSURF, validating the advantages of such an aggregation strategy.

* Paper accepted at IJCB 2023

Via

Access Paper or Ask Questions

RidgeBase: A Cross-Sensor Multi-Finger Contactless Fingerprint Dataset

Jul 09, 2023

Bhavin Jawade, Deen Dayal Mohan, Srirangaraj Setlur, Nalini Ratha, Venu Govindaraju

Figure 1 for RidgeBase: A Cross-Sensor Multi-Finger Contactless Fingerprint Dataset

Figure 2 for RidgeBase: A Cross-Sensor Multi-Finger Contactless Fingerprint Dataset

Figure 3 for RidgeBase: A Cross-Sensor Multi-Finger Contactless Fingerprint Dataset

Figure 4 for RidgeBase: A Cross-Sensor Multi-Finger Contactless Fingerprint Dataset

Abstract:Contactless fingerprint matching using smartphone cameras can alleviate major challenges of traditional fingerprint systems including hygienic acquisition, portability and presentation attacks. However, development of practical and robust contactless fingerprint matching techniques is constrained by the limited availability of large scale real-world datasets. To motivate further advances in contactless fingerprint matching across sensors, we introduce the RidgeBase benchmark dataset. RidgeBase consists of more than 15,000 contactless and contact-based fingerprint image pairs acquired from 88 individuals under different background and lighting conditions using two smartphone cameras and one flatbed contact sensor. Unlike existing datasets, RidgeBase is designed to promote research under different matching scenarios that include Single Finger Matching and Multi-Finger Matching for both contactless- to-contactless (CL2CL) and contact-to-contactless (C2CL) verification and identification. Furthermore, due to the high intra-sample variance in contactless fingerprints belonging to the same finger, we propose a set-based matching protocol inspired by the advances in facial recognition datasets. This protocol is specifically designed for pragmatic contactless fingerprint matching that can account for variances in focus, polarity and finger-angles. We report qualitative and quantitative baseline results for different protocols using a COTS fingerprint matcher (Verifinger) and a Deep CNN based approach on the RidgeBase dataset. The dataset can be downloaded here: https://www.buffalo.edu/cubs/research/datasets/ridgebase-benchmark-dataset.html

* 2022 IEEE International Joint Conference on Biometrics (IJCB), Abu Dhabi, United Arab Emirates, 2022, pp. 1-9
* Paper accepted at IJCB 2022

Via

Access Paper or Ask Questions

Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Nov 06, 2022

Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju

Figure 1 for Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Figure 2 for Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Figure 3 for Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Figure 4 for Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Abstract:Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps to capture the correlation between the two modalities to localize the source of the sound. In a video, oftentimes, the objects exhibiting movement are the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance on the Soundnet Flickr and VGG Sound Source datasets. Code: https://github.com/denfed/heartheflow.

* Accepted to WACV 2023

Via

Access Paper or Ask Questions