Rama Chellappa

GADER: GAit DEtection and Recognition in the Wild

Jul 27, 2023
Yuxiang Guo, Cheng Peng, Ram Prabhakar, Chun Pong Lau, Rama Chellappa

Gait recognition holds the promise of robustly identifying subjects based on their walking patterns instead of color information. While previous approaches have performed well on curated indoor scenes, their applicability is significantly limited in unconstrained situations, e.g. outdoor, long-distance scenes. We propose an end-to-end GAit DEtection and Recognition (GADER) algorithm for human authentication in challenging outdoor scenarios. Specifically, GADER leverages a Double Helical Signature to detect segments of human movement and incorporates a novel gait recognition method, which learns representations by distilling from an auxiliary RGB recognition model. At inference time, GADER only uses the silhouette modality but benefits from a more robust representation. Extensive experiments on indoor and outdoor datasets demonstrate that the proposed method outperforms the state of the art in gait recognition and verification, with a significant 20.6% improvement on unconstrained, long-distance scenes.
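
The sketch below illustrates the kind of distillation described in the abstract: a silhouette-only student encoder trained with an identity loss plus a feature-matching loss against a frozen RGB teacher. Module names, dimensions, and the loss weighting are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: distill a silhouette-only gait encoder from an RGB teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SilhouetteEncoder(nn.Module):
    """Toy encoder over silhouette frames (B, T, 1, H, W) -> (B, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x):                               # x: (B, T, 1, H, W)
        b, t = x.shape[:2]
        f = self.conv(x.flatten(0, 1)).flatten(1)       # per-frame features (B*T, 64)
        return self.proj(f).view(b, t, -1).mean(1)      # temporal average -> (B, D)

def distillation_step(student, teacher, silhouettes, rgb_frames, labels, classifier):
    """One training step: ID loss on silhouettes + feature distillation from the frozen RGB teacher."""
    with torch.no_grad():
        t_feat = teacher(rgb_frames)                    # auxiliary RGB features (B, D)
    s_feat = student(silhouettes)                       # silhouette features (B, D)
    id_loss = F.cross_entropy(classifier(s_feat), labels)      # classifier: nn.Linear(D, num_ids)
    distill_loss = 1 - F.cosine_similarity(s_feat, t_feat).mean()
    return id_loss + distill_loss
```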


EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Jul 11, 2023
Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement over the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representations during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion-in-the-backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2 by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
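
As a rough illustration of "fusion in the backbone", the sketch below shows a transformer layer that can optionally cross-attend to tokens from the other modality, so the same weights can act as a unimodal encoder (fusion off) or a fusion encoder (fusion on). The gating scheme and dimensions are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch: a backbone layer with switchable cross-modal attention.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))          # tanh(0) = 0, so fusion starts as a no-op
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, other=None, fuse=False):
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        if fuse and other is not None:                    # cross-modal attention, only in fusion mode
            x = x + torch.tanh(self.gate) * self.cross_attn(self.n2(x), other, other)[0]
        return x + self.mlp(self.n3(x))

# The same layer serves both roles: dual-encoder mode and fusion mode.
layer = FusionLayer()
video_tokens = torch.randn(2, 16, 768)
text_tokens = torch.randn(2, 8, 768)
dual = layer(video_tokens, fuse=False)                    # behaves like a unimodal encoder layer
fused = layer(video_tokens, other=text_tokens, fuse=True) # cross-attends to the text tokens
```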

* 22 pages, 11 figures 

SMC-UDA: Structure-Modal Constraint for Unsupervised Cross-Domain Renal Segmentation

Jun 14, 2023
Zhusi Zhong, Jie Li, Lulu Bi, Li Yang, Ihab Kamel, Rama Chellappa, Xinbo Gao, Harrison Bai, Zhicheng Jiao

Medical image segmentation based on deep learning often fails when deployed on images from a different domain. Domain adaptation methods aim to address this domain shift, but still face limitations: transfer learning methods require annotations on the target domain, and generative unsupervised domain adaptation (UDA) models ignore domain-specific representations, so the quality of their generated images strongly restricts segmentation performance. In this study, we propose a novel Structure-Modal Constrained (SMC) UDA framework based on a discriminative paradigm and introduce edge structure as a bridge between domains. The proposed multi-modal learning backbone distills structure information from image texture to distinguish domain-invariant edge structure. With structure-constrained self-learning and progressive ROI, our method segments the kidney by locating the 3D spatial structure of the edge. We evaluated SMC-UDA on public renal segmentation datasets, adapting from a labeled source domain (CT) to an unlabeled target domain (CT/MRI). The experiments show that the proposed SMC-UDA generalizes well and outperforms generative UDA methods.
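
A heavily simplified sketch of the edge-as-bridge idea follows: a fixed Sobel operator turns both source (CT) and target (MRI) slices into edge maps, a segmenter is supervised on the labeled source edges, and confident predictions on the unlabeled target edges provide pseudo-labels for self-training. `segmenter` is a hypothetical 2D segmentation network; the paper's multi-modal 3D backbone and progressive ROI are omitted here.

```python
# Minimal sketch: edge maps as a domain-invariant input, plus confidence-filtered self-training.
import torch
import torch.nn.functional as F

def sobel_edges(x):
    """x: (B, 1, H, W) intensity image -> (B, 1, H, W) edge magnitude."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def uda_step(segmenter, src_img, src_mask, tgt_img, conf_thresh=0.9):
    """src_mask: (B, H, W) integer labels; returns supervised + pseudo-label loss."""
    src_loss = F.cross_entropy(segmenter(sobel_edges(src_img)), src_mask)
    with torch.no_grad():
        tgt_prob = segmenter(sobel_edges(tgt_img)).softmax(1)
        conf, pseudo = tgt_prob.max(1)                    # pixel-wise confidence and pseudo-labels
    tgt_logits = segmenter(sobel_edges(tgt_img))
    tgt_loss = F.cross_entropy(tgt_logits, pseudo, reduction='none')
    tgt_loss = (tgt_loss * (conf > conf_thresh)).mean()   # train only on confident target pixels
    return src_loss + tgt_loss
```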

* conference 

DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection

May 23, 2023
Jiang Liu, Chun Pong Lau, Rama Chellappa

Increasingly pervasive facial recognition (FR) systems raise serious concerns about personal privacy, especially for the billions of users who have publicly shared their photos on social media. Several attempts have been made to protect individuals from being identified by unauthorized FR systems by using adversarial attacks to generate encrypted face images. However, existing methods suffer from poor visual quality or low attack success rates, which limit their utility. Recently, diffusion models have achieved tremendous success in image generation. In this work, we ask: can diffusion models be used to generate adversarial examples that improve both visual quality and attack performance? We propose DiffProtect, which utilizes a diffusion autoencoder to generate semantically meaningful perturbations on FR systems. Extensive experiments demonstrate that DiffProtect produces more natural-looking encrypted images than state-of-the-art methods while achieving significantly higher attack success rates, e.g., 24.5% and 25.1% absolute improvements on the CelebA-HQ and FFHQ datasets.
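
The following is a generic sketch of a latent-space adversarial attack in the spirit described above: a semantic latent code is perturbed and decoded, and the decoded face is pushed away from the original identity embedding while the code stays close to the original. `decoder` and `fr_model` are placeholders for a diffusion-autoencoder decoder and an FR embedder; DiffProtect's actual losses and optimization differ.

```python
# Minimal sketch: optimize a semantic latent code so the decoded face no longer matches its identity.
import torch

def latent_attack(decoder, fr_model, z_init, steps=50, lr=0.01, lam=0.1):
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    with torch.no_grad():
        own_emb = fr_model(decoder(z_init))          # identity embedding of the unprotected face
    for _ in range(steps):
        img = decoder(z)
        emb = fr_model(img)
        # push the embedding away from the original identity...
        attack_loss = torch.cosine_similarity(emb, own_emb, dim=-1).mean()
        # ...while keeping the semantic code (and hence the image) close to the original
        reg = lam * (z - z_init).pow(2).mean()
        loss = attack_loss + reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder(z.detach())
```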


Attribute-Guided Encryption with Facial Texture Masking

May 22, 2023
Chun Pong Lau, Jiang Liu, Rama Chellappa

Increasingly pervasive facial recognition (FR) systems raise serious concerns about personal privacy, especially for the billions of users who have publicly shared their photos on social media. Several attempts have been made to use adversarial attacks to generate encrypted face images that protect users from being identified by unauthorized FR systems. However, existing methods suffer from poor visual quality or low attack success rates, which limit their usability in practice. In this paper, we propose Attribute Guided Encryption with Facial Texture Masking (AGE-FTM), which performs a dual-manifold adversarial attack on FR systems to achieve both good visual quality and high black-box attack success rates. In particular, AGE-FTM utilizes a high-fidelity generative adversarial network (GAN) to generate natural on-manifold adversarial samples by modifying facial attributes, and performs a facial texture masking attack to generate imperceptible off-manifold adversarial samples. Extensive experiments on the CelebA-HQ dataset demonstrate that our proposed method produces more natural-looking encrypted images than state-of-the-art methods while achieving competitive attack performance. We further evaluate the effectiveness of AGE-FTM in the real world using a commercial FR API and validate its usefulness in practice through a user study.
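
A toy sketch of a dual-manifold attack is given below, purely to illustrate the structure: an on-manifold step edits GAN attribute codes, while an off-manifold step adds a small pixel perturbation confined to high-texture regions via a mask. `gan`, `fr_model`, and `texture_mask` are hypothetical stand-ins; AGE-FTM's actual losses and masking are more involved.

```python
# Minimal sketch: joint on-manifold (attribute code) and off-manifold (masked pixel) attack.
import torch

def dual_manifold_attack(gan, fr_model, attr_code, texture_mask,
                         steps=30, lr=0.01, eps=8 / 255):
    with torch.no_grad():
        img0 = gan(attr_code)                            # unprotected face
        own_emb = fr_model(img0)                         # identity embedding to move away from
    code = attr_code.clone().requires_grad_(True)        # on-manifold: GAN attribute code
    delta = torch.zeros_like(img0, requires_grad=True)   # off-manifold pixel perturbation
    opt = torch.optim.Adam([code, delta], lr=lr)
    for _ in range(steps):
        img = gan(code) + delta * texture_mask           # perturb only high-texture regions
        loss = torch.cosine_similarity(fr_model(img), own_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                      # keep the off-manifold part imperceptible
    with torch.no_grad():
        return gan(code) + delta * texture_mask
```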


The 7th AI City Challenge

Apr 15, 2023
Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Yue Yao, Liang Zheng, Mohammed Shaiqur Rahman, Meenakshi S. Arya, Anuj Sharma, Qi Feng, Vitaly Ablavsky, Stan Sclaroff, Pranamesh Chakraborty, Sanjita Prajapati, Alice Li, Shangru Li, Krishna Kunadharaju, Shenxin Jiang, Rama Chellappa

The AI City Challenge's seventh edition emphasizes two domains at the intersection of computer vision and artificial intelligence, retail business and Intelligent Traffic Systems (ITS), that have considerable untapped potential. The 2023 challenge had five tracks, which drew a record-breaking number of participation requests from 508 teams across 46 countries. Track 1 was a brand-new track focused on multi-target multi-camera (MTMC) people tracking, where teams trained and evaluated using both real and highly realistic synthetic data. Track 2 centered on natural-language-based vehicle track retrieval. Track 3 required teams to classify driver actions in naturalistic driving analysis. Track 4 aimed to develop an automated checkout system for retail stores using a single-view camera. Track 5, another new addition, tasked teams with detecting helmet-rule violations by motorcyclists. Two leaderboards were released for submissions: a public leaderboard for the contest, where external private data was not allowed, and a general leaderboard for all submitted results. The participating teams' top performances established strong baselines and even outperformed the state of the art in the proposed challenge tracks.

* Summary of the 7th AI City Challenge Workshop in conjunction with CVPR 2023 

MOST: Multiple Object localization with Self-supervised Transformers for object discovery

Apr 11, 2023
Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, Abhinav Shrivastava

We tackle the challenging task of unsupervised object localization in this work. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. In this work, we present Multiple Object localization with Self-supervised Transformers (MOST), which uses features of transformers trained with self-supervised learning to localize multiple objects in real-world images. MOST analyzes the similarity maps of the features using box counting, a fractal analysis tool, to identify tokens lying on foreground patches. The identified tokens are then clustered together, and the tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art object localization methods, MOST can localize multiple objects per image and outperforms SOTA algorithms on several object localization and discovery benchmarks on the PASCAL-VOC 07, PASCAL-VOC 12, and COCO20k datasets. Additionally, we show that MOST can be used for self-supervised pre-training of object detectors and yields consistent improvements on fully and semi-supervised object detection and unsupervised region proposal generation.
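
To make the pipeline concrete, here is an illustrative heuristic along the same lines: each token's binarized similarity map is scored by box occupancy across scales, compact maps are kept as foreground tokens, and the kept tokens are clustered into bounding boxes (DBSCAN here). The thresholds, scoring rule, and clustering choice are assumptions for illustration, not MOST's exact procedure.

```python
# Minimal sketch: box-counting-style token selection followed by clustering into boxes.
import numpy as np
from sklearn.cluster import DBSCAN

def box_occupancy(mask, scales=(1, 2, 4)):
    """Mean fraction of occupied boxes across scales for a binary (H, W) map."""
    h, w = mask.shape
    fracs = []
    for s in scales:
        grid = mask[: h // s * s, : w // s * s].reshape(h // s, s, w // s, s)
        fracs.append(grid.any(axis=(1, 3)).sum() / ((h // s) * (w // s)))
    return float(np.mean(fracs))

def boxes_from_tokens(features, grid_hw, sim_thresh=0.5, occ_thresh=0.3, eps=1.5):
    """features: (N, D) L2-normalized ViT token features laid out on an (h, w) grid."""
    h, w = grid_hw
    sim = features @ features.T                           # each row is one token's similarity map
    keep = []
    for i in range(len(features)):
        m = sim[i].reshape(h, w) > sim_thresh
        # keep tokens whose binarized map is non-empty but spatially compact
        if m.any() and box_occupancy(m) < occ_thresh:
            keep.append(i)
    if not keep:
        return []
    coords = np.array([(i // w, i % w) for i in keep])    # (row, col) of kept tokens
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(coords)
    boxes = []
    for k in set(labels) - {-1}:                          # -1 marks DBSCAN noise
        ys, xs = coords[labels == k].T
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x1, y1, x2, y2) in token units
    return boxes
```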


HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions

Apr 01, 2023
Anshul Shah, Aniket Roy, Ketul Shah, Shlok Kumar Mishra, David Jacobs, Anoop Cherian, Rama Chellappa

Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to the data augmentations used to craft the positives. However, augmenting pose sequences is a difficult task, as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP, that Hallucinates Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with explicit control over their hardness. We propose approximations to the objective, making them solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code will be made available at https://github.com/anshulbshah/HaLP.
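
The sketch below shows one simple closed-form way to hallucinate a latent positive with explicit hardness control: spherically interpolate the anchor embedding toward a prototype so the generated positive has a chosen cosine similarity to the anchor. This slerp construction is an illustrative stand-in in the spirit of the approach, not the paper's exact formulation.

```python
# Minimal sketch: generate a latent positive at a target similarity to the anchor.
import torch
import torch.nn.functional as F

def hallucinate_positive(anchor, prototype, target_sim=0.8):
    """anchor, prototype: (B, D) L2-normalized embeddings; returns (B, D) positives."""
    cos_theta = (anchor * prototype).sum(-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
    theta = torch.arccos(cos_theta)                       # angle from anchor to prototype
    # choose the interpolation step so the positive sits at the desired similarity (hardness)
    t = torch.arccos(torch.tensor(target_sim)).to(anchor) / theta
    t = t.clamp(0.0, 1.0)                                 # never overshoot the prototype
    sin_theta = torch.sin(theta)
    pos = (torch.sin((1 - t) * theta) * anchor + torch.sin(t * theta) * prototype) / sin_theta
    return F.normalize(pos, dim=-1)

# Usage: plug the hallucinated positives into a standard contrastive (e.g. InfoNCE) loss.
anchor = F.normalize(torch.randn(4, 128), dim=-1)
proto = F.normalize(torch.randn(4, 128), dim=-1)
pos = hallucinate_positive(anchor, proto, target_sim=0.8)
print((anchor * pos).sum(-1))   # about target_sim (higher if the prototype was already closer)
```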

* To be presented at CVPR 2023 

Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances

Mar 17, 2023
Arun V. Reddy, Ketul Shah, William Paul, Rohita Mocharla, Judy Hoffman, Kapil D. Katyal, Dinesh Manocha, Celso M. de Melo, Rama Chellappa

Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds, and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts.

* ICRA 2023. The first two authors contributed equally. Dataset available at: https://github.com/reddyav1/RoCoG-v2 

STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Jan 02, 2023
Anshul Shah, Benjamin Lundell, Harpreet Sawhney, Rama Chellappa

We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We employ self-supervised representation learning via a training strategy that adapts off-the-shelf video features using a temporal module. Training uses self-supervised losses over multiple cues, such as appearance, motion, and pose trajectories extracted from the videos, to learn generalizable representations. Our method then extracts key steps via a tunable algorithm that clusters the representations extracted from procedural videos. We quantitatively evaluate our approach on key step localization and also demonstrate the effectiveness of the extracted representations on related downstream tasks like phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent the procedural tasks.
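
As a rough illustration of the second stage, the sketch below clusters per-frame embeddings, orders clusters by when they typically occur, and returns the frame nearest each cluster center as a key step. The number of clusters and the use of k-means are tunable, illustrative choices rather than the paper's specific algorithm.

```python
# Minimal sketch: key step extraction by clustering frame-level representations.
import numpy as np
from sklearn.cluster import KMeans

def extract_key_steps(features, n_steps=6):
    """features: (T, D) per-frame embeddings of one procedural video -> ordered key-frame indices."""
    km = KMeans(n_clusters=n_steps, n_init=10, random_state=0).fit(features)
    key_frames = []
    for k in range(n_steps):
        idx = np.where(km.labels_ == k)[0]
        center = km.cluster_centers_[k]
        rep = idx[np.argmin(np.linalg.norm(features[idx] - center, axis=1))]
        key_frames.append((int(np.median(idx)), int(rep)))   # (temporal order key, representative frame)
    # order key steps by when their cluster typically occurs in the video
    return [rep for _, rep in sorted(key_frames)]

# Usage with random stand-in features for a 300-frame video.
feats = np.random.randn(300, 512).astype(np.float32)
print(extract_key_steps(feats, n_steps=6))
```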
