
Ameya Joshi


PriViT: Vision Transformers for Fast Private Inference

Oct 06, 2023
Naren Dhyani, Jianqiao Mo, Minsu Cho, Ameya Joshi, Siddharth Garg, Brandon Reagen, Chinmay Hegde

The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications. However, ViTs are ill-suited for private inference using secure multi-party computation (MPC) protocols, due to the large number of non-polynomial operations (self-attention, feed-forward rectifiers, layer normalization). We propose PriViT, a gradient-based algorithm that selectively "Taylorizes" nonlinearities in ViTs while maintaining their prediction accuracy. Our algorithm is conceptually simple, easy to implement, and outperforms existing approaches for designing MPC-friendly transformer architectures on the latency-accuracy Pareto frontier. We confirm these improvements via experiments on several standard image classification tasks. Public code is available at https://github.com/NYU-DICE-Lab/privit.

* 18 pages, 14 figures 
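
The core mechanism (selectively replacing non-polynomial activations with low-degree polynomial approximations via learnable, gradient-trained switches) can be sketched roughly as below. The module name, the quadratic surrogate, and the penalty form are illustrative assumptions rather than the paper's exact parameterization; see the linked repository for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaylorizedGELU(nn.Module):
    """Hypothetical sketch of PriViT-style 'Taylorization': a per-unit learnable
    switch that blends exact GELU with a cheap quadratic surrogate."""
    def __init__(self, num_units):
        super().__init__()
        # c near 1 keeps the exact GELU; c near 0 uses the polynomial branch.
        self.c = nn.Parameter(torch.ones(num_units))

    def forward(self, x):
        # Second-order Taylor expansion of GELU about 0: 0.5*x + x^2 / sqrt(2*pi).
        poly = 0.5 * x + 0.3989 * x ** 2      # MPC-friendly polynomial surrogate
        gelu = F.gelu(x)                      # expensive non-polynomial op under MPC
        c = torch.sigmoid(self.c)             # keep the mixing weight in (0, 1)
        return c * gelu + (1.0 - c) * poly

    def penalty(self):
        # L1-style pressure that drives switches toward the polynomial branch,
        # trading a little accuracy for fewer non-polynomial ops (lower latency).
        return torch.sigmoid(self.c).sum()

# Training would then minimize: task_loss + lam * sum(m.penalty() for Taylorized modules).
```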

Distributionally Robust Classification on a Data Budget

Aug 07, 2023
Benjamin Feuer, Ameya Joshi, Minh Pham, Chinmay Hegde

Real-world uses of deep learning require predictable model behavior under distribution shifts. Models such as CLIP show emergent natural distributional robustness comparable to humans, but may require hundreds of millions of training samples. Can we train robust learners in a domain where data is limited? To rigorously address this question, we introduce JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions. We perform a series of carefully controlled investigations of factors contributing to robustness in image classification, and compare those results to findings from a large-scale meta-analysis. Using this approach, we show that a standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples can attain robustness comparable to a CLIP ResNet-50 trained on 400 million samples. To our knowledge, this is the first result showing (near) state-of-the-art distributional robustness on limited data budgets. Our dataset is available at https://huggingface.co/datasets/penfever/JANuS_dataset, and the code used to reproduce our experiments can be found at https://github.com/penfever/vlhub/.

* TMLR 2023; openreview link: https://openreview.net/forum?id=D5Z2E8CNsD 
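
Since JANuS is hosted on the Hugging Face Hub, it should load with the standard datasets API; the snippet below is a sketch under that assumption, and the actual split and column names should be checked against the dataset card.

```python
# Sketch only: assumes JANuS loads via the standard `datasets` API.
from datasets import load_dataset

janus = load_dataset("penfever/JANuS_dataset")  # images, labels, and captions
print(janus)  # inspect the available splits and columns before training
```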

Identity-Preserving Aging of Face Images via Latent Diffusion Models

Jul 17, 2023
Sudipta Banerjee, Govind Mittal, Ameya Joshi, Chinmay Hegde, Nasir Memon

The performance of automated face recognition systems is inevitably impacted by the facial aging process. However, high-quality datasets of individuals collected over several years are typically small in scale. In this work, we propose, train, and validate the use of latent text-to-image diffusion models for synthetically aging and de-aging face images. Our models succeed with few-shot training and have the added benefit of being controllable via intuitive textual prompting. We observe high degrees of visual realism in the generated images while maintaining biometric fidelity as measured by commonly used metrics. We evaluate our method on two benchmark datasets (CelebA and AgeDB) and observe a significant reduction (~44%) in the False Non-Match Rate compared to existing state-of-the-art baselines.

* Accepted to appear in the International Joint Conference on Biometrics (IJCB) 2023 
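
As a rough illustration of the prompt-driven control described above, the sketch below uses the Hugging Face diffusers img2img pipeline as a stand-in; the paper's pipeline additionally involves few-shot fine-tuning on the target identity, which is not shown, and the model checkpoint, prompt, and strength value here are assumptions.

```python
# Rough illustration of prompt-driven aging with a latent diffusion model.
# The few-shot, identity-preserving fine-tuning step from the paper is omitted.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

face = Image.open("subject.png").convert("RGB").resize((512, 512))

# Intuitive textual control: the age descriptor in the prompt steers the edit,
# while a moderate strength keeps the output close to the input face.
aged = pipe(
    prompt="photo of the same person as an elderly adult, aged 70",
    image=face,
    strength=0.4,
    guidance_scale=7.5,
).images[0]
aged.save("subject_aged.png")
```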

Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos

Jun 22, 2023
Md Zahid Hasan, Jiajing Chen, Jiyang Wang, Mohammed Shaiqur Rahman, Ameya Joshi, Senem Velipasalar, Chinmay Hegde, Anuj Sharma, Soumik Sarkar

Recognizing distracting activities in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning such as distracted driving activity recognition. Vision-language pretraining models such as CLIP have shown significant promise in learning natural-language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based fine-tuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance in zero-shot transfer and video-based CLIP prediction of the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of CLIP's visual representation for the distracted driving detection and classification task and report their results.

* 15 pages, 10 figures 
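
The zero-shot part of the idea, scoring a driving frame against natural-language descriptions of driver states, can be sketched as follows; the prompts and checkpoint are illustrative, and the paper's task-based fine-tuning and video-based variants are not covered here.

```python
# Minimal zero-shot sketch: compare one dashcam frame against text prompts with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a photo of a driver paying attention to the road",
    "a photo of a driver texting on a phone",
    "a photo of a driver drinking from a bottle",
    "a photo of a driver reaching behind the seat",
]
frame = Image.open("dashcam_frame.jpg")

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity of the frame to each prompt
probs = logits.softmax(dim=-1)
print(prompts[probs.argmax().item()])          # predicted driver state
```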

ZeroForge: Feedforward Text-to-Shape Without 3D Supervision

Jun 16, 2023
Kelly O. Marshall, Minh Pham, Ameya Joshi, Anushrut Jignasu, Aditya Balu, Adarsh Krishnamurthy, Chinmay Hegde

Current state-of-the-art methods for text-to-shape generation either require supervised training on a labeled dataset of pre-defined 3D shapes, or perform expensive inference-time optimization of implicit neural representations. In this work, we present ZeroForge, an approach for zero-shot text-to-shape generation that avoids both pitfalls. To achieve open-vocabulary shape generation, we require careful architectural adaptation of existing feed-forward approaches, as well as a combination of a data-free CLIP loss and contrastive losses to avoid mode collapse. Using these techniques, we are able to considerably expand the generative ability of existing feed-forward text-to-shape models such as CLIP-Forge. We support our method via extensive qualitative and quantitative evaluations.

* 19 pages; high-resolution figures needed to demonstrate 3D results 
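
A loose sketch of the data-free CLIP-guided objective is given below: render the generated shape, embed the rendering with CLIP, and pull it toward the prompt's text embedding, with an InfoNCE-style contrastive term across the prompts in a batch to discourage mode collapse. The `generator` and `differentiable_render` callables are hypothetical placeholders, and the exact weighting of the losses in the paper may differ.

```python
import torch
import torch.nn.functional as F

def zeroforge_style_loss(generator, differentiable_render, clip_model, text_tokens, tau=0.07):
    """Sketch of a data-free CLIP + contrastive objective.

    Assumes `clip_model` is an OpenAI-CLIP-style model exposing
    encode_image / encode_text, and that the renderer is differentiable.
    """
    shapes = generator(text_tokens)             # batch of generated 3D shapes
    images = differentiable_render(shapes)      # batch of rendered views

    img = F.normalize(clip_model.encode_image(images), dim=-1)
    txt = F.normalize(clip_model.encode_text(text_tokens), dim=-1)

    sims = img @ txt.t() / tau                  # pairwise image-text similarities
    targets = torch.arange(sims.size(0), device=sims.device)

    # Diagonal terms align each shape with its own prompt (CLIP loss); the
    # off-diagonal terms supply the contrastive pressure that keeps different
    # prompts from collapsing onto the same shape.
    return F.cross_entropy(sims, targets)
```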

Caption supervision enables robust learners

Oct 13, 2022
Benjamin Feuer, Ameya Joshi, Chinmay Hegde

Vision-language models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision: the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that CNNs trained with a standard cross-entropy loss can also benefit from caption supervision, in some cases even more than VL models, on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset with over 50,000 new human-labeled, ImageNet-compliant samples alongside web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision. We also provide the codebase necessary to reproduce our experiments at https://github.com/penfever/vlhub/.
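
One concrete reading of caption supervision for a cross-entropy-trained CNN is sketched below: derive each image's label by matching its web caption against the class vocabulary, then train as usual. The matching rule is a deliberately naive stand-in; the paper studies several filtration and supervision strategies that this sketch does not reproduce.

```python
# Illustrative caption-to-label mapping for cross-entropy training.
def caption_to_label(caption: str, class_names: list[str]) -> int | None:
    text = caption.lower()
    for idx, name in enumerate(class_names):
        if name.lower() in text:        # naive substring match against class names
            return idx
    return None                         # unmatched captions are simply dropped

class_names = ["golden retriever", "tabby cat", "school bus"]
print(caption_to_label("My golden retriever at the beach", class_names))  # -> 0
```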


Revisiting Self-Distillation

Jun 17, 2022
Minh Pham, Minsu Cho, Ameya Joshi, Chinmay Hegde

Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), and is often used for model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. We first show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Secondly, we revisit existing theoretical explanations of (self-)distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
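
For concreteness, a minimal sketch of one self-distillation round is shown below: a student with the same architecture as the teacher is trained on hard labels plus a temperature-softened KL term toward the frozen teacher. The temperature and mixing weight are illustrative defaults, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard-label cross-entropy plus temperature-softened KL toward the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # rescale so gradients match the CE term's scale
    return alpha * ce + (1.0 - alpha) * kd

# Inside the training loop (teacher frozen, same architecture as the student):
#   with torch.no_grad(): t_logits = teacher(x)
#   loss = self_distillation_loss(student(x), t_logits, y)
```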


A Meta-Analysis of Distributionally-Robust Models

Jun 15, 2022
Benjamin Feuer, Ameya Joshi, Chinmay Hegde

State-of-the-art image classifiers trained on massive datasets (such as ImageNet) have been shown to be vulnerable to a range of both intentional and incidental distribution shifts. On the other hand, several recent classifiers with favorable out-of-distribution (OOD) robustness properties have emerged, maintaining high accuracy on challenging OOD benchmarks while preserving their in-distribution accuracy on their target tasks. We present a meta-analysis of a wide range of publicly released models, most of which have been published over the last twelve months. Through this meta-analysis, we empirically identify four main commonalities shared by the best-performing OOD-robust models, all of which illuminate the considerable promise of vision-language pre-training.

* To be presented at ICML Workshop on Principles of Distribution Shift 2022. Copyright 2022 by the author(s) 