Maitreya Patel

WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models

Jun 07, 2023
Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang

The rapid advancement of generative models, facilitating the creation of hyper-realistic images from textual descriptions, has concurrently escalated critical societal concerns such as misinformation. Traditional fake detection mechanisms, although providing some mitigation, fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user's unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach, incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates near-perfect attribution accuracy with a minimal impact on output quality. We rigorously scrutinize our method's secrecy under two distinct scenarios: one where a malicious user attempts to detect the fingerprint, and another where a user possesses a comprehensive understanding of our method. We also evaluate the robustness of our approach against various image post-processing manipulations typically executed by end-users. Through extensive evaluation of the Stable Diffusion models, our method presents a promising and novel avenue for accountable model distribution and responsible use.
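
As a rough illustration of the idea (not the authors' released implementation), the sketch below conditions a convolution's weights on a user-specific binary fingerprint through a small mapping network; the module name, fingerprint length, and layer sizes are illustrative assumptions.

```python
# Hedged sketch of fingerprint-conditioned weight modulation (illustrative only):
# a mapping network turns a user's binary fingerprint into per-output-channel
# scales that modulate the convolution weights of a generator layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FingerprintModulatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, fingerprint_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # mapping network: binary fingerprint -> per-output-channel scale
        self.mapper = nn.Sequential(nn.Linear(fingerprint_dim, out_ch), nn.Sigmoid())

    def forward(self, x, fingerprint):
        scale = self.mapper(fingerprint).view(-1, 1, 1, 1)  # (out_ch, 1, 1, 1)
        weight = self.conv.weight * scale                   # modulated weights
        return F.conv2d(x, weight, self.conv.bias, padding=1)

# toy usage: one user's 48-bit fingerprint modulates a feature map
layer = FingerprintModulatedConv(in_ch=4, out_ch=8, fingerprint_dim=48)
fingerprint = torch.randint(0, 2, (48,)).float()
features = torch.randn(1, 4, 16, 16)
print(layer(features, fingerprint).shape)  # torch.Size([1, 8, 16, 16])
```

Attribution then amounts to recovering the fingerprint from a generated image, which the paper reports can be done with near-perfect accuracy.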


ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models

Jun 07, 2023
Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang

The ability to understand visual concepts and to replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition, realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models to learn and synthesize novel visual concepts, we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts, 5K unique concept compositions, and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in ground-truth images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning concepts and preserving compositionality that existing approaches struggle to overcome.

* Project page: https://conceptbed.github.io 
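
The following is a hedged sketch of how a CCD-style score could be computed (the exact formulation is in the paper and on the project page): an oracle concept classifier scores both ground-truth and generated images, and the deviation between the two confidence levels serves as the concept-alignment error. The function name, toy oracle, and image sizes are illustrative only.

```python
# Hedged sketch of a CCD-style concept-alignment score (illustrative only).
import torch
import torch.nn as nn

def concept_confidence_deviation(oracle, real_images, generated_images, concept_idx):
    """Lower is better: generated images whose oracle confidence matches the
    confidence observed on real images deviate less from the learned concept."""
    with torch.no_grad():
        p_real = oracle(real_images).softmax(dim=-1)[:, concept_idx]
        p_gen = oracle(generated_images).softmax(dim=-1)[:, concept_idx]
    return (p_real.mean() - p_gen.mean()).abs().item()

# toy usage: a random "oracle" over 284 concepts and flattened 8x8 grayscale images
oracle = nn.Sequential(nn.Flatten(), nn.Linear(64, 284))
real = torch.rand(16, 1, 8, 8)
generated = torch.rand(16, 1, 8, 8)
print(concept_confidence_deviation(oracle, real, generated, concept_idx=3))
```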

CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering

Nov 07, 2022
Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang

Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene. CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about the effect of actions, questions about planning in order to reach a goal, and descriptive questions about visible properties of objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings -- videos with objects with masses, coefficients of friction, and initial velocities that are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this paper) and explicit properties of objects (the focus of prior work).

* Accepted to EMNLP 2022; https://maitreyapatel.com/CRIPP-VQA/ 
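
A hypothetical record layout for a CRIPP-VQA-style example is sketched below; the field names and values are illustrative, not the dataset's actual schema, but they reflect the three question families (descriptive, counterfactual, planning) and the out-of-distribution test settings described above.

```python
# Illustrative (hypothetical) record layout for a CRIPP-VQA-style example.
from dataclasses import dataclass

@dataclass
class CrippVqaExample:
    video_id: str
    question_type: str        # "descriptive" | "counterfactual" | "planning"
    question: str
    answer: str
    ood_split: str = "iid"    # test-time shifts in mass, friction, or initial velocity

example = CrippVqaExample(
    video_id="video_00042",
    question_type="counterfactual",
    question="If the red cube were removed, would the blue cylinder still be hit?",
    answer="no",
)
print(example)
```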

Reasoning about Actions over Visual and Linguistic Modalities: A Survey

Jul 15, 2022
Shailaja Keyur Sampat, Maitreya Patel, Subhasish Das, Yezhou Yang, Chitta Baral

'Actions' play a vital role in how humans interact with the world and enable them to achieve desired goals. As a result, most common sense (CS) knowledge for humans revolves around actions. While 'Reasoning about Actions & Change' (RAC) has been widely studied in the Knowledge Representation community, it has recently piqued the interest of NLP and computer vision researchers. This paper surveys existing tasks, benchmark datasets, various techniques and models, and their respective performance concerning advancements in RAC in the vision and language domain. Towards the end, we summarize our key takeaways, discuss the present challenges facing this research area, and outline potential directions for future research.

* 7 pages, 3 figures; This survey will be periodically updated with the latest works in this area 

Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

Apr 16, 2022
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Hannaneh Hajishirzi, Noah A. Smith, Daniel Khashabi

How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress toward this goal, we introduce Natural-Instructions v2, a collection of 1,600+ diverse language tasks and their expert-written instructions. More importantly, the benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting. The benchmark was collected with contributions from NLP practitioners in the community and refined through an iterative peer-review process to ensure quality. It enables large-scale evaluation of the cross-task generalization of models -- training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we are able to rigorously quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances, and model size. As a by-product of these experiments, we introduce Tk-Instruct, an encoder-decoder Transformer trained to follow a variety of in-context instructions (plain-language task definitions or k-shot examples), which outperforms existing, larger models on our benchmark. We hope this benchmark facilitates future progress toward more general-purpose language understanding models.

* 16 pages, 9 figures 
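
A minimal sketch of how such an in-context instruction input can be assembled (this is not the official Tk-Instruct preprocessing code): the plain-language task definition and k positive demonstrations are concatenated with the evaluation instance into a single string that an encoder-decoder model consumes. The function name, field labels, and separators are assumptions.

```python
# Hedged sketch of assembling an instruction-following input (illustrative only).
def build_instruction_input(definition, examples, instance, k=2):
    parts = [f"Definition: {definition}"]
    for inp, out in examples[:k]:                 # k-shot positive demonstrations
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {instance}\nOutput:")   # the evaluation instance
    return "\n\n".join(parts)

prompt = build_instruction_input(
    definition="Given a sentence, label its sentiment as positive or negative.",
    examples=[("I loved this film.", "positive"), ("The plot was dull.", "negative")],
    instance="The acting felt wooden.",
)
print(prompt)
```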

CinC-GAN for Effective F0 prediction for Whisper-to-Normal Speech Conversion

Aug 18, 2020
Maitreya Patel, Mirali Purohit, Jui Shah, Hemant A. Patil

Recently, Generative Adversarial Network (GAN)-based methods have shown remarkable performance for voice conversion and WHiSPer-to-normal SPeeCH (WHSP2SPCH) conversion. One of the key challenges in WHSP2SPCH conversion is the prediction of the fundamental frequency (F0). A recent state-of-the-art method based on Cycle-Consistent Generative Adversarial Networks (CycleGAN) for WHSP2SPCH conversion uses two separate models, one for Mel Cepstral Coefficient (MCC) mapping and another for F0 prediction, where the F0 prediction is highly dependent on the pre-trained MCC mapping model. This introduces additional non-linear noise into the predicted F0. To suppress this noise, we propose the Cycle-in-Cycle GAN (CinC-GAN), which is specifically designed to improve F0 prediction without losing accuracy in MCC mapping. We evaluate the proposed method in a non-parallel setting and analyze it on speaker-specific and gender-specific tasks. Objective and subjective tests show that CinC-GAN significantly outperforms the CycleGAN baseline. In addition, an analysis of CycleGAN and CinC-GAN on unseen speakers shows the clear superiority of CinC-GAN.

* Accepted in 28th European Signal Processing Conference (EUSIPCO), 2020 
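
A conceptual sketch of a cycle-in-cycle objective is given below; it is not the paper's exact formulation, and the generator names and feature dimensions are illustrative. The point it shows is that F0 is predicted from the converted MCC inside the same training objective, rather than from a frozen, separately trained MCC mapping model.

```python
# Hedged sketch of a cycle-in-cycle objective (illustrative, not the paper's exact losses).
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cinc_cycle_losses(G_w2n, G_n2w, G_f0, mcc_whisper, f0_normal):
    mcc_fake_normal = G_w2n(mcc_whisper)       # outer cycle: whisper MCC -> normal MCC
    mcc_rec_whisper = G_n2w(mcc_fake_normal)   # ...and back to whisper MCC
    outer_cycle_loss = l1(mcc_rec_whisper, mcc_whisper)
    f0_pred = G_f0(mcc_fake_normal)            # inner cycle: F0 predicted from converted MCC
    inner_f0_loss = l1(f0_pred, f0_normal)
    return outer_cycle_loss, inner_f0_loss

# toy usage with linear stand-ins for the generators (40-dim MCC frames, 1-dim F0)
G_w2n, G_n2w, G_f0 = nn.Linear(40, 40), nn.Linear(40, 40), nn.Linear(40, 1)
mcc_w, f0_n = torch.randn(8, 40), torch.randn(8, 1)
print(cinc_cycle_losses(G_w2n, G_n2w, G_f0, mcc_w, f0_n))
```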

Precipitation Nowcasting: Leveraging bidirectional LSTM and 1D CNN

Oct 24, 2018
Maitreya Patel, Anery Patel, Dr. Ranendu Ghosh

Short-term rainfall forecasting, also known as precipitation nowcasting, has become a fundamental technology for significant real-world applications ranging from flight safety and rainstorm alerts to farm irrigation timing. Since weather forecasting involves identifying the underlying structure in a huge amount of data, deep-learning-based precipitation nowcasting has outperformed traditional linear extrapolation methods. Our work applies recent advances in deep learning to nowcasting, a multi-variable time-series forecasting problem. Specifically, we leverage a bidirectional LSTM (Long Short-Term Memory) neural network architecture, which captures temporal features and long-term dependencies from historical data. We further compare the bidirectional LSTM network with a 1D CNN model to demonstrate the advantages of sequence models over feed-forward neural architectures for forecasting problems.
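
As a minimal sketch of the kind of sequence model described above (the paper does not specify this exact architecture), a bidirectional LSTM regressor can map a window of past weather variables to a next-step rainfall estimate; the hidden size, feature count, and window length below are placeholder values.

```python
# Hedged sketch of a bidirectional LSTM nowcasting regressor (illustrative only).
import torch
import torch.nn as nn

class BiLSTMNowcaster(nn.Module):
    def __init__(self, n_features=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # concatenation of both directions

    def forward(self, x):                      # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])           # next-step rainfall estimate

model = BiLSTMNowcaster()
history = torch.randn(4, 24, 6)                # 4 samples, 24 time steps, 6 weather variables
print(model(history).shape)                    # torch.Size([4, 1])
```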
