Vinay P. Namboodiri

Robust Explanations for Visual Question Answering

Jan 23, 2020
Badri N. Patro, Shivansh Pate, Vinay P. Namboodiri

In this paper, we propose a method to obtain robust explanations for visual question answering (VQA) that correlate well with the answers. Our model explains the answers obtained through a VQA model by providing visual and textual explanations. The main challenges that we address are: i) answers and textual explanations obtained by current methods are not well correlated, and ii) current methods for visual explanation do not focus on the right location for explaining the answer. We address both challenges with a collaborative correlated module: even if we do not train for noise-based attacks, the enhanced correlation ensures that the right explanation and answer can be generated. We further show that this also aids in improving the generated visual and textual explanations. The use of the correlated module can be thought of as a robust method to verify whether the answer and explanations are coherent. We evaluate this model using the VQA-X dataset. We observe that the proposed method yields better textual and visual justification that supports the decision. We showcase the robustness of the model against a noise-based perturbation attack using corresponding visual and textual explanations. A detailed empirical analysis is shown. Source code for our model: https://github.com/DelTA-Lab-IITK/CCM-WACV

* WACV-2020 (Accepted)

A "Network Pruning Network" Approach to Deep Model Compression

Jan 15, 2020
Vinay Kumar Verma, Pravendra Singh, Vinay P. Namboodiri, Piyush Rai

We present a filter pruning approach for deep model compression, using a multitask network. Our approach is based on learning a pruner network to prune a pre-trained target network. The pruner is essentially a multitask deep neural network with binary outputs that help identify the filters from each layer of the original network that do not contribute significantly to the model and can therefore be pruned. The pruner network has the same architecture as the original network except that it has a multitask/multi-output last layer containing binary-valued outputs (one per filter), which indicate which filters have to be pruned. The pruner's goal is to minimize the number of filters from the original network by assigning zero weights to the corresponding output feature-maps. In contrast to most existing methods, instead of relying on iterative pruning, our approach can prune the original network in one go and, moreover, does not require specifying the degree of pruning for each layer (it can learn it instead). The compressed model produced by our approach is generic and does not need any special hardware/software support. Moreover, augmenting the proposed approach with other methods such as knowledge distillation, quantization, and connection pruning can increase the degree of compression. We show the efficacy of our proposed approach for classification and object detection tasks.

* Accepted in WACV'20
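
The abstract above says the pruner emits one binary output per filter. As a minimal sketch of how such a mask could translate into a smaller layer, the snippet below keeps only the filters whose mask entry is 1. The helper name `prune_filters`, the toy 1x1 "filters", and the mask values are assumptions made for illustration; this is not the authors' implementation and says nothing about how the pruner itself is trained.

```python
from typing import List, Sequence

# One 2-D kernel, standing in for a convolutional filter (illustrative type).
Filter = List[List[float]]


def prune_filters(filters: Sequence[Filter], keep_mask: Sequence[int]) -> List[Filter]:
    """Return only the filters whose mask entry is 1 (hypothetical helper)."""
    if len(filters) != len(keep_mask):
        raise ValueError("need exactly one mask entry per filter")
    return [f for f, keep in zip(filters, keep_mask) if keep == 1]


# Toy layer with four 1x1 "filters" and a made-up mask a pruner might output.
layer = [[[0.5]], [[0.0]], [[-1.2]], [[0.01]]]
mask = [1, 0, 1, 0]
pruned = prune_filters(layer, mask)
print(len(pruned))  # number of filters kept
```

Because the mask has one entry per filter, the per-layer pruning ratio falls out of the mask itself rather than being specified up front, which matches the abstract's point that the degree of pruning need not be set by hand.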

Cooperative Initialization based Deep Neural Network Training

Jan 05, 2020
Pravendra Singh, Munender Varshney, Vinay P. Namboodiri

Researchers have proposed various activation functions. These activation functions help a deep network learn non-linear behavior and have a significant effect on training dynamics and task performance. The performance of these activations also depends on the initial state of the weight parameters, i.e., different initial states lead to differences in the performance of a network. In this paper, we propose a cooperative initialization for training a deep network that uses the ReLU activation function, in order to improve the network's performance. Our approach uses multiple activation functions in the initial few epochs for the update of all sets of weight parameters while training the network. These activation functions cooperate to overcome their individual drawbacks in the update of weight parameters, which in effect learns a better "feature representation" and boosts the network performance later. Cooperative initialization based training also helps in reducing overfitting and improves performance without increasing the number of parameters or the inference (test) time of the final model. Experiments show that our approach outperforms various baselines and, at the same time, performs well over various tasks such as classification and detection. The Top-1 classification accuracy of the model trained using our approach improves by 2.8% for VGG-16 and 2.1% for ResNet-56 on the CIFAR-100 dataset.

* IEEE Winter Conference on Applications of Computer Vision (WACV), 2020
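
The abstract above describes using several activation functions during the first few epochs and ReLU alone afterwards. The sketch below illustrates only that schedule. The choice of ReLU, leaky ReLU, and ELU, the averaging of their outputs, and the names `activations_for_epoch` and `cooperative_activation` are all assumptions; the abstract does not state which activations are used or how they are combined.

```python
import math
from typing import Callable, List


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x > 0 else slope * x


def elu(x: float) -> float:
    return x if x > 0 else math.exp(x) - 1.0


def activations_for_epoch(epoch: int, warmup_epochs: int) -> List[Callable[[float], float]]:
    """Several activations during warm-up, ReLU alone afterwards (assumed schedule)."""
    if epoch < warmup_epochs:
        return [relu, leaky_relu, elu]
    return [relu]


def cooperative_activation(x: float, epoch: int, warmup_epochs: int) -> float:
    """Average the active activations; averaging is an illustrative assumption."""
    acts = activations_for_epoch(epoch, warmup_epochs)
    return sum(act(x) for act in acts) / len(acts)


# During warm-up a negative input still produces a non-zero output;
# after warm-up the unit behaves exactly like ReLU.
print(cooperative_activation(-1.0, epoch=0, warmup_epochs=2))
print(cooperative_activation(-1.0, epoch=5, warmup_epochs=2))
```

After the warm-up epochs only ReLU remains, so a schedule of this kind adds no parameters and no inference cost to the final model, consistent with the claim in the abstract.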

Revisiting Paraphrase Question Generator using Pairwise Discriminator

Jan 04, 2020
Badri N. Patro, Dev Chauhan, Vinod K. Kurmi, Vinay P. Namboodiri

In this paper, we propose a novel method for obtaining sentence-level embeddings; by contrast, the problem of obtaining word-level embeddings is already very well studied. The embeddings are obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder that is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances that are too large. This loss is used in combination with a sequential encoder-decoder network. We also validate our method by evaluating the obtained embeddings on a sentiment analysis task. The proposed method results in semantic embeddings and outperforms the state of the art on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.

* This work is an extension of our COLING-2018 paper arXiv:1806.00807
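
The constraint described in the abstract above (true paraphrase embeddings close, unrelated candidates far) can be illustrated with a generic triplet-style margin loss. This is a standard technique swapped in for illustration, not the loss from the paper, whose exact form the abstract does not give. The function names and the `margin` default are assumptions.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_margin_loss(
    anchor: Sequence[float],
    paraphrase: Sequence[float],
    unrelated: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Zero once the paraphrase is closer to the anchor than the unrelated
    sentence by at least `margin`; positive otherwise."""
    d_pos = euclidean(anchor, paraphrase)
    d_neg = euclidean(anchor, unrelated)
    return max(0.0, d_pos - d_neg + margin)


# Paraphrase far from the anchor, unrelated sentence close: large penalty.
print(pairwise_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
# Paraphrase on top of the anchor, unrelated sentence far away: no penalty.
print(pairwise_margin_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
```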

Deep Exemplar Networks for VQA and VQG

Dec 19, 2019
Badri N. Patro, Vinay P. Namboodiri

In this paper, we consider the problem of solving semantic tasks such as 'Visual Question Answering' (VQA), where one aims to answer questions related to an image, and 'Visual Question Generation' (VQG), where one aims to generate a natural question pertaining to an image. Solutions for VQA and VQG tasks have been proposed using variants of encoder-decoder deep learning based frameworks that have shown impressive performance. Humans, however, often show generalization by relying on exemplar-based approaches. For instance, the work by Tversky and Kahneman suggests that humans use exemplars when making categorizations and decisions. In this work, we propose the incorporation of exemplar-based approaches towards solving these problems. Specifically, we show that an exemplar-based module can be incorporated in almost any of the deep learning architectures proposed in the literature and that the addition of such a block results in improved performance on these tasks. Thus, just as the incorporation of attention is now considered de facto useful for solving these tasks, incorporating exemplars can likewise be considered to improve any proposed architecture for solving them. We provide extensive empirical analysis for the same through various architectures, ablations, and state-of-the-art comparisons.

* This work is an extension of CVPR-2018 accepted paper arXiv:1804.00298 and EMNLP-2018 accepted paper arXiv:1808.03986

Jointly Trained Image and Video Generation using Residual Vectors

Dec 17, 2019
Yatin Dandi, Aniket Das, Soumye Singhal, Vinay P. Namboodiri, Piyush Rai

In this work, we propose a modeling technique for jointly training image and video generation models by simultaneously learning to map latent variables with a fixed prior onto real images and interpolate over images to generate videos. The proposed approach models the variations in representations using residual vectors encoding the change at each time step over a summary vector for the entire video. We utilize the technique to jointly train an image generation model with a fixed prior along with a video generation model lacking constraints such as disentanglement. The joint training enables the image generator to exploit temporal information while the video generation model learns to flexibly share information across frames. Moreover, experimental results verify our approach's compatibility with pre-training on videos or images and training on datasets containing a mixture of both. A comprehensive set of quantitative and qualitative evaluations reveals improvements in sample quality and diversity over both video generation and image generation baselines. We further demonstrate the technique's capability to exploit similarity in features across frames by applying it to a model based on decomposing the video into motion and content. The proposed model allows minor variations in content across frames while maintaining the temporal dependence through latent vectors encoding the pose or motion features.

* Accepted in 2020 Winter Conference on Applications of Computer Vision (WACV '20)
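
The abstract above describes residual vectors that encode the change at each time step over a single summary vector for the whole video. A minimal sketch of that composition, assuming the per-frame latent is simply the element-wise sum of the summary and that frame's residual, is shown below. The additive form, the function name `frame_latents`, and the toy values are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def frame_latents(
    summary: Sequence[float], residuals: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Per-frame latent = video-level summary + that frame's residual (assumed form)."""
    latents = []
    for residual in residuals:
        if len(residual) != len(summary):
            raise ValueError("each residual must match the summary's dimension")
        latents.append([s + r for s, r in zip(summary, residual)])
    return latents


# A two-frame toy video: the first frame equals the summary, the second drifts.
summary = [1.0, 2.0]
residuals = [[0.0, 0.0], [0.5, -1.0]]
print(frame_latents(summary, residuals))
```

With all residuals set to zero every frame shares the same latent, which gives one intuition for how an image model and a video model could share a single latent space under this assumed form.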

Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA

Nov 19, 2019
Badri N. Patro, Anupriy, Vinay P. Namboodiri

In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations obtained through class activation mappings (specifically Grad-CAM), which are meant to explain the performance of various networks, could form a means of supervision. However, as the distributions of attention maps and of Grad-CAMs differ, it would not be suitable to use these directly as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanations from attention maps. Adversarial training of the attention regions, as a two-player game between attention and explanation, serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention, resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in the rank correlation metric on the VQA task. This method can also be combined with recent MCB-based methods and results in consistent improvement. We also provide comparisons with other means of learning distributions, such as those based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses, and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision.

* AAAI-2020 (Accepted)
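
The abstract above reports an improvement in a rank correlation metric between model attention and human attention. The sketch below computes a Spearman rank correlation between two flattened attention maps using only the standard library. It assumes the metric is Spearman's coefficient with average ranks for ties, which the abstract does not specify; the helper names and the toy maps are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation = Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("maps must have the same number of cells")
    return pearson(average_ranks(x), average_ranks(y))


# Toy flattened maps: model attention vs. a made-up human attention map.
model_attention = [0.1, 0.2, 0.7]
human_attention = [0.2, 0.3, 0.5]
print(spearman(model_attention, human_attention))
```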

Can I teach a robot to replicate a line art

Oct 17, 2019
Raghav Brahmadesam Venkataramaiyer, Subham Kumar, Vinay P. Namboodiri

Line art is arguably one of the most fundamental and versatile modes of expression. We propose a pipeline for a robot to look at a grayscale line-art image and redraw it. The key novel elements of our pipeline are: a) we propose a novel task of mimicking line drawings, b) to solve this task we modify the Quick-draw dataset to obtain supervised training data for converting a line drawing into a series of strokes, and c) we propose a multi-stage segmentation and graph interpretation pipeline for solving the problem. The resultant method has also been deployed on a CNC plotter as well as a robotic arm. We have trained several variations of the proposed methods and evaluated them on a dataset obtained from Quick-draw. With the best methods we observe an accuracy of around 98% for this task, which is a significant improvement over the baseline architecture that we adapted. This allows for deployment of the method on robots for replicating line art in a reliable manner. We also show that while rule-based vectorization methods suffice for simple drawings, they fail for more complicated sketches, unlike our method, which generalizes well to more complicated distributions.

* 9 pages, Accepted for the 2020 Winter Conference on Applications of Computer Vision (WACV '20); Supplementary Video: https://youtu.be/nMt5Dw04XhY

Probabilistic framework for solving Visual Dialog

Oct 17, 2019
Badri N. Patro, Anupriy, Vinay P. Namboodiri

In this paper, we propose a probabilistic framework for solving the task of 'Visual Dialog'. Solving this task requires reasoning over and understanding of the visual modality, the language modality, and common sense knowledge. Various architectures have been proposed to solve this task using variants of multi-modal deep learning techniques that combine visual and language representations. However, we believe that it is crucial to understand and analyze the sources of uncertainty in solving this task. Our approach allows for estimating uncertainty and also aids the generation of diverse answers. The proposed approach consists of three parts: a probabilistic representation module that provides representations for the image, question and conversation history; a module that ensures that diverse latent representations for candidate answers are obtained given the probabilistic representations; and an uncertainty representation module that chooses the answer that minimizes uncertainty. We thoroughly evaluate the model with a detailed ablation analysis, comparison with the state of the art, and visualization of the uncertainty, which aids in understanding the method. Using the proposed probabilistic framework, we thus obtain an improved visual dialog system that is also more explainable.
