Prerana Mukherjee

OCFormer: One-Class Transformer Network for Image Classification

Apr 25, 2022
Prerana Mukherjee, Chandan Kumar Roy, Swalpa Kumar Roy

We propose a novel deep learning framework based on Vision Transformers (ViT) for one-class classification. The core idea is to use zero-centered Gaussian noise as a pseudo-negative class for the latent space representation and to train the network with an optimal loss function. Prior work has devoted considerable effort to learning good representations with a variety of loss functions that ensure both discriminative and compact properties. The proposed one-class Vision Transformer (OCFormer) is evaluated extensively on the CIFAR-10, CIFAR-100, Fashion-MNIST and CelebA eyeglasses datasets, and shows significant improvements over competing CNN-based one-class classifier approaches.
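
The pseudo-negative trick is easy to see in code. Below is a minimal PyTorch sketch of one training step, assuming a generic `encoder` (the ViT backbone in the paper) and a linear `head`; the binary cross-entropy loss here is illustrative, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def one_class_step(encoder, head, images):
    # Embed the real (positive) samples of the single known class.
    z_pos = encoder(images)
    # Zero-centered Gaussian noise acts as the pseudo-negative class.
    z_neg = torch.randn_like(z_pos)
    logits = head(torch.cat([z_pos, z_neg], dim=0)).squeeze(-1)
    labels = torch.cat([torch.ones(len(z_pos)),
                        torch.zeros(len(z_neg))]).to(logits.device)
    # Binary cross-entropy separates real embeddings from the noise.
    return F.binary_cross_entropy_with_logits(logits, labels)
```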


ASOC: Adaptive Self-aware Object Co-localization

Jan 27, 2022
Koteswar Rao Jerripothula, Prerana Mukherjee

The primary goal of this paper is to localize objects jointly in a group of semantically similar images, a problem known as object co-localization. Most related existing works are essentially weakly supervised, relying prominently on weak supervision from the neighboring images. Although weak supervision is beneficial, it is not entirely reliable: the results are quite sensitive to which neighboring images are considered. In this paper, we combine it with a self-awareness cue to mitigate this issue. By self-awareness we refer to a solution derived from the image itself in the form of a saliency cue, which can also be unreliable if applied alone. Nevertheless, combining these two paradigms leads to better co-localization. Specifically, we introduce a dynamic mediator that adaptively strikes a proper balance between the two static solutions to produce an optimal one; hence the name ASOC: Adaptive Self-aware Object Co-localization. Exhaustive experiments on several benchmark datasets validate that weak supervision supplemented with self-awareness outperforms several competing methods.
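
As a rough illustration of the mediator idea, the sketch below fuses a per-image saliency map with a co-saliency map derived from neighboring images, using an adaptive weight computed from their agreement. The fusion rule and the `asoc_fuse` helper are hypothetical stand-ins; the paper's actual mediator may be formulated quite differently.

```python
import numpy as np

def asoc_fuse(saliency, co_map, eps=1e-8):
    """Hypothetical dynamic mediator: weight the neighbor-derived cue by how
    strongly it agrees with the image's own saliency cue."""
    s = saliency / (saliency.max() + eps)       # self-awareness cue
    c = co_map / (co_map.max() + eps)           # weak supervision from neighbors
    # Cosine-style agreement between the two cues, in [0, 1] for non-negative maps.
    agreement = (s * c).sum() / (np.sqrt((s * s).sum() * (c * c).sum()) + eps)
    return agreement * c + (1 - agreement) * s  # trust neighbors more when cues agree
```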

* Published in IEEE ICME 2021. Please cite as: K. R. Jerripothula and P. Mukherjee, "ASOC: Adaptive Self-Aware Object Co-Localization," 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1-6, doi: 10.1109/ICME51207.2021.9428191

MaskMTL: Attribute prediction in masked facial images with deep multitask learning

Jan 11, 2022
Prerana Mukherjee, Vinay Kaushik, Ronak Gupta, Ritika Jha, Daneshwari Kankanwadi, Brejesh Lall

Predicting attributes in landmark-free facial images is a challenging task in itself, and it becomes further complicated when the face is occluded by a mask. Smart access-control gates that perform identity verification, or secure login on personal electronic devices, may use the face as a biometric trait. The Covid-19 pandemic in particular underscores the need for hygienic, contactless identity verification. In such settings masks become unavoidable, and attribute prediction helps in protecting the vulnerable target groups from community spread, or in ensuring social distancing for them in a collaborative environment. We create a masked face dataset by overlaying masks of different shapes, sizes and textures to model the variability introduced by wearing a mask. This paper presents a deep Multi-Task Learning (MTL) approach to jointly estimate various heterogeneous attributes from a single masked facial image. Experimental results on the benchmark UTKFace attribute dataset demonstrate that the proposed approach outperforms other competing techniques. The source code is available at https://github.com/ritikajha/Attribute-prediction-in-masked-facial-images-with-deep-multitask-learning
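
A minimal sketch of the hard-parameter-sharing MTL layout described above: one shared backbone feeding one small head per attribute. The attribute set and dimensions below mirror UTKFace (age, gender, ethnicity) but are placeholders rather than the authors' exact configuration.

```python
import torch.nn as nn

class MaskMTLSketch(nn.Module):
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone                  # shared feature extractor
        self.heads = nn.ModuleDict({
            "age": nn.Linear(feat_dim, 1),        # regression head
            "gender": nn.Linear(feat_dim, 2),     # classification heads
            "ethnicity": nn.Linear(feat_dim, 5),
        })

    def forward(self, x):
        f = self.backbone(x)                      # one forward pass, shared by all tasks
        return {name: head(f) for name, head in self.heads.items()}
```

Training sums (or weights) the per-head losses, so all attributes are learned jointly from the shared features.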

* In Proceedings of 9th International Conference on Pattern Recognition and Machine Intelligence (PReMI 2021), Kolkata, India 

Hater-O-Genius Aggression Classification using Capsule Networks

May 24, 2021
Parth Patwa, Srinivas PYKL, Amitava Das, Prerana Mukherjee, Viswanath Pulabaigari

Contending with hate speech in social media is one of the most challenging social problems of our time. Among the various types of anti-social behavior in social media, aggressive behavior is foremost, causing many social issues and affecting the social lives and mental health of users. In this paper, we propose an end-to-end ensemble-based architecture to automatically identify and classify aggressive tweets into three categories: Covertly Aggressive, Overtly Aggressive, and Non-Aggressive. The architecture is an ensemble of smaller subnetworks that characterize the feature embeddings effectively, and we demonstrate qualitatively that each subnetwork learns unique features. Our best model, an ensemble of Capsule Networks, achieves a 65.2% F1 score on the Facebook test set, a 0.95% gain over the TRAC-2018 winners. The code and model weights are publicly available at https://github.com/parthpatwa/Hater-O-Genius-Aggression-Classification-using-Capsule-Networks.
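
A minimal sketch of the ensembling step, assuming each subnetwork maps a tweet embedding to three logits (Capsule Networks in the paper; any module with that interface fits the sketch). Averaging class probabilities is one common ensembling rule; the paper's exact combination may differ.

```python
import torch

LABELS = ["Covertly Aggressive", "Overtly Aggressive", "Non-Aggressive"]

def ensemble_predict(subnets, embeddings):
    # Each subnet returns (batch, 3) logits; average their softmax probabilities.
    probs = torch.stack([torch.softmax(net(embeddings), dim=-1) for net in subnets])
    return probs.mean(dim=0).argmax(dim=-1)   # index into LABELS
```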

* Accepted at the 17th International Conference on Natural Language Processing (ICON 2020) 

Generating Out of Distribution Adversarial Attack using Latent Space Poisoning

Dec 09, 2020
Ujjwal Upadhyay, Prerana Mukherjee

Traditional adversarial attacks rely on perturbations generated from network gradients, typically obtained by gradient-guided search, to produce an adversarial counterpart of an input. In this paper, we propose a novel mechanism for generating adversarial examples in which the actual image is not corrupted; rather, its latent space representation is used to tamper with the inherent structure of the image while keeping the perceptual quality intact, so that the result still acts as a legitimate data sample. As opposed to gradient-based attacks, latent space poisoning exploits the classifier's assumption that the training data are independent and identically distributed, and tricks it by producing out-of-distribution samples. We train a disentangled variational autoencoder (beta-VAE) to model the data in latent space and then add noise perturbations, drawn from a class-conditioned distribution function, to the latent space under the constraint that the result is misclassified as the target label. Our empirical results on the MNIST, SVHN, and CelebA datasets validate that the generated adversarial examples can easily fool robust l_0, l_2, and l_inf norm classifiers designed using provably robust defense mechanisms.
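
The sketch below captures the optimization loop implied above: perturb the beta-VAE latent code until the decoded image is classified as the target label. The `vae.encode`/`vae.decode` interface and the plain Adam search over an additive perturbation are assumptions; the paper draws the perturbation from a class-conditioned distribution rather than optimizing it freely.

```python
import torch
import torch.nn.functional as F

def latent_poison(vae, classifier, x, target, steps=200, lr=0.05):
    with torch.no_grad():
        mu, _ = vae.encode(x)                 # assumed encoder API: returns (mu, logvar)
    delta = torch.zeros_like(mu, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = vae.decode(mu + delta)        # decode the perturbed latent code
        loss = F.cross_entropy(classifier(x_adv), target)  # push toward target label
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vae.decode(mu + delta).detach()    # out-of-distribution adversarial sample
```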

* Submitted to IEEE SPL 

Attentional networks for music generation

Feb 06, 2020
Gullapalli Keerti, A N Vaishnavi, Prerana Mukherjee, A Sree Vidya, Gattineni Sai Sreenithya, Deeksha Nayab

Realistic music generation has always remained a challenging problem, as generated music may lack structure or rationality. In this work, we propose a deep learning based music generation method to produce old-style music, particularly jazz, with rehashed melodic structures, using a Bi-directional Long Short Term Memory (Bi-LSTM) neural network with attention. Owing to their success in modelling long-term temporal dependencies in sequential data such as video, Bi-LSTMs with attention are a natural, and here early, choice for music generation. Our experiments validate that Bi-LSTMs with attention are able to preserve the richness and technical nuances of the music performed.
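
A minimal PyTorch sketch of the architecture named above: a bidirectional LSTM over a note sequence with additive attention pooling, predicting the next note. Vocabulary size and dimensions are placeholders, not the authors' settings.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, vocab=128, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)     # one attention score per time step
        self.out = nn.Linear(2 * hidden, vocab)  # logits over the next note

    def forward(self, notes):                    # notes: (batch, time) note indices
        h, _ = self.lstm(self.embed(notes))      # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        context = (w * h).sum(dim=1)             # attention-weighted summary
        return self.out(context)
```

Sampling from the predicted distribution and feeding the result back in, step by step, yields a generated melody.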


AnimePose: Multi-person 3D pose estimation and animation

Feb 06, 2020
Laxman Kumarapu, Prerana Mukherjee

3D animation of humans in action is quite challenging, as it conventionally requires a large setup with motion trackers placed all over the person's body to track the movement of every limb. This is time-consuming, and wearing an exoskeleton body suit with motion sensors can be uncomfortable. In this work, we present a simple yet effective solution for generating 3D animation of multiple persons from a 2D video using deep learning. Although 3D human pose estimation has improved significantly in recent years, most prior works perform well only for single-person pose estimation, and multi-person pose estimation remains a challenging problem. We propose a supervised multi-person 3D pose estimation and animation framework, AnimePose, for a given input RGB video sequence. The pipeline consists of four modules: i) person detection and segmentation, ii) depth map estimation, iii) lifting 2D to 3D information for person localization, and iv) person trajectory prediction and human pose tracking. Our system produces results comparable to previous state-of-the-art 3D multi-person pose estimation methods on the publicly available MuCo-3DHP and MuPoTS-3D datasets, and it outperforms previous state-of-the-art human pose tracking methods by a significant margin of 11.7% in MOTA score on the PoseTrack 2018 dataset.
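
The four-stage pipeline reads naturally as function composition. The sketch below wires the stages together, with callables standing in for each module; the names and interfaces are illustrative, not the authors' code.

```python
def animepose_pipeline(frames, detect, estimate_depth, lift_2d_to_3d, track):
    poses_3d = []
    for frame in frames:
        people = detect(frame)                     # i) person detection + segmentation
        depth = estimate_depth(frame)              # ii) monocular depth map
        lifted = [lift_2d_to_3d(p, depth) for p in people]  # iii) 2D -> 3D localization
        poses_3d.append(track(lifted))             # iv) trajectory prediction + pose tracking
    return poses_3d
```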

* arXiv admin note: text overlap with arXiv:1907.11346 by other authors 

Aerial multi-object tracking by detection using deep association networks

Sep 04, 2019
Ajit Jadhav, Prerana Mukherjee, Vinay Kaushik, Brejesh Lall

A lot of research has focused on object detection, and it has achieved significant advances with deep learning techniques in recent years. In spite of this, existing algorithms are usually not optimal for sequences or images captured by drone-based platforms, owing to challenges such as viewpoint change, scale variation, dense object distribution, and occlusion. In this paper, we develop a model for detecting objects in drone images using the VisDrone2019 DET dataset. Using RetinaNet as our base model, we modify the anchor scales to better handle the detection of densely distributed and small objects. We explicitly model channel interdependencies with "Squeeze-and-Excitation" (SE) blocks that adaptively recalibrate channel-wise feature responses, which brings significant improvements in performance at a slight additional computational cost. Using this architecture for object detection, we build a custom DeepSORT tracker on the VisDrone2019 MOT dataset by training a custom deep association network for the algorithm.
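
The SE block mentioned above is compact enough to show in full. This sketch follows the standard Squeeze-and-Excitation formulation (reduction ratio 16, as in the original SE paper); where exactly the blocks sit in the RetinaNet backbone is not shown here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (batch, C, H, W)
        s = x.mean(dim=(2, 3))                      # squeeze: global average pool per channel
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)  # excitation: per-channel weights
        return x * w                                # recalibrate channel responses
```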


Multi-level Attention network using text, audio and video for Depression Prediction

Sep 03, 2019
Anupama Ray, Siddharth Kumar, Rutvik Reddy, Prerana Mukherjee, Ritu Garg

Depression is the leading cause of mental-health illness worldwide. Major depressive disorder (MDD) is a common mental-health disorder that affects people both psychologically and physically and can lead to loss of life. Owing to the lack of diagnostic tests and the subjectivity involved in detecting depression, there is growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge range of possible behavioural variation make the problem even more challenging. This paper presents a novel multi-level attention based network for multi-modal depression prediction that fuses features from the audio, video and text modalities while learning intra- and inter-modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features within each modality for decision making. We perform exhaustive experiments to create separate regression models for the audio, video and text modalities, and construct several fusion models with different configurations to understand the impact of each feature and modality. We outperform the current baseline by 17.52% in terms of root mean squared error.
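
A minimal sketch of the inter-modality half of the fusion: project audio, video and text features to a shared size, score each modality with attention, and regress a severity score from the weighted sum. The feature dimensions are placeholders, and the paper additionally applies attention within each modality, which this sketch omits.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dims={"audio": 128, "video": 512, "text": 768}, shared=256):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared) for m, d in dims.items()})
        self.score = nn.Linear(shared, 1)         # one attention score per modality
        self.regress = nn.Linear(shared, 1)       # depression-severity regressor

    def forward(self, feats):                     # feats: dict of (batch, dim) tensors
        z = torch.stack([self.proj[m](feats[m]) for m in self.proj], dim=1)
        w = torch.softmax(self.score(z), dim=1)   # inter-modality attention weights
        return self.regress((w * z).sum(dim=1)).squeeze(-1)
```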

* In Proceedings of the 9th International Workshop on Audio/Visual Emotion Challenge (AVEC 2019), ACM Multimedia Workshop, Nice, France 

A Light weight and Hybrid Deep Learning Model based Online Signature Verification

Jul 09, 2019
Chandra Sekhar V., Anoushka Doctor, Prerana Mukherjee, Viswanath Pulabaigari

The growing use of deep learning models for various AI-related problems is a result of modern, deeper architectures and the availability of voluminous annotated datasets. Models based on these architectures incur huge training and storage costs, making them inefficient for critical applications such as online signature verification (OSV) and for deployment on resource-constrained devices. As a solution, our contribution in this work is two-fold: 1) an efficient dimensionality reduction technique that reduces the number of features to be considered, and 2) a state-of-the-art CNN-LSTM based hybrid architecture for online signature verification. Thorough experiments on the publicly available MCYT, SUSIG, and SVC datasets confirm that the proposed model achieves better accuracy even with as little as one training sample, and yields state-of-the-art performance across various categories of all three datasets.
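
A minimal sketch of a CNN-LSTM hybrid over a pen-stroke sequence, in the spirit of the architecture named above: 1-D convolutions capture local stroke patterns, an LSTM models their ordering, and a linear head scores genuine vs. forged. The feature count and layer sizes are placeholders, not the paper's configuration.

```python
import torch.nn as nn

class CNNLSTMVerifier(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, strokes):                   # strokes: (batch, time, n_features)
        c = self.conv(strokes.transpose(1, 2)).transpose(1, 2)  # local stroke patterns
        _, (h, _) = self.lstm(c)                  # temporal ordering of the patterns
        return self.head(h[-1]).squeeze(-1)       # logit: genuine vs. forged
```

The small convolutional front-end keeps the parameter count low, which matches the paper's emphasis on lightweight models for resource-constrained devices.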

* Accepted at ICDAR-WML: the 2nd International Workshop on Machine Learning, 2019 