Gita Sukthankar

Visual Episodic Memory-based Exploration

May 18, 2024

Jack Vice, Natalie Ruiz-Sanchez, Pamela K. Douglas, Gita Sukthankar

Abstract: In humans, intrinsic motivation is an important mechanism for open-ended cognitive development; in robots, it has been shown to be valuable for exploration. An important aspect of human cognitive development is episodic memory, which enables both the recollection of events from the past and the projection of a subjective future. This paper explores the use of visual episodic memory as a source of intrinsic motivation for robotic exploration problems. Using a convolutional recurrent neural network autoencoder, the agent learns an efficient representation for spatiotemporal features such that accurate sequence prediction can only happen once spatiotemporal features have been learned. Structural similarity between ground truth and autoencoder-generated images is used as an intrinsic motivation signal to guide exploration. Our proposed episodic memory model also implicitly accounts for the agent's actions, motivating the robot to seek new interactive experiences rather than just areas that are visually dissimilar. When guiding robotic exploration, our proposed method outperforms the Curiosity-driven Variational Autoencoder (CVAE) at finding dynamic anomalies.

* The International FLAIRS Conference Proceedings. Vol. 36. 2023
* FLAIRS 2023, 7 pages, 11 figures
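
The abstract above uses structural similarity between the observed frame and the autoencoder's prediction as the intrinsic signal. A minimal sketch of that idea follows, assuming a single-window (global) SSIM over flattened grayscale pixels in [0, 1] and assuming the reward is `1 - SSIM`, so that poorly predicted frames score higher. Both choices, and the names `global_ssim` and `intrinsic_reward`, are illustrative assumptions and are not taken from the paper.

```python
def global_ssim(x, y, dynamic_range=1.0):
    """Single-window SSIM over two equal-length lists of pixel intensities."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def intrinsic_reward(observed, predicted):
    """Assumed reward shape: low similarity means a more novel experience."""
    return 1.0 - global_ssim(observed, predicted)


if __name__ == "__main__":
    frame = [0.1, 0.5, 0.9, 0.3]
    print(intrinsic_reward(frame, frame))                 # perfectly predicted frame
    print(intrinsic_reward(frame, [0.9, 0.1, 0.2, 0.8]))  # badly predicted frame
```

A practical implementation would normally compute SSIM over local windows of a 2D image instead of one global window; the global form is used here only to keep the sketch short.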

Smart Sampling: Self-Attention and Bootstrapping for Improved Ensembled Q-Learning

May 14, 2024

Muhammad Junaid Khan, Syed Hammad Ahmed, Gita Sukthankar

Abstract: We present a novel method aimed at enhancing the sample efficiency of ensemble Q-learning. Our proposed approach integrates multi-head self-attention into the ensembled Q networks while bootstrapping the state-action pairs ingested by the ensemble. This not only results in performance improvements over the original REDQ (Chen et al. 2021) and its variant DroQ (Hiraoka et al. 2022), thereby enhancing Q predictions, but also effectively reduces both the average normalized bias and standard deviation of normalized bias within Q-function ensembles. Importantly, our method also performs well even in scenarios with a low update-to-data (UTD) ratio. Notably, the implementation of our proposed method is straightforward, requiring minimal modifications to the base model.

* FLAIRS-37 (2024)
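
The abstract above mentions bootstrapping the state-action pairs fed to the ensemble but does not say how the resampling is done. The sketch below shows only the generic statistical bootstrap (sampling with replacement), so that each ensemble member sees a different resample of the same batch. The function name `bootstrap_batches`, the tuple layout, and the seed handling are hypothetical and not the authors' implementation.

```python
import random


def bootstrap_batches(batch, ensemble_size, seed=0):
    """Return one with-replacement resample of `batch` per ensemble member."""
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    return [rng.choices(batch, k=len(batch)) for _ in range(ensemble_size)]


if __name__ == "__main__":
    # Toy (state, action) pairs; real pairs would come from a replay buffer.
    batch = [("s0", "a0"), ("s1", "a1"), ("s2", "a0"), ("s3", "a1")]
    for member, resample in enumerate(bootstrap_batches(batch, ensemble_size=3)):
        print(member, resample)
```

Because sampling is with replacement, some pairs repeat and others are left out of a given resample, which is what decorrelates the ensemble members' training data.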

Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion

May 09, 2024

Syed Hammad Ahmed, Muhammad Junaid Khan, Gita Sukthankar

Abstract: Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children while being impossible to detect with a unimodal content moderation system. Popular video hosting platforms for children such as YouTube Kids still publish videos which contain audio content that is not conducive to a child's healthy behavioral and physical development. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. To address this, we present an efficient adaptation of CLIP (Contrastive Language-Image Pre-training) that can leverage contextual audio cues for enhanced content moderation. We incorporate 1) the audio modality and 2) prompt learning, while keeping the backbone modules of each modality frozen. We conduct our experiments on a multimodal version of the MOB (Malicious or Benign) dataset in supervised and few-shot settings.

* 8 pages, 3 figures, Accepted at The 37th International FLAIRS Conference

The Potential of Vision-Language Models for Content Moderation of Children's Videos

Dec 06, 2023

Syed Hammad Ahmed, Shengnan Hu, Gita Sukthankar

Abstract: Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation. This can be difficult, as there are many reasons why a video might be inappropriate, beyond violence and obscenity. For example, scammers may attempt to create junk content that is similar to popular educational videos but with no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot settings. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work conducted on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos as they are not well represented in the CLIP training data.

* 5 pages, 1 figure. Accepted at IEEE ICMLA 2023
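
The zero-shot setting described above scores an image (or video frame) against a set of text prompts. A rough sketch of that scoring step follows, assuming the image and prompt embeddings have already been produced by some encoder. The toy vectors, the prompt wording, the `temperature` value, and the function names are all made up for illustration; no CLIP library is called.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.01):
    """Pick the prompt whose embedding is most similar to the image embedding."""
    labels = list(prompt_embeddings)
    logits = [cosine(image_embedding, prompt_embeddings[label]) / temperature
              for label in labels]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


if __name__ == "__main__":
    # Hypothetical prompts with toy 3-d embeddings standing in for encoder output.
    prompts = {
        "a benign children's cartoon": [1.0, 0.0, 0.0],
        "a malicious children's cartoon": [0.0, 1.0, 0.0],
    }
    label, probs = zero_shot_classify([0.9, 0.1, 0.2], prompts)
    print(label, probs)
```

Changing only the prompt text changes the prompt embeddings and therefore the decision, which is why prompt wording and context matter in this setting.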

LAMP: Leveraging Language Prompts for Multi-person Pose Estimation

Jul 26, 2023

Shengnan Hu, Ce Zheng, Zixiang Zhou, Chen Chen, Gita Sukthankar

Abstract: Human-centric visual understanding is an important desideratum for effective human-robot interaction. In order to navigate crowded public places, social robots must be able to interpret the activity of the surrounding humans. This paper addresses one key aspect of human-centric visual understanding, multi-person pose estimation. Achieving good performance on multi-person pose estimation in crowded scenes is difficult due to the challenges of occluded joints and instance separation. In order to tackle these challenges and overcome the limitations of image features in representing invisible body parts, we propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation). By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels, and learn more robust visual representations that are less susceptible to occlusion. This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation, and both instance-level and joint-level prompts are valuable for training. The code is available at https://github.com/shengnanh20/LAMP.

Malicious or Benign? Towards Effective Content Moderation for Children's Videos

May 24, 2023

Syed Hammad Ahmed, Muhammad Junaid Khan, H. M. Umer Qaisar, Gita Sukthankar

Abstract: Online video platforms receive hundreds of hours of uploads every minute, making manual content moderation impossible. Unfortunately, the most vulnerable consumers of malicious video content are children from ages 1-5 whose attention is easily captured by bursts of color and sound. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos, but include scary and disgusting characters, violent motions, loud music, and disturbing noises. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content on their platform, but these videos often go undetected by current content moderation tools that are focused on removing pornographic or copyrighted content. This paper introduces our toolkit Malicious or Benign for promoting research on automated content moderation of children's videos. We present 1) a customizable annotation tool for videos, 2) a new dataset with difficult-to-detect test cases of malicious content, and 3) a benchmark suite of state-of-the-art video classification models.

* The International FLAIRS Conference Proceedings. 36, 1 (May 2023)
* 10 pages, 7 figures, The 36th International FLAIRS Conference

Improving the Generalizability of Collaborative Dialogue Analysis with Multi-Feature Embeddings

Feb 09, 2023

Ayesha Enayet, Gita Sukthankar

Abstract: Conflict prediction in communication is integral to the design of virtual agents that support successful teamwork by providing timely assistance. The aim of our research is to analyze discourse to predict collaboration success. Unfortunately, resource scarcity is a problem that teamwork researchers commonly face since it is hard to gather a large number of training examples. To alleviate this problem, this paper introduces a multi-feature embedding (MFeEmb) that improves the generalizability of conflict prediction models trained on dialogue sequences. MFeEmb leverages textual, structural, and semantic information from the dialogues by incorporating lexical, dialogue acts, and sentiment features. The use of dialogue acts and sentiment features reduces performance loss from natural distribution shifts caused mainly by changes in vocabulary. This paper demonstrates the performance of MFeEmb on domain adaptation problems in which the model is trained on discourse from one task domain and applied to predict team performance in a different domain. The generalizability of MFeEmb is quantified using the similarity measure proposed by Bontonou et al. (2021). Our results show that MFeEmb serves as an excellent domain-agnostic representation for meta-pretraining a few-shot model on collaborative multiparty dialogues.

* To be published in the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)

Through a fair looking-glass: mitigating bias in image datasets

Sep 18, 2022

Amirarsalan Rajabi, Mehdi Yazdani-Jahromi, Ozlem Ozmen Garibay, Gita Sukthankar

Abstract: With the recent growth in computer vision applications, the question of how fair and unbiased they are has yet to be explored. There is abundant evidence that the bias present in training data is reflected in the models, or even amplified. Many previous methods for image dataset de-biasing, including models based on augmenting datasets, are computationally expensive to implement. In this study, we present a fast and effective model to de-bias an image dataset through reconstruction and minimizing the statistical dependence between intended variables. Our architecture includes a U-net to reconstruct images, combined with a pre-trained classifier which penalizes the statistical dependence between the target attribute and the protected attribute. We evaluate our proposed model on the CelebA dataset, compare the results with a state-of-the-art de-biasing method, and show that the model achieves a promising fairness-accuracy combination.

Transformer-based Value Function Decomposition for Cooperative Multi-agent Reinforcement Learning in StarCraft

Aug 15, 2022

Muhammad Junaid Khan, Syed Hammad Ahmed, Gita Sukthankar

Abstract: The StarCraft II Multi-Agent Challenge (SMAC) was created to be a challenging benchmark problem for cooperative multi-agent reinforcement learning (MARL). SMAC focuses exclusively on the problem of StarCraft micromanagement and assumes that each unit is controlled individually by a learning agent that acts independently and only possesses local information; centralized training is assumed to occur with decentralized execution (CTDE). To perform well in SMAC, MARL algorithms must handle the dual problems of multi-agent credit assignment and joint action evaluation. This paper introduces a new architecture, TransMix, a transformer-based joint action-value mixing network which we show to be efficient and scalable compared to other state-of-the-art cooperative MARL solutions. TransMix leverages the ability of transformers to learn a richer mixing function for combining the agents' individual value functions. It achieves comparable performance to previous work on easy SMAC scenarios and outperforms other techniques on hard scenarios, as well as scenarios that are corrupted with Gaussian noise to simulate fog of war.

* AIIDE 2022

Predicting Team Performance with Spatial Temporal Graph Convolutional Networks

Jun 21, 2022

Shengnan Hu, Gita Sukthankar

Abstract: This paper presents a new approach for predicting team performance from the behavioral traces of a set of agents. This spatiotemporal forecasting problem is very relevant to sports analytics challenges such as coaching and opponent modeling. We demonstrate that our proposed model, Spatial Temporal Graph Convolutional Networks (ST-GCN), outperforms other classification techniques at predicting game score from a short segment of player movement and game features. Our proposed architecture uses a graph convolutional network to capture the spatial relationships between team members and Gated Recurrent Units to analyze dynamic motion information. An ablative evaluation was performed to demonstrate the contributions of different aspects of our architecture.

* International Conference on Pattern Recognition (ICPR), 2022
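
The abstract above says a graph convolutional network captures the spatial relationships between team members. One common form of graph convolution is sketched below: add self-loops, symmetrically normalise the adjacency matrix, then propagate features and apply a ReLU. Whether the paper uses exactly this layer is an assumption, and the player graph, features, and weights are toy values.

```python
import math


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(normalised_adjacency @ features @ weights)."""
    propagated = matmul(normalized_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weights)]


if __name__ == "__main__":
    adj = [[0.0, 1.0], [1.0, 0.0]]   # two players linked by one edge
    features = [[1.0], [3.0]]        # one toy feature per player
    weights = [[1.0]]                # identity weight for illustration
    print(gcn_layer(adj, features, weights))
```

In a spatial-temporal model, a layer like this would be applied to each time step's player graph, with the resulting node features passed to a recurrent unit such as a GRU.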
