Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation, because there are many reasons why a video might be inappropriate beyond violence and obscenity. For example, scammers may attempt to create junk content that is similar to popular educational videos but with no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot settings. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work conducted on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos, as they are not well represented in the CLIP training data.
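As a minimal sketch of the zero-shot setting described above, the following example assigns a label by comparing a video embedding against one text-prompt embedding per class using cosine similarity. The three-dimensional vectors, the function names, and the single prompt per class are assumptions made for illustration; in a CLIP-style model the embeddings would come from the image and text encoders, and the abstract does not specify the prompts or the scoring rule used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(video_embedding, prompt_embeddings):
    """Return the label whose prompt embedding is most similar to the video,
    together with the similarity score for every label."""
    scores = {
        label: cosine(video_embedding, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy 3-d vectors standing in for encoder outputs (hypothetical values).
prompts = {
    "benign": [0.9, 0.1, 0.0],
    "malicious": [0.1, 0.9, 0.2],
}
video = [0.2, 0.8, 0.1]

label, scores = zero_shot_label(video, prompts)
print(label, scores)
```

Under this scheme, adding context to a prompt changes only its text embedding, which is why prompt wording can shift the predicted label without any retraining.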
Human-centric visual understanding is an important desideratum for effective human-robot interaction. In order to navigate crowded public places, social robots must be able to interpret the activity of the surrounding humans. This paper addresses one key aspect of human-centric visual understanding, multi-person pose estimation. Achieving good performance on multi-person pose estimation in crowded scenes is difficult due to the challenges of occluded joints and instance separation. In order to tackle these challenges and overcome the limitations of image features in representing invisible body parts, we propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation). By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels, and learn more robust visual representations that are less susceptible to occlusion. This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation, and both instance-level and joint-level prompts are valuable for training. The code is available at https://github.com/shengnanh20/LAMP.
Online video platforms receive hundreds of hours of uploads every minute, making manual content moderation impossible. Unfortunately, the most vulnerable consumers of malicious video content are children aged 1-5, whose attention is easily captured by bursts of color and sound. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos, but include scary and disgusting characters, violent motions, loud music, and disturbing noises. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content on their platforms, but these videos often go undetected by current content moderation tools that are focused on removing pornographic or copyrighted content. This paper introduces our toolkit, Malicious or Benign, for promoting research on automated content moderation of children's videos. We present 1) a customizable annotation tool for videos, 2) a new dataset with difficult-to-detect test cases of malicious content, and 3) a benchmark suite of state-of-the-art video classification models.
Conflict prediction in communication is integral to the design of virtual agents that support successful teamwork by providing timely assistance. The aim of our research is to analyze discourse to predict collaboration success. Unfortunately, resource scarcity is a problem that teamwork researchers commonly face, since it is hard to gather a large number of training examples. To alleviate this problem, this paper introduces a multi-feature embedding (MFeEmb) that improves the generalizability of conflict prediction models trained on dialogue sequences. MFeEmb leverages textual, structural, and semantic information from the dialogues by incorporating lexical, dialogue act, and sentiment features. The use of dialogue act and sentiment features reduces performance loss from natural distribution shifts caused mainly by changes in vocabulary. This paper demonstrates the performance of MFeEmb on domain adaptation problems in which the model is trained on discourse from one task domain and applied to predict team performance in a different domain. The generalizability of MFeEmb is quantified using the similarity measure proposed by Bontonou et al. (2021). Our results show that MFeEmb serves as an excellent domain-agnostic representation for meta-pretraining a few-shot model on collaborative multiparty dialogues.
With the recent growth in computer vision applications, the question of how fair and unbiased they are has yet to be explored. There is abundant evidence that the bias present in training data is reflected in the models, or even amplified. Many previous methods for image dataset de-biasing, including models based on augmenting datasets, are computationally expensive to implement. In this study, we present a fast and effective model to de-bias an image dataset through reconstruction and minimizing the statistical dependence between intended variables. Our architecture includes a U-net to reconstruct images, combined with a pre-trained classifier that penalizes the statistical dependence between the target attribute and the protected attribute. We evaluate our proposed model on the CelebA dataset, compare the results with a state-of-the-art de-biasing method, and show that the model achieves a promising fairness-accuracy combination.
The StarCraft II Multi-Agent Challenge (SMAC) was created to be a challenging benchmark problem for cooperative multi-agent reinforcement learning (MARL). SMAC focuses exclusively on the problem of StarCraft micromanagement and assumes that each unit is controlled individually by a learning agent that acts independently and only possesses local information; centralized training is assumed to occur with decentralized execution (CTDE). To perform well in SMAC, MARL algorithms must handle the dual problems of multi-agent credit assignment and joint action evaluation. This paper introduces a new architecture, TransMix, a transformer-based joint action-value mixing network, which we show to be efficient and scalable compared to other state-of-the-art cooperative MARL solutions. TransMix leverages the ability of transformers to learn a richer mixing function for combining the agents' individual value functions. It achieves comparable performance to previous work on easy SMAC scenarios and outperforms other techniques on hard scenarios, as well as on scenarios that are corrupted with Gaussian noise to simulate fog of war.
This paper presents a new approach for predicting team performance from the behavioral traces of a set of agents. This spatiotemporal forecasting problem is very relevant to sports analytics challenges such as coaching and opponent modeling. We demonstrate that our proposed model, Spatial Temporal Graph Convolutional Networks (ST-GCN), outperforms other classification techniques at predicting game score from a short segment of player movement and game features. Our proposed architecture uses a graph convolutional network to capture the spatial relationships between team members and Gated Recurrent Units to analyze dynamic motion information. An ablative evaluation was performed to demonstrate the contributions of different aspects of our architecture.
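The spatial component described above can be illustrated with a single graph convolution step, sketched here under stated assumptions: players are graph nodes, an adjacency matrix marks which players are linked, and each node's features are averaged with its neighbours' before a linear map and ReLU are applied. The mean-aggregation normalization, the helper names, and the toy positions are assumptions for illustration; the abstract does not specify the layer formulation, and the recurrent (GRU) part of the architecture is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adjacency, features, weights):
    """One graph convolution: average each node with its neighbours,
    then apply a linear map followed by ReLU."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Row-normalise so each row sums to 1 (mean aggregation).
    a_norm = [[value / sum(row) for value in row] for row in a_hat]
    mixed = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in mixed]


# Three players on a line graph 0 - 1 - 2 with (x, y) positions as features.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
positions = [[0.0, 0.0], [2.0, 0.0], [4.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights

out = graph_conv(adjacency, positions, identity)
print(out)
```

In a full model of this kind, the per-frame node outputs would be fed into a recurrent unit to capture how the spatial configuration changes over time.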
Optimizing gait stability for legged robots is a difficult problem. Even on level surfaces, effectively traversing different textures (e.g., carpet) rests on dynamically tuning parameters in a multidimensional space. Inspired by biology, evolutionary algorithms (EA) remain an attractive solution for feasibly implementing robotic locomotion with both energetic economy and rapid parameter convergence. Here, we leveraged this class of algorithms to evolve a stable hexapod gait controller capable of traversing uneven terrain and obstacles. Gait parameters were evolved in a rigid body dynamics simulation on an 8 x 3 meter obstacle course composed of a random step field, linear obstacles, and inclined surfaces. Using a fitness function that jointly optimized locomotion velocity and stability, we found that multiple successful gait parameter evolutions yielded specialized functionality for each leg. Specific gait parameters were identified as critical to developing a rough terrain gait.
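To make the idea of a fitness function that jointly rewards velocity and stability concrete, the sketch below combines the two terms as a weighted difference and uses a simple (1+1) evolutionary loop to tune two gait parameters. Everything here is an assumption for illustration: the weights, the tilt-variance stability proxy, the two-parameter gait, and the `toy_simulate` stand-in, which replaces the rigid body dynamics simulation with a closed-form toy. The abstract does not state the actual fitness formula or the evolutionary algorithm variant used.

```python
import random


def fitness(velocity, tilt_variance, w_velocity=1.0, w_stability=1.0):
    """Higher is better: reward forward speed, penalise body tilt variance."""
    return w_velocity * velocity - w_stability * tilt_variance


def toy_simulate(params):
    """Stand-in for a physics simulation (purely illustrative).
    Returns (velocity, tilt_variance) for gait parameters (stride, frequency)."""
    stride, frequency = params
    velocity = stride * frequency
    tilt_variance = (stride - 0.5) ** 2 + (frequency - 1.0) ** 2
    return velocity, tilt_variance


def evolve(generations=50, seed=0):
    """Simple (1+1) evolution: keep a mutated child only if it scores higher."""
    rng = random.Random(seed)
    parent = [0.1, 0.1]
    best = fitness(*toy_simulate(parent))
    history = [best]
    for _ in range(generations):
        # Mutate every parameter with small Gaussian noise.
        child = [p + rng.gauss(0.0, 0.1) for p in parent]
        score = fitness(*toy_simulate(child))
        if score > best:
            parent, best = child, score
        history.append(best)
    return parent, history


params, history = evolve()
print(params, history[-1])
```

The relative weights control the trade-off: raising `w_stability` favours cautious gaits, while raising `w_velocity` favours faster but less stable ones.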
Inspired by the recent success of transformers in natural language processing and computer vision applications, we introduce a transformer-based neural architecture for two key StarCraft II (SC2) macromanagement tasks: global state and build order prediction. Unlike recurrent neural networks, which suffer from a recency bias, transformers are able to capture patterns across very long time horizons, making them well suited for full game analysis. Our model utilizes the MSC (Macromanagement in StarCraft II) dataset and improves on the top-performing gated recurrent unit (GRU) architecture in predicting global state and build order, as measured by mean accuracy over multiple time horizons. We present ablation studies on our proposed architecture that support our design decisions. One key advantage of transformers is their ability to generalize well, and we demonstrate that our model achieves even better accuracy when used in a transfer learning setting in which models trained on games with one racial matchup (e.g., Terran vs. Protoss) are transferred to a different one. We believe that transformers' ability to model long games, potential for parallelization, and generalization performance make them an excellent choice for StarCraft agents.