Fairness has emerged as an important concern in automated decision-making in recent years, especially when these decisions affect human welfare. In this work, we study fairness in temporally extended decision-making settings, specifically those formulated as Markov Decision Processes (MDPs). Our proposed notion of fairness ensures that each state's long-term visitation frequency is more than a specified fraction. In an average-reward MDP (AMDP) setting, we formulate the problem as a bilinear saddle point program and, for a generative model, solve it using a Stochastic Mirror Descent (SMD) based algorithm. The proposed solution guarantees a simultaneous approximation on the expected average-reward and the long-term state-visitation frequency. We validate our theoretical results with experiments on synthetic data.
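As a rough illustration of the fairness notion in this abstract (and not of the SMD-based algorithm itself), the sketch below estimates the long-run state-visitation frequencies of a fixed policy by power iteration on its induced transition matrix, then checks them against per-state lower bounds. The function names, the toy matrix, and the thresholds `rho` are hypothetical. The sketch assumes the induced chain is ergodic, so that power iteration converges, and it treats the constraint as "at least the specified fraction".

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the stationary distribution of a row-stochastic matrix P.

    Assumes the chain induced by the policy is ergodic (irreducible, aperiodic).
    """
    n = len(P)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of mu <- mu P
        nxt = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, mu)) < tol:
            return nxt
        mu = nxt
    return mu


def is_fair(mu, rho):
    """True if every state's long-run visitation frequency meets its lower bound."""
    return all(m >= r for m, r in zip(mu, rho))


# Hypothetical two-state chain induced by some fixed policy.
P_toy = [[0.9, 0.1],
         [0.5, 0.5]]
mu_toy = stationary_distribution(P_toy)
print(mu_toy, is_fair(mu_toy, [0.1, 0.1]))
```

For this toy chain the stationary distribution is (5/6, 1/6), so a uniform bound of 0.1 per state is met while a bound of 0.2 on the second state is not.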
Active learning (AL) algorithms may achieve better performance with less data because the model guides the data selection process. While many algorithms have been proposed, there is little study on what the optimal AL algorithm looks like, which would help researchers understand where their models fall short and iterate on the design. In this paper, we present a simulated annealing algorithm to search for this optimal oracle and analyze it for several different tasks. We present several qualitative and quantitative insights into the optimal behavior and contrast this behavior with that of various heuristics. When augmented with one particular insight, heuristics perform consistently better. We hope that our findings can better inform future active learning research. The code for the experiments is available at https://github.com/YilunZhou/optimal-active-learning.
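To make the idea of searching over data acquisition orders concrete, here is a minimal simulated annealing sketch over permutations. It is a generic illustration under stated assumptions, not the implementation from the linked repository: `anneal_order`, `toy_quality`, the swap move, and the geometric cooling schedule are all hypothetical choices, and the toy quality function stands in for whatever model-performance measure an actual search would use.

```python
import math
import random


def anneal_order(items, quality, steps=2000, t0=1.0, t1=1e-3, seed=0):
    """Search for a high-quality ordering of `items` (needs at least two items).

    `quality` maps an ordering (list) to a score to maximize. Returns the best
    ordering seen and its score.
    """
    rng = random.Random(seed)
    current = list(items)
    cur_q = quality(current)
    best, best_q = list(current), cur_q
    for k in range(steps):
        # Geometric cooling from t0 down to t1 (an assumed schedule).
        temp = t0 * (t1 / t0) ** (k / max(1, steps - 1))
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]  # propose a swap
        new_q = quality(current)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_q >= cur_q or rng.random() < math.exp((new_q - cur_q) / temp):
            cur_q = new_q
            if cur_q > best_q:
                best, best_q = list(current), cur_q
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best, best_q


def toy_quality(order):
    """Hypothetical stand-in score: rewards placing high-value items early."""
    return sum(v / (pos + 1) for pos, v in enumerate(order))


best, best_q = anneal_order([3, 1, 4, 1, 5, 9, 2, 6], toy_quality)
print(best, round(best_q, 3))
```

Because the best ordering seen so far is tracked separately from the current one, the returned score is never worse than that of the initial ordering, even though the search itself accepts some worse moves.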
As robots are deployed in complex situations, engineers and end users must develop a holistic understanding of their capabilities and behaviors. Existing research focuses mainly on factors related to task completion, such as success rate, completion time, or total energy consumption. Other factors like collision avoidance behavior, trajectory smoothness, and motion legibility are equally or more important for safe and trustworthy deployment. While methods exist to analyze these quality factors for individual trajectories or distributions of trajectories, these statistics may be insufficient to develop a mental model of the controller's behaviors, especially uncommon behaviors. We present RoCUS: a Bayesian sampling-based method to find situations that lead to trajectories which exhibit certain behaviors. By analyzing these situations and trajectories, we can gain important insights into the controller that are easily missed in standard task-completion evaluations. On a 2D navigation problem and a 7 degree-of-freedom (DoF) arm reaching problem, we analyze three controllers: a rapidly exploring random tree (RRT) planner, a dynamical system (DS) formulation, and a deep imitation learning (IL) or reinforcement learning (RL) model. We show how RoCUS can uncover insights to further our understanding about them beyond task-completion aspects. The code is available at https://github.com/YilunZhou/RoCUS.
Building machine learning models requires a suite of tools for interpretation, understanding, and debugging. Many existing methods have been proposed, but it can still be difficult to probe for examples which communicate model behaviour. We introduce Bayes-Probe, a model inspection method for analyzing neural networks by generating distribution-conforming examples of known prediction confidence. By selecting appropriate distributions and confidence prediction values, Bayes-Probe can be used to synthesize ambivalent predictions, uncover in-distribution adversarial examples, and understand novel-class extrapolation and domain adaptation behaviours. Bayes-Probe is model agnostic, requiring only a data generator and classifier prediction. We use Bayes-Probe to analyze models trained on both procedurally-generated data (CLEVR) and organic data (MNIST and Fashion-MNIST). Code is available at https://github.com/serenabooth/Bayes-Probe.
Robotic agents must adopt existing social conventions in order to be effective teammates. These social conventions, such as driving on the right or left side of the road, are arbitrary choices among optimal policies, but all agents on a successful team must use the same convention. Prior work has identified a method of combining self-play with paired input-output data gathered from existing agents in order to learn their social convention without interacting with them. We build upon this work by introducing a technique called Adversarial Self-Play (ASP) that uses adversarial training to shape the space of possible learned policies and substantially improves learning efficiency. ASP only requires the addition of unpaired data: a dataset of outputs produced by the social convention without associated inputs. Theoretical analysis reveals how ASP shapes the policy space and the circumstances (when behaviors are clustered or exhibit some other structure) under which it offers the greatest benefits. Empirical results across three domains confirm ASP's advantages: it produces models that more closely match the desired social convention when given as few as two paired datapoints.
Though neural network models demonstrate impressive performance, we do not understand exactly how these black-box models make individual predictions. This drawback has led to substantial research devoted to understanding these models in areas such as robustness, interpretability, and generalization ability. In this paper, we consider the problem of exploring the prediction level sets of a classifier using probabilistic programming. We define a prediction level set to be the set of examples for which the predictor has the same specified prediction confidence with respect to some arbitrary data distribution. Notably, our sampling-based method does not require the classifier to be differentiable, making it compatible with arbitrary classifiers. As a specific instantiation, if we take the classifier to be a neural network and the data distribution to be that of the training data, we can obtain examples that will result in specified predictions by the neural network. We demonstrate this technique with experiments on a synthetic dataset and MNIST. Such level sets in classification may facilitate human understanding of classification behaviors.
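One way to picture sampling from a prediction level set is a random-walk Metropolis-Hastings sampler whose target combines a data prior with a soft penalty on the gap between the classifier's confidence and the requested value. The sketch below is a hand-rolled, one-dimensional illustration and not the paper's probabilistic-programming implementation: the standard normal prior, the logistic `classifier_confidence`, the Gaussian relaxation with width `sigma`, and all function names are assumptions made for the example. Note that it only evaluates the classifier and never differentiates it.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classifier_confidence(x):
    """Hypothetical black-box classifier: confidence for the positive class."""
    return sigmoid(3.0 * x)


def log_target(x, confidence, target, sigma):
    """Unnormalized log density: N(0, 1) data prior times a soft level-set match."""
    log_prior = -0.5 * x * x
    log_match = -0.5 * ((confidence(x) - target) / sigma) ** 2
    return log_prior + log_match


def sample_level_set(confidence, target, sigma=0.02, steps=5000,
                     burn_in=1000, step_size=0.2, seed=0):
    """Random-walk Metropolis-Hastings; returns the samples kept after burn-in."""
    rng = random.Random(seed)
    x = 0.0
    logp = log_target(x, confidence, target, sigma)
    samples = []
    for k in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_target(proposal, confidence, target, sigma)
        # Symmetric proposal, so the acceptance ratio is the target ratio.
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        if k >= burn_in:
            samples.append(x)
    return samples


samples = sample_level_set(classifier_confidence, target=0.9)
mean_conf = sum(classifier_confidence(s) for s in samples) / len(samples)
print(round(mean_conf, 3))
```

A smaller `sigma` concentrates the samples more tightly around the requested confidence at the cost of a lower acceptance rate, so in practice it would be tuned together with `step_size`.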
Commonsense procedural knowledge is important for AI agents and robots that operate in a human environment. While previous attempts at constructing procedural knowledge are mostly rule- and template-based, recent advances in deep learning provide the possibility of acquiring such knowledge directly from natural language sources. As a first step in this direction, we propose a model to learn embeddings for tasks, as well as the individual steps that need to be taken to solve them, based on WikiHow articles. We learn these embeddings such that they are predictive of both step relevance and step ordering. We also experiment with the use of integer programming for inferring consistent global step orderings from noisy pairwise predictions.
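The step of turning noisy pairwise predictions into one consistent global ordering can be illustrated on a tiny example. The sketch below replaces an integer programming solver with exhaustive search over permutations, which optimizes the same kind of objective (total pairwise agreement) but is only feasible for a handful of steps. The function name `best_order`, the objective, and the probabilities are hypothetical and not taken from the paper.

```python
from itertools import combinations, permutations


def best_order(steps, p_before):
    """Return the ordering that maximizes total pairwise agreement.

    `p_before[(a, b)]` is the predicted probability that step a precedes step b.
    Each unordered pair needs an entry in one direction; the reverse direction
    is taken as its complement.
    """
    def prob(a, b):
        if (a, b) in p_before:
            return p_before[(a, b)]
        return 1.0 - p_before[(b, a)]

    def score(order):
        # combinations() preserves position order, so a always precedes b here.
        return sum(prob(a, b) for a, b in combinations(order, 2))

    return max(permutations(steps), key=score)


# Hypothetical noisy predictions: A before B and B before C are likely,
# yet the model (inconsistently) leans towards C before A.
pairwise = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.4}
print(best_order(["A", "B", "C"], pairwise))  # → ('A', 'B', 'C')
```

Here the ordering A, B, C scores 0.9 + 0.4 + 0.8 = 2.1, higher than any other permutation, so the global objective overrides the single inconsistent pairwise prediction.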
Multi-agent reinforcement learning (MARL) extends (single-agent) reinforcement learning (RL) by introducing additional agents and (potentially) partial observability of the environment. Consequently, algorithms for solving MARL problems incorporate various extensions beyond traditional RL methods, such as a learned communication protocol between cooperative agents that enables exchange of private information or adaptive modeling of opponents in competitive settings. One popular algorithmic construct is a memory mechanism such that an agent's decisions can depend not only upon the current state but also upon the history of observed states and actions. In this paper, we study how a memory mechanism can be useful in environments with different properties, such as observability, internality and presence of a communication channel. Using both prior work and new experiments, we show that a memory mechanism is helpful when learning agents need to model other agents and/or when communication is constrained in some way; however, we must be cautious of agents achieving effective memoryfulness through other means.
In many applications, it is important to characterize the way in which two concepts are semantically related. Knowledge graphs such as ConceptNet provide a rich source of information for such characterizations by encoding relations between concepts as edges in a graph. When two concepts are not directly connected by an edge, their relationship can still be described in terms of the paths that connect them. Unfortunately, many of these paths are uninformative and noisy, which means that the success of applications that use such path features crucially relies on their ability to select high-quality paths. In existing applications, this path selection process is based on relatively simple heuristics. In this paper, we instead propose to learn to predict path quality from crowdsourced human assessments. Since we are interested in a generic task-independent notion of quality, we simply ask human participants to rank paths according to their subjective assessment of the paths' naturalness, without attempting to define naturalness or steering the participants towards particular indicators of quality. We show that a neural network model trained on these assessments is able to predict human judgments on unseen paths with near-optimal performance. Most notably, we find that the resulting path selection method is substantially better than the current heuristic approaches at identifying meaningful paths.
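To show what "paths that connect two concepts" means in practice, the following sketch enumerates simple paths of bounded length between two nodes of a small labeled graph. It is only an illustration of the candidate-generation step that a path-quality model would then rank: the triples are made up for the example rather than taken from ConceptNet, edges are followed in one direction only, and `simple_paths` and `max_hops` are hypothetical names.

```python
def simple_paths(edges, source, target, max_hops=3):
    """Enumerate simple paths (no repeated nodes) from source to target.

    `edges` is an iterable of (head, relation, tail) triples, followed in the
    head-to-tail direction only. Each path is a list of such triples.
    """
    adjacency = {}
    for head, rel, tail in edges:
        adjacency.setdefault(head, []).append((rel, tail))

    paths = []

    def extend(node, visited, path):
        if node == target and path:
            paths.append(list(path))
            return
        if len(path) == max_hops:
            return
        for rel, nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                path.append((node, rel, nxt))
                extend(nxt, visited, path)
                path.pop()          # backtrack
                visited.remove(nxt)

    extend(source, {source}, [])
    return paths


# Hypothetical toy triples, not real ConceptNet data.
toy_edges = [
    ("dog", "IsA", "pet"),
    ("pet", "AtLocation", "house"),
    ("dog", "CapableOf", "bark"),
    ("bark", "RelatedTo", "noise"),
]
for p in simple_paths(toy_edges, "dog", "house"):
    print(p)
```

In a real knowledge graph this enumeration typically returns many candidate paths per concept pair, which is why a separate step for selecting the informative ones is needed.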