Over the past decade, the intricacies of sports-related concussions among female athletes have become readily apparent. Traditional clinical methods for diagnosing concussions suffer limitations when applied to female athletes, often failing to capture subtle changes in brain structure and function. Advanced neuroinformatics techniques and machine learning models have become invaluable assets in this endeavor. While these technologies have been extensively employed in understanding concussion in male athletes, there remains a significant gap in our comprehension of their effectiveness for female athletes. With its remarkable data analysis capacity, machine learning offers a promising avenue to bridge this deficit. By harnessing the power of machine learning, researchers can link observed phenotypic neuroimaging data to sex-specific biological mechanisms, unraveling the mysteries of concussions in female athletes. Furthermore, embedding methods within machine learning enable examining brain architecture and its alterations beyond the conventional anatomical reference frame. In turn, allows researchers to gain deeper insights into the dynamics of concussions, treatment responses, and recovery processes. To guarantee that female athletes receive the optimal care they deserve, researchers must employ advanced neuroimaging techniques and sophisticated machine-learning models. These tools enable an in-depth investigation of the underlying mechanisms responsible for concussion symptoms stemming from neuronal dysfunction in female athletes. This paper endeavors to address the crucial issue of sex differences in multimodal neuroimaging experimental design and machine learning approaches within female athlete populations, ultimately ensuring that they receive the tailored care they require when facing the challenges of concussions.
The recent wave of generative AI has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. At the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. This presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? In this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. Specifically, we propose and test the Generative AI Paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon -- and can therefore exceed -- their ability to understand those same types of outputs. This contrasts with humans, for whom basic understanding almost always precedes the ability to generate expert-level outputs. We test this hypothesis through controlled experiments analyzing generation vs. understanding in generative models, across both language and image modalities. Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, as well as weaker correlation between generation and understanding performance, and more brittleness to adversarial inputs. Our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence.
Many real-world applications require surfacing extracted snippets to users, whether motivated by assistive tools for literature surveys or document cross-referencing, or needs to mitigate and recover from model generated inaccuracies., Yet, these passages can be difficult to consume when divorced from their original document context. In this work, we explore the limits of LLMs to perform decontextualization of document snippets in user-facing scenarios, focusing on two real-world settings - question answering and citation context previews for scientific documents. We propose a question-answering framework for decontextualization that allows for better handling of user information needs and preferences when determining the scope of rewriting. We present results showing state-of-the-art LLMs under our framework remain competitive with end-to-end approaches. We also explore incorporating user preferences into the system, finding our framework allows for controllability.
Traditionally, writing assistance systems have focused on short or even single-word suggestions. Recently, large language models like GPT-3 have made it possible to generate significantly longer natural-sounding suggestions, offering more advanced assistance opportunities. This study explores the trade-offs between sentence- vs. message-level suggestions for AI-mediated communication. We recruited 120 participants to act as staffers from legislators' offices who often need to respond to large volumes of constituent concerns. Participants were asked to reply to emails with different types of assistance. The results show that participants receiving message-level suggestions responded faster and were more satisfied with the experience, as they mainly edited the suggested drafts. In addition, the texts they wrote were evaluated as more helpful by others. In comparison, participants receiving sentence-level assistance retained a higher sense of agency, but took longer for the task as they needed to plan the flow of their responses and decide when to use suggestions. Our findings have implications for designing task-appropriate communication assistance systems.
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
* Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Project page: https://crfm.stanford.edu/helm/v1.0
Recent work (e.g. LAMA (Petroni et al., 2019)) has found that the quality of the factual information extracted from Large Language Models (LLMs) depends on the prompts used to query them. This inconsistency is problematic because different users will query LLMs for the same information using different wording, but should receive the same, accurate responses regardless. In this work we aim to address this shortcoming by introducing P-Adapters: lightweight models that sit between the embedding layer and first attention layer of LLMs. They take LLM embeddings as input and output continuous prompts that are used to query the LLM. Additionally, we investigate Mixture of Experts (MoE) models that learn a set of continuous prompts ("experts") and select one to query the LLM. They require a separate classifier trained on human-annotated data to map natural language prompts to the continuous ones. P-Adapters perform comparably to the more complex MoE models in extracting factual information from BERT and RoBERTa while eliminating the need for additional annotations. P-Adapters show between 12-26% absolute improvement in precision and 36-50% absolute improvement in consistency over a baseline of only using natural language queries. Finally, we investigate what makes a P-adapter successful and conclude that access to the LLM's embeddings of the original natural language prompt, particularly the subject of the entity pair being asked about, is a significant factor.
Targeted syntactic evaluation of subject-verb number agreement in English (TSE) evaluates language models' syntactic knowledge using hand-crafted minimal pairs of sentences that differ only in the main verb's conjugation. The method evaluates whether language models rate each grammatical sentence as more likely than its ungrammatical counterpart. We identify two distinct goals for TSE. First, evaluating the systematicity of a language model's syntactic knowledge: given a sentence, can it conjugate arbitrary verbs correctly? Second, evaluating a model's likely behavior: given a sentence, does the model concentrate its probability mass on correctly conjugated verbs, even if only on a subset of the possible verbs? We argue that current implementations of TSE do not directly capture either of these goals, and propose new metrics to capture each goal separately. Under our metrics, we find that TSE overestimates systematicity of language models, but that models score up to 40% better on verbs that they predict are likely in context.
* 14 pages, 5 figures, 3 tables. To appear at NAACL 2021
Augmented-reality (AR) glasses that will have access to onboard sensors and an ability to display relevant information to the user present an opportunity to provide user assistance in quotidian tasks. Many such tasks can be characterized as object-rearrangement tasks. We introduce a novel framework for computing and displaying AR assistance that consists of (1) associating an optimal action sequence with the policy of an embodied agent and (2) presenting this sequence to the user as suggestions in the AR system's heads-up display. The embodied agent comprises a "hybrid" between the AR system and the user, with the AR system's observation space (i.e., sensors) and the user's action space (i.e., task-execution actions); its policy is learned by minimizing the task-completion time. In this initial study, we assume that the AR system's observations include the environment's map and localization of the objects and the user. These choices allow us to formalize the problem of computing AR assistance for any object-rearrangement task as a planning problem, specifically as a capacitated vehicle-routing problem. Further, we introduce a novel AR simulator that can enable web-based evaluation of AR-like assistance and associated at-scale data collection via the Habitat simulator for embodied artificial intelligence. Finally, we perform a study that evaluates user response to the proposed form of AR assistance on a specific quotidian object-rearrangement task, house cleaning, using our proposed AR simulator on mechanical turk. In particular, we study the effect of the proposed AR assistance on users' task performance and sense of agency over a range of task difficulties. Our results indicate that providing users with such assistance improves their overall performance and while users report a negative impact to their agency, they may still prefer the proposed assistance to having no assistance at all.
* 19 pages including supplementary. Under review for ACM IUI 2021
Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position is a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
Natural language generation (NLG) systems are commonly evaluated using n-gram overlap measures (e.g. BLEU, ROUGE). These measures do not directly capture semantics or speaker intentions, and so they often turn out to be misaligned with our true goals for NLG. In this work, we argue instead for communication-based evaluations: assuming the purpose of an NLG system is to convey information to a reader/listener, we can directly evaluate its effectiveness at this task using the Rational Speech Acts model of pragmatic language use. We illustrate with a color reference dataset that contains descriptions in pre-defined quality categories, showing that our method better aligns with these quality categories than do any of the prominent n-gram overlap methods.