Casey Kennington

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Apr 21, 2026

Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai

Abstract:Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git

* Work accepted at ACL 2026 Findings

ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

Apr 20, 2026

Clayton Fields, Casey Kennington

Abstract:Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.

Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Jul 08, 2025

Casey Kennington, David Schlangen

Abstract:Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifier model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature, and it has been well-tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.

* 9 pages

Incremental Dialogue Management: Survey, Discussion, and Implications for HRI

Jan 01, 2025

Casey Kennington, Pierre Lison, David Schlangen

Abstract:Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems, survey incremental modeling of important aspects of dialogue like speech recognition and language generation. Primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and the implications of incremental dialogue for embodied, robotic platforms.

* 16 pages

Renaissance: Investigating the Pretraining of Vision-Language Encoders

Nov 11, 2024

Clayton Fields, Casey Kennington

Abstract:In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.

Unsupervised, Bottom-up Category Discovery for Symbol Grounding with a Curious Robot

Apr 03, 2024

Catherine Henry, Casey Kennington

Abstract:Towards addressing the Symbol Grounding Problem and motivated by early childhood language development, we leverage a robot which has been equipped with an approximate model of curiosity with particular focus on bottom-up building of unsupervised categories grounded in the physical world. That is, rather than starting with a top-down symbol (e.g., a word referring to an object) and providing meaning through the application of predetermined samples, the robot autonomously and gradually breaks up its exploration space into a series of increasingly specific unlabeled categories at which point an external expert may optionally provide a symbol association. We extend prior work by using a robot that can observe the visual world, introducing a higher dimensional sensory space, and using a more generalizable method of category building. Our experiments show that the robot learns categories based on actions and what it visually observes, and that those categories can be symbolically grounded.

* 10 pages

Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community

Apr 01, 2024

Casey Kennington, Malihe Alikhani, Heather Pon-Barry, Katherine Atwell, Yonatan Bisk, Daniel Fried, Felix Gervits, Zhao Han, Mert Inan, Michael Johnston(+13 more)

Abstract:The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first focused on education, the second on benchmarks, and the third on the modeling of language when it comes to spoken interaction with robots. The three proposals should act as white papers for any researcher to take and build upon.

* NSF Report on the "Dialogue with Robots" Workshop held in Pittsburgh, PA, April 2023

Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning

Feb 16, 2024

Jun Zhuang, Casey Kennington

Abstract:As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms; pre-trained language models' fine-tuning and zero-shot/few-shot classifications using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task.

* TL;DR: We collected metadata about LLM surveys and developed a method for categorizing them into a taxonomy, indicating the superiority of graph representation learning over language models and revealing the efficacy of fine-tuning using weak labels

A Multi-Perspective Learning to Rank Approach to Support Children's Information Seeking in the Classroom

Aug 29, 2023

Garrett Allen, Katherine Landau Wright, Jerry Alan Fails, Casey Kennington, Maria Soledad Pera

Abstract:We introduce a novel re-ranking model that aims to augment the functionality of standard search engines to support classroom search activities for children (ages 6 to 11). This model extends the known listwise learning-to-rank framework by balancing risk and reward. Doing so enables the model to prioritize Web resources of high educational alignment, appropriateness, and adequate readability by analyzing the URLs, snippets, and page titles of Web resources retrieved by a given mainstream search engine. Experimental results, including an ablation study and comparisons with existing baselines, showcase the correctness of the proposed model. The outcomes of this work demonstrate the value of considering multiple perspectives inherent to the classroom setting, e.g., educational alignment, readability, and objectionability, when applied to the design of algorithms that can better support children's information discovery.

* Extended version of the manuscript to appear in proceedings of the 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology

On the Computational Modeling of Meaning: Embodied Cognition Intertwined with Emotion

Jul 12, 2023

Casey Kennington

Abstract:This document chronicles this author's attempt to explore how words come to mean what they do, with a particular focus on child language acquisition and what that means for models of language understanding. (I say historical because I synthesize the ideas based on when I discovered them and how those ideas influenced my later thinking.) I explain the setting for child language learning, how embodiment (being able to perceive and enact in the world, including knowledge of concrete and abstract concepts) is crucial, and how emotion and cognition relate to each other and the language learning process. I end with what I think are some of the requirements for a language-learning agent that learns language in a setting similar to that of children. This paper can act as a potential guide for ongoing and future work in modeling language.

* 18 pages
