Lingjia Tang

SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

May 21, 2025

Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars

Abstract: The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that state-of-the-art (SOTA) calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and a public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.
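
The abstract reports agreement with human judgments as a Spearman correlation. As a rough illustration of that metric only (not of SLMEval's calibration itself, which the abstract does not specify), the sketch below computes Spearman's rank correlation in plain Python. The function names and the score lists are hypothetical and made up for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for four models: human ratings vs. an automatic evaluator.
human_scores = [4.5, 3.0, 2.0, 1.0]
evaluator_scores = [0.9, 0.7, 0.4, 0.1]
rho = spearman(human_scores, evaluator_scores)
```

A value near 1 means the evaluator orders the models the way humans do; a negative value means the orderings tend to be reversed.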

A Graph-Based Approach for Conversational AI-Driven Personal Memory Capture and Retrieval in a Real-world Application

Dec 06, 2024

Savini Kashmira, Jayanaka L. Dantanarayana, Joshua Brodsky, Ashish Mahendra, Yiping Kang, Krisztian Flautner, Lingjia Tang, Jason Mars

Abstract: TOBU is a novel mobile application that captures and retrieves "personal memories" (pictures/videos together with stories and context around those moments) through an engaging, AI-guided conversational approach. Our initial prototype showed that existing retrieval techniques such as retrieval-augmented generation (RAG) systems fall short due to their limitations in understanding memory relationships, causing low recall, hallucination, and unsatisfactory user experience. We design TOBUGraph, a novel graph-based retrieval approach. During capturing, TOBUGraph leverages large language models (LLMs) to automatically create a dynamic knowledge graph of memories, establishing context and relationships of those memories. During retrieval, TOBUGraph combines LLMs with the memory graph to achieve comprehensive recall through graph traversal. Our evaluation using real user data demonstrates that TOBUGraph outperforms multiple RAG implementations in both precision and recall, significantly improving user experience through improved retrieval accuracy and reduced hallucination.
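
To make "recall through graph traversal" concrete, here is a minimal sketch of a breadth-first walk over a memory graph. It is not TOBUGraph's implementation: the adjacency-dict representation, the node names, the `related_memories` function, and the `max_hops` parameter are all assumptions chosen for illustration.

```python
from collections import deque


def related_memories(graph, start, max_hops=2):
    """Collect nodes reachable from `start` within `max_hops` edges (BFS order)."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, hops + 1))
    return found


# Hypothetical memory graph: each key links to related memories or entities.
memory_graph = {
    "beach_trip": ["family", "hawaii"],
    "family": ["birthday_2023"],
    "hawaii": [],
    "birthday_2023": ["cake_photo"],
}
nearby = related_memories(memory_graph, "beach_trip", max_hops=2)
```

Following edges in this way can surface memories that share no keywords with the query but are linked through a common person, place, or event.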

Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat

Nov 19, 2024

Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars

Abstract: Deciding which large language model (LLM) to use is a complex challenge. Pairwise ranking has emerged as a new method for evaluating human preferences for LLMs. This approach entails humans evaluating pairs of model outputs based on a predefined criterion. By collecting these comparisons, a ranking can be constructed using methods such as Elo. However, applying these algorithms as constructed in the context of LLM evaluation introduces several challenges. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct a series of extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
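
As a rough sketch of how a ranking can be built from pairwise comparisons with Elo, the code below applies the standard Elo update to a list of (winner, loser) outcomes. The K-factor of 32, the initial rating of 1000, the function names, and the model names are assumptions for illustration; the paper's own experimental setup is not described in the abstract.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, initial=1000.0):
    """Compute Elo ratings from an ordered list of (winner, loser) pairs."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        # Expected score of the winner under the logistic Elo model.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def ranking(ratings):
    """Return model names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical human preference outcomes between three models.
outcomes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = elo_ratings(outcomes)
leaderboard = ranking(ratings)
```

Because each update uses the ratings at that moment, the result depends on the order in which comparisons are processed, which is one reason a sequential method like this may need care when applied to a fixed batch of LLM comparisons.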

PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization

Jul 25, 2024

Christopher Clarke, Yuzhao Heng, Lingjia Tang, Jason Mars

Abstract: The recent emergence of Large Language Models (LLMs) has heralded a new era of human-AI interaction. These sophisticated models, exemplified by ChatGPT and its successors, have exhibited remarkable capabilities in language understanding. However, as these LLMs have undergone exponential growth, a crucial dimension that remains understudied is the personalization of these models. Large foundation models such as GPT-3 focus on creating a universal model that serves a broad range of tasks and users. This approach emphasizes the model's generalization capabilities, treating users as a collective rather than as distinct individuals. While practical for many common applications, this one-size-fits-all approach often fails to address the rich tapestry of human diversity and individual needs. To explore this issue, we introduce the PEFT-U Benchmark: a new dataset for building and evaluating NLP models for user personalization. PEFT-U consists of a series of user-centered tasks containing diverse and individualized expressions where the preferences of users can potentially differ for the same input. Using PEFT-U, we explore the challenge of efficiently personalizing LLMs to accommodate user-specific preferences in the context of diverse user-centered tasks.

LLMs are Meaning-Typed Code Constructs

May 14, 2024

Jason Mars, Yiping Kang, Jayanaka Dantanarayana, Chandra Irugalbandara, Kugesan Sivasothynathan, Lingjia Tang

Abstract: Programming with Generative AI (GenAI) models is a type of neurosymbolic programming and has seen tremendous adoption across many domains. However, leveraging GenAI models in code today can be complex and counter-intuitive, and often requires specialized frameworks, leading to increased complexity. This is because it is currently unclear what the right abstractions are for marrying GenAI models with traditional programming code constructs. In this paper, we introduce a set of novel abstractions to help bridge the gap between neuro- and symbolic programming. We introduce Meaning, a new specialized type that represents the underlying semantic value of traditional types (e.g., string). We make the case that GenAI models, LLMs in particular, should be reasoned about as a meaning-typed code construct at the language level. We formulate the problem of translation between meaning and traditional types and propose Automatic Meaning-Type Transformation (A-MTT), a runtime feature that abstracts this translation away from developers by automatically converting between Meaning and traditional types at the interface of LLM invocation. Leveraging this new set of code constructs and A-MTT, we demonstrate example implementations of neurosymbolic programs that seamlessly utilize LLMs to solve problems in place of potentially complex traditional programming logic.

A Trade-off Analysis of Replacing Proprietary LLMs with Open Source SLMs in Production

Jan 15, 2024

Chandra Irugalbandara, Ashish Mahendra, Roland Daynauth, Tharuka Kasthuri Arachchige, Krisztian Flautner, Lingjia Tang, Yiping Kang, Jason Mars

Abstract: Many companies rely on APIs of managed AI models such as OpenAI's GPT-4 to create AI-enabled experiences in their products. Along with the benefits of ease of use and shortened time to production, this reliance on proprietary APIs has downsides in terms of model control, performance reliability, up-time predictability, and cost. At the same time, there has been a flurry of open source small language models (SLMs) that have been made available for commercial use. However, their readiness to replace existing capabilities remains unclear, and a systematic approach to test these models is not readily available. In this paper, we present a systematic evaluation methodology for, and characterization of, modern open source SLMs and their trade-offs when replacing proprietary LLM APIs for a real-world product feature. We have designed SLaM, an automated analysis tool that enables the quantitative and qualitative testing of product features utilizing arbitrary SLMs. Using SLaM, we examine both the quality and the performance characteristics of modern SLMs relative to an existing customer-facing OpenAI-based implementation. Across 9 SLMs and 29 variants, we observe competitive quality-of-results for our use case, significant performance consistency improvement, and a cost reduction of 5x-29x when compared to OpenAI GPT-4.

* Updated title

One Agent Too Many: User Perspectives on Approaches to Multi-agent Conversational AI

Jan 13, 2024

Christopher Clarke, Karthik Krishnamurthy, Walter Talamonti, Yiping Kang, Lingjia Tang, Jason Mars

Abstract: Conversational agents have been gaining increasing popularity in recent years. Influenced by the widespread adoption of task-oriented agents such as Apple Siri and Amazon Alexa, these agents are being deployed into various applications to enhance user experience. Although these agents promote "ask me anything" functionality, they are typically built to focus on a single area or a finite set of areas of expertise. Given that complex tasks often require more than one area of expertise, this results in users needing to learn and adopt multiple agents. One approach to alleviate this is to abstract the orchestration of agents in the background. However, this removes the option of choice and flexibility, potentially harming the ability to complete tasks. In this paper, we explore these different interaction experiences, one agent for all versus user choice of agents, for conversational AI. We design prototypes for each, systematically evaluating their ability to facilitate task completion. Through a series of user studies, we show that users have a significant preference for abstracting agent orchestration in both system usability and system performance. Additionally, we demonstrate that this mode of interaction is able to provide quality responses that are rated within 1% of human-selected answers.

Label Agnostic Pre-training for Zero-shot Text Classification

May 25, 2023

Christopher Clarke, Yuzhao Heng, Yiping Kang, Krisztian Flautner, Lingjia Tang, Jason Mars

Abstract: Conventional approaches to text classification typically assume the existence of a fixed set of predefined labels to which a given text can be classified. However, in real-world applications, there exists an infinite label space for describing a given text. In addition, depending on the aspect (sentiment, topic, etc.) and domain of the text (finance, legal, etc.), the interpretation of the label can vary greatly. This makes the task of text classification, particularly in the zero-shot scenario, extremely challenging. In this paper, we investigate the task of zero-shot text classification with the aim of improving the ability of pre-trained language models (PLMs) to generalize to both seen and unseen data across varying aspects and domains. To address this, we introduce two new simple yet effective pre-training strategies, Implicit and Explicit pre-training. These methods inject aspect-level understanding into the model at train time with the goal of conditioning the model to build task-level understanding. To evaluate this, we construct and release UTCD, a new benchmark dataset for evaluating text classification in zero-shot settings. Experimental results on UTCD show that our approach achieves improved zero-shot generalization on a suite of challenging datasets across an array of zero-shot formalizations.

* Findings of ACL 2023

The Jaseci Programming Paradigm and Runtime Stack: Building Scale-out Production Applications Easy and Fast

May 17, 2023

Jason Mars, Yiping Kang, Roland Daynauth, Baichuan Li, Ashish Mahendra, Krisztian Flautner, Lingjia Tang

Abstract: Today's production scale-out applications include many sub-application components, such as storage backends, logging infrastructure and AI models. These components have drastically different characteristics, are required to work in collaboration, and interface with each other as microservices. This leads to increasingly high complexity in developing, optimizing, configuring, and deploying scale-out applications, raising the barrier to entry for most individuals and small teams. We developed a novel co-designed runtime system, Jaseci, and programming language, Jac, which aim to reduce this complexity. The key principle throughout Jaseci's design is to raise the level of abstraction by moving as much of the scale-out data management, microservice componentization, and live update complexity as possible into the runtime stack, where it can be automated and optimized. We use real-world AI applications to demonstrate Jaseci's benefit for application performance and developer productivity.

One Agent To Rule Them All: Towards Multi-agent Conversational AI

Mar 15, 2022

Christopher Clarke, Joseph Joshua Peper, Karthik Krishnamurthy, Walter Talamonti, Kevin Leach, Walter Lasecki, Yiping Kang, Lingjia Tang, Jason Mars

Abstract: The increasing volume of commercially available conversational agents (CAs) on the market has resulted in users being burdened with learning and adopting multiple agents to accomplish their tasks. Though prior work has explored supporting a multitude of domains within the design of a single agent, the interaction experience suffers due to the large action space of desired capabilities. To address these problems, we introduce a new task, BBAI: Black-Box Agent Integration, focusing on combining the capabilities of multiple black-box CAs at scale. We explore two techniques aimed at resolving this task: question agent pairing and question response pairing. Leveraging these techniques, we design One For All (OFA), a scalable system that provides a unified interface to interact with multiple CAs. Additionally, we introduce MARS: Multi-Agent Response Selection, a new encoder model for question response pairing that jointly encodes user question and agent response pairs. We demonstrate that OFA is able to automatically and accurately integrate an ensemble of commercially available CAs spanning disparate domains. Specifically, using the MARS encoder we achieve the highest accuracy on our BBAI task, outperforming strong baselines.
