William Yang Wang

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Dec 12, 2024

Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang

Abstract: This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains (airline baggage fees, NBA transactions, and tax regulations), RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly on the benchmark. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.

* Data and code are available at https://github.com/skyriver-2000/RuleArena

Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Nov 27, 2024

Tiffany Zhu, Kexun Zhang, William Yang Wang

Abstract: The impressive essay writing and problem-solving capabilities of large language models (LLMs) like OpenAI's ChatGPT have opened up new avenues in education. Our goal is to gain insights into the widespread use of LLMs among secondary students to inform their future development. Despite school restrictions, our survey of over 300 middle and high school students revealed that a remarkable 70% of students have used LLMs, a higher rate than among young adults, and this percentage remains consistent from 7th through 12th grade. Students also reported using LLMs for multiple subjects, including language arts, history, and math assignments, but expressed mixed thoughts on their effectiveness because of occasional hallucinations in historical contexts and incorrect answers caused by a lack of rigorous reasoning. The survey feedback called for LLMs better adapted to students, and also raised questions for developers and educators on how to help students from underserved communities leverage LLMs' capabilities for equal access to advanced education resources. We propose a few ideas to address such issues, including subject-specific models, personalized learning, and AI classrooms.

* 6 main pages, 5 figures

Disentangling Memory and Reasoning Ability in Large Language Models

Nov 21, 2024

Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, Yongfeng Zhang

Abstract: Large Language Models (LLMs) have demonstrated strong performance in handling complex tasks requiring both extensive knowledge and reasoning abilities. However, the existing LLM inference pipeline operates as an opaque process without explicit separation between knowledge retrieval and reasoning steps, making the model's decision-making process unclear and disorganized. This ambiguity can lead to issues such as hallucinations and knowledge forgetting, which significantly impact the reliability of LLMs in high-stakes domains. In this paper, we propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions: (1) memory recall, which retrieves relevant knowledge, and (2) reasoning, which performs logical steps based on the recalled knowledge. To facilitate this decomposition, we introduce two special tokens, <memory> and <reason>, guiding the model to distinguish between steps that require knowledge retrieval and those that involve reasoning. Our experimental results show that this decomposition not only improves model performance but also enhances the interpretability of the inference process, enabling users to identify sources of error and refine model responses effectively. The code is available at https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning.

Scaling LLM Inference with Optimized Sample Compute Allocation

Oct 29, 2024

Kexun Zhang, Shang Zhou, Danqing Wang, William Yang Wang, Lei Li

Abstract: Sampling is a basic operation in many inference-time algorithms of large language models (LLMs). To scale up inference efficiently with limited compute, it is crucial to find an optimal allocation for sample compute budgets: Which sampling configurations (model, temperature, language, etc.) do we use? How many samples do we generate in each configuration? We formulate these choices as a learning problem and propose OSCA, an algorithm that Optimizes Sample Compute Allocation by finding an optimal mix of different inference configurations. Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration with 128x less compute on code generation and 25x less compute on 4 reasoning tasks. OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration. Our code and generations are released at https://github.com/LeiLiLab/OSCA.
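
The allocation question in this abstract can be illustrated with a toy sketch. The code below is not the OSCA algorithm; it is a simple greedy heuristic written for illustration, and it assumes we already have an estimated per-problem success probability for every configuration. The configuration names, the probabilities, and the functions `expected_solve_rate` and `greedy_allocate` are all hypothetical.

```python
from typing import Dict, List

# p[config][i] = assumed probability that one sample from `config` solves problem i.
SuccessTable = Dict[str, List[float]]


def expected_solve_rate(p: SuccessTable, alloc: Dict[str, int]) -> float:
    """Mean over problems of P(at least one of the allocated samples succeeds)."""
    n_problems = len(next(iter(p.values())))
    total = 0.0
    for i in range(n_problems):
        fail = 1.0
        for cfg, n in alloc.items():
            fail *= (1.0 - p[cfg][i]) ** n  # all n samples from cfg fail
        total += 1.0 - fail
    return total / n_problems


def greedy_allocate(p: SuccessTable, budget: int) -> Dict[str, int]:
    """Spend the sample budget one sample at a time on the configuration
    whose extra sample raises the expected solve rate the most."""
    alloc = {cfg: 0 for cfg in p}
    for _ in range(budget):
        best_cfg = max(
            alloc,
            key=lambda c: expected_solve_rate(p, {**alloc, c: alloc[c] + 1}),
        )
        alloc[best_cfg] += 1
    return alloc


# Toy data: each configuration is good at a different problem.
p = {
    "model_a_low_temp": [0.9, 0.0],
    "model_b_high_temp": [0.0, 0.5],
}
alloc = greedy_allocate(p, budget=4)
mixed = expected_solve_rate(p, alloc)
best_single = max(expected_solve_rate(p, {cfg: 4}) for cfg in p)
print(alloc, mixed, best_single)
```

On this toy table the mixed allocation beats spending the whole budget on either configuration alone, which is the intuition behind mixing configurations; how the allocation is actually learned is described in the paper and its repository.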

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

Oct 17, 2024

Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, Zhiyu Zoey Chen

Abstract: There is a significant gap between patient needs and available mental health support today. In this paper, we aim to thoroughly examine the potential of using Large Language Models (LLMs) to assist professional psychotherapy. To this end, we propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance. We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions. These tasks encompass key aspects of CBT that could potentially be enhanced through AI assistance, while also outlining a hierarchy of capability requirements, ranging from basic knowledge recitation to engaging in real therapeutic conversations. We evaluated representative LLMs on our benchmark. Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios requiring deep analysis of patients' cognitive structures and generating effective responses, suggesting potential future work.

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Oct 15, 2024

Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Abstract: Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
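
The propose-and-replace loop described in this abstract can be sketched with toy next-token distributions. This is a minimal illustration, not the authors' implementation: it assumes that "poorly ranked" means the student's token falls outside the teacher's top-k, and that a rejected token is replaced by a sample from the teacher's distribution. The toy models, the value of k, and all function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Dist = Dict[str, float]               # next-token probabilities
Model = Callable[[List[str]], Dist]   # maps a prefix to a distribution


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def top_k(dist: Dist, k: int) -> Set[str]:
    """The k most probable tokens under dist."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])


def interleaved_sample(
    student: Model, teacher: Model, prefix: List[str],
    steps: int, k: int, rng: random.Random,
) -> List[Tuple[str, str]]:
    """Return (token, source) pairs, where source says who chose the token."""
    out: List[Tuple[str, str]] = []
    context = list(prefix)
    for _ in range(steps):
        proposal = sample(student(context), rng)   # student proposes
        teacher_dist = teacher(context)
        if proposal in top_k(teacher_dist, k):     # assumed acceptance rule
            token, source = proposal, "student"
        else:                                      # teacher replaces it
            token, source = sample(teacher_dist, rng), "teacher"
        out.append((token, source))
        context.append(token)
    return out


# Toy models that ignore the context; the student over-weights "zzz".
def toy_student(context: List[str]) -> Dist:
    return {"the": 0.4, "cat": 0.3, "zzz": 0.3}


def toy_teacher(context: List[str]) -> Dist:
    return {"the": 0.5, "cat": 0.45, "zzz": 0.05}


rng = random.Random(0)
trace = interleaved_sample(toy_student, toy_teacher, ["<s>"], steps=8, k=2, rng=rng)
print(trace)
```

In this sketch every token kept from the student is one the teacher also ranks highly, so the resulting sequence stays close to what the student would generate while filtering out tokens the teacher considers unlikely.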

COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement

Oct 12, 2024

Yuxi Xie, Anirudh Goyal, Xiaobao Wu, Xunjian Yin, Xiao Xu, Min-Yen Kan, Liangming Pan, William Yang Wang

Abstract: Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. However, existing approaches typically implement iterative refinement at the application or prompting level, relying on autoregressive (AR) modeling. The sequential token generation in AR models can lead to high inference latency. To overcome these challenges, we propose Context-Wise Order-Agnostic Language Modeling (COrAL), which incorporates iterative refinement directly into the LLM architecture while maintaining computational efficiency. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally during the generation process. Leveraging the order-agnostic nature of COrAL, we introduce sliding blockwise order-agnostic decoding, which performs multi-token forward prediction and backward reconstruction within context windows. This allows the model to iteratively refine its outputs in parallel in the sliding block, effectively capturing diverse dependencies without the high inference cost of sequential generation. Empirical evaluations on reasoning tasks demonstrate that COrAL improves both performance and inference speed, achieving absolute accuracy gains of 4.6% on GSM8K and 4.0% on LogiQA, along with inference speedups of up to 3.9× over next-token baselines. Preliminary results on code generation indicate a drop in pass rates due to inconsistencies in order-agnostic outputs, highlighting the inherent quality-speed trade-off. Our code is publicly available at https://github.com/YuxiXie/COrAL.

* 12 pages, 7 figures, 3 tables (23 pages, 9 figures, 4 tables including references and appendices)

Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models

Oct 10, 2024

Sitao Cheng, Liangming Pan, Xunjian Yin, Xinyi Wang, William Yang Wang

Abstract: Large language models (LLMs) encode vast amounts of knowledge during pre-training (parametric knowledge, or PK) and can further be enhanced by incorporating contextual knowledge (CK). Can LLMs effectively integrate their internal PK with external CK to solve complex problems? In this paper, we investigate the dynamic interaction between PK and CK, categorizing their relationships into four types: Supportive, Complementary, Conflicting, and Irrelevant. To support this investigation, we introduce ECHOQA, a benchmark spanning scientific, factual, and commonsense knowledge. Our results show that LLMs tend to suppress their PK when contextual information is available, even when it is complementary or irrelevant. While tailored instructions can encourage LLMs to rely more on their PK, they still struggle to fully leverage it. These findings reveal a key vulnerability in LLMs, raising concerns about their reliability in knowledge-intensive tasks. Resources are available at https://github.com/sitaocheng/Knowledge_Interplay.

* 27 pages, 8 figures and 17 tables

Detecting Training Data of Large Language Models via Expectation Maximization

Oct 10, 2024

Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, Miguel Ballesteros, William Yang Wang

Abstract: The widespread deployment of large language models (LLMs) has led to impressive advancements, yet information about their training data, a critical factor in their performance, remains undisclosed. Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model's training data. MIAs can offer insights into LLM outputs and help detect and address concerns such as data contamination and compliance with privacy and copyright standards. However, applying MIAs to LLMs presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership. Additionally, creating appropriate benchmarks to evaluate MIA methods is not straightforward, as training and test data distributions are often unknown. In this paper, we introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm, leveraging the duality that the estimates of these scores can be improved by each other. A membership score assesses how likely an instance is to be a member, and a prefix score assesses how discriminative the instance is as a prefix. Our method achieves state-of-the-art results on the WikiMIA dataset. To further evaluate EM-MIA, we present OLMoMIA, a benchmark built from OLMo resources, which allows us to control the difficulty of MIA tasks with varying degrees of overlap between training and test data distributions. We believe that EM-MIA serves as a robust MIA method for LLMs and that OLMoMIA provides a valuable resource for comprehensively evaluating MIA approaches, thereby driving future research in this critical area.

* 14 pages

Uncovering Factor Level Preferences to Improve Human-Model Alignment

Oct 09, 2024

Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh

Abstract: Despite advancements in Large Language Model (LLM) alignment, understanding the reasons behind LLM preferences remains crucial for bridging the gap between desired and actual behavior. LLMs often exhibit biases or tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. However, current methods for evaluating preference alignment often lack explainability, relying on coarse-grained comparisons. To address this, we introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences. PROFILE's factor level analysis explains the 'why' behind human-model alignment and misalignment, offering insights into the direction of model improvement. We apply PROFILE to analyze human and LLM preferences across three tasks: summarization, helpful response generation, and document-based question-answering. Our factor level analysis reveals a substantial discrepancy between human and LLM preferences in generation tasks, whereas LLMs show strong alignment with human preferences in evaluation tasks. We demonstrate how leveraging factor level insights, including addressing misaligned factors or exploiting the generation-evaluation gap, can improve alignment with human preferences. This work underscores the importance of explainable preference analysis and highlights PROFILE's potential to provide valuable training signals, driving further improvements in human-model alignment.
