Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Edward Y. Chang

CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse

Feb 09, 2026

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai(+3 more)

Abstract:LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench

* 17 pages, 20 tables, figures

Via

Access Paper or Ask Questions

RAudit: A Blind Auditing Protocol for Large Language Model Reasoning

Jan 30, 2026

Edward Y. Chang, Longling Geng

Abstract:Inference-time scaling can amplify reasoning pathologies: sycophancy, rung collapse, and premature certainty. We present RAudit, a diagnostic protocol for auditing LLM reasoning without ground truth access. The key constraint is blindness: the auditor evaluates only whether derivation steps support conclusions, enabling detection of trace-output inconsistency and, when latent competence exists, its recovery. RAudit measures process quality via CRIT-based reasonableness scores and varies critique formulation to study how social framing affects model response. We prove bounded correction and $O(\log(1/ε))$ termination. Experiments on mathematical reasoning (CAP-GSM8K) and causal judgment (CausalL2) reveal four mechanisms explaining model unreliability: (1) Latent Competence Suppression, where models derive correct answers then overwrite them under social pressure; (2) The False Competence Trap, where weaker judges mask sycophancy that stronger judges expose; (3) The Complexity-Vulnerability Tradeoff, where causal tasks induce more than 10 times higher sycophancy than mathematical tasks; and (4) Iatrogenic Critique, where authoritative correction harms weaker models. These findings challenge assumptions that capability implies robustness and that stronger feedback yields better outputs.

* 24 pages, 21 tables, 3 figures

Via

Access Paper or Ask Questions

T3: Benchmarking Sycophancy and Skepticism in Causal Judgment

Jan 13, 2026

Edward Y. Chang

Abstract:We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl's Ladder of Causality. Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. By applying T3 to frontier models, we diagnose two distinct pathologies: a "Skepticism Trap" at L1 (where safety-tuned models like Claude Haiku reject 60% of valid links) and a non-monotonic Scaling Paradox at L3. In the latter, the larger GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals, driven by a collapse into paralysis (excessive hedging) rather than hallucination. Finally, we use the benchmark to validate a process-verified protocol (RCA), showing that T3 successfully captures the restoration of decisive causal judgment under structured verification.

* 17 pages, 4 figures, 11 tables

Via

Access Paper or Ask Questions

Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning

Oct 06, 2025

Edward Y. Chang, Ethan Y. Chang

Abstract:Multi-agent debate often wastes compute by using a fixed adversarial stance, aggregating without deliberation, or stopping on heuristics. We introduce MACI, an active controller with two independent dials that decouple information from behavior: an information dial that gates evidence by quality, and a behavior dial that schedules contentiousness from exploration to consolidation. A moderator tracks disagreement, overlap, evidence quality, and argument quality, and halts when gains plateau. We provide theory-lite guarantees for nonincreasing dispersion and provable termination, with a budget-feasible scheduler. Across clinical diagnosis and news-bias tasks, MACI improves accuracy and calibration while reducing tokens, and converts residual uncertainty into precision RAG plans that specify what to retrieve next. We use a cross-family LLM judge (CRIT) as a conservative soft weight and stop signal, validated for order invariance and judge-swap stability; stability depends on using high-capability judges. MACI turns debate into a budget-aware, measurable, and provably terminating controller.

* 27 pages, 5 figures, 21 tables

Via

Access Paper or Ask Questions

ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

May 18, 2025

Edward Y. Chang, Longling Geng

Figure 1 for ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

Figure 2 for ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

Figure 3 for ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

Figure 4 for ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

Abstract:Large language models (LLMs) excel at rapid generation of text and multimodal content, yet they falter on transaction-style planning that demands ACID-like guarantees and real-time disruption recovery. We present Adaptive LLM Agent System (ALAS), a framework that tackles four fundamental LLM deficits: (i) absence of self-verification, (ii) context erosion, (iii) next-token myopia, and (iv) lack of persistent state. ALAS decomposes each plan into role-specialized agents, equips them with automatic state tracking, and coordinates them through a lightweight protocol. When disruptions arise, agents apply history-aware local compensation, avoiding costly global replanning and containing cascade effects. On real-world, large-scale job-shop scheduling benchmarks, ALAS sets new best results for static sequential planning and excels in dynamic reactive scenarios with unexpected disruptions. These gains show that principled modularization plus targeted compensation can unlock scalable and resilient planning with LLMs.

* 36 pages, 10 figures, 19 tables

Via

Access Paper or Ask Questions

SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Mar 15, 2025

Edward Y. Chang

Figure 1 for SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Figure 2 for SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Figure 3 for SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Figure 4 for SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Abstract:Recent LLM-based agent frameworks have demonstrated impressive capabilities in task delegation and workflow orchestration, but face significant challenges in maintaining context awareness and ensuring planning consistency. This paper presents SagaLLM, a structured multi-agent framework that addresses four fundamental limitations in current LLM approaches: inadequate self-validation, context narrowing, lacking transaction properties, and insufficient inter-agent coordination. By implementing specialized context management agents and validation protocols, SagaLLM preserves critical constraints and state information throughout complex planning processes, enabling robust and consistent decision-making even during disruptions. We evaluate our approach using selected problems from the REALM benchmark, focusing on sequential and reactive planning scenarios that challenge both context retention and adaptive reasoning. Our experiments with state-of-the-art LLMs, Claude 3.7, DeepSeek R1, GPT-4o, and GPT-o1, demonstrate that while these models exhibit impressive reasoning capabilities, they struggle with maintaining global constraint awareness during complex planning tasks, particularly when adapting to unexpected changes. In contrast, the distributed cognitive architecture of SagaLLM shows significant improvements in planning consistency, constraint enforcement, and adaptation to disruptions in various scenarios.

* 13 pages, 8 tables, 5 figures

Via

Access Paper or Ask Questions

REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems

Feb 26, 2025

Longling Geng, Edward Y. Chang

Abstract:This benchmark suite provides a comprehensive evaluation framework for assessing both individual LLMs and multi-agent systems in real-world planning scenarios. The suite encompasses eleven designed problems that progress from basic to highly complex, incorporating key aspects such as multi-agent coordination, inter-agent dependencies, and dynamic environmental disruptions. Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter-dependencies, and the frequency of unexpected disruptions requiring real-time adaptation. The benchmark includes detailed specifications, evaluation metrics, and baseline implementations using contemporary frameworks like LangGraph, enabling rigorous testing of both single-agent and multi-agent planning capabilities. Through standardized evaluation criteria and scalable complexity, this benchmark aims to drive progress in developing more robust and adaptable AI planning systems for real-world applications.

* 14 pages, 4 figures, 9 tables

Via

Access Paper or Ask Questions

MACI: Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning

Jan 29, 2025

Edward Y. Chang

Abstract:Artificial intelligence requires deliberate reasoning, temporal awareness, and effective constraint management, capabilities traditional LLMs often lack due to their reliance on pattern matching, limited self-verification, and inconsistent constraint handling. We introduce Multi-Agent Collaborative Intelligence (MACI), a framework comprising three key components: 1) a meta-planner (MP) that identifies, formulates, and refines all roles and constraints of a task (e.g., wedding planning) while generating a dependency graph, with common-sense augmentation to ensure realistic and practical constraints; 2) a collection of agents to facilitate planning and address task-specific requirements; and 3) a run-time monitor that manages plan adjustments as needed. By decoupling planning from validation, maintaining minimal agent context, and integrating common-sense reasoning, MACI overcomes the aforementioned limitations and demonstrates robust performance in two scheduling problems.

* Stanford University InfoLab Technical Report, 2025
* 21 pages, 19 tables

Via

Access Paper or Ask Questions

Unlocking the Wisdom of Large Language Models: An Introduction to The Path to Artificial General Intelligence

Sep 02, 2024

Edward Y. Chang

Abstract:This booklet, "Unlocking the Wisdom of Large Language Models," serves as an introduction to the comprehensive work "The Path to Artificial General Intelligence." Through a series of nine aphorisms, we distill key insights and principles that underpin the larger exploration of AI's future through adversarial LLM dialogue. We propose this approach as a potential path to realizing artificial general intelligence (AGI). This booklet also includes the titles, abstracts, and introductions of the chapters in the main book, and presents the first two chapters in their entirety.

Via

Access Paper or Ask Questions

EVINCE: Optimizing Adversarial LLM Dialogues via Conditional Statistics and Information Theory

Aug 26, 2024

Edward Y. Chang

Abstract:This paper introduces EVINCE (Entropy and Variation IN Conditional Exchanges), a dialogue framework advancing Artificial General Intelligence (AGI) by enhancing versatility, adaptivity, and reasoning in large language models (LLMs). Leveraging adversarial debate and a novel dual entropy theory, EVINCE improves prediction accuracy, robustness, and stability in LLMs by integrating statistical modeling, information theory, and machine learning to balance diverse perspective exploration with strong prior exploitation. The framework's effectiveness is demonstrated through consistent convergence of information-theoretic metrics, particularly improved mutual information, fostering productive LLM collaboration. We apply EVINCE to healthcare, showing improved disease diagnosis, and discuss its broader implications for decision-making across domains. This work provides theoretical foundations and empirical validation for EVINCE, paving the way for advancements in LLM collaboration and AGI development.

* 19 pages, 7 figures, four tables

Via

Access Paper or Ask Questions