Harsh Raj

OpenThoughts-Agent: Data Recipes for Agentic Models

Jun 23, 2026

Negin Raoof, Richard Zhuang, Marianna Nezhurina, Etash Guha, Atula Tejaswi, Ryan Marten, Charlie F. Ruan, Tyler Griggs, Alexander Glenn Shaw, Hritik Bansal (+40 more)

Abstract:Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks, a 3.9 percentage point improvement over the strongest existing open-data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.

Joint Antenna Placement and Power Allocation for RSMA-Enabled Pinching Antenna Systems

Jun 14, 2026

Harsh Raj, Mallena Vardhan, Keshav Singh, Sudip Biswas

Abstract:This letter investigates a rate-splitting multiple access (RSMA)-enabled multi-user pinching antenna system (PASS). A fairness-aware sum-rate maximization problem is formulated to jointly optimize pinching antenna locations and common/private stream power allocation. The resulting mixed discrete-continuous non-convex problem is addressed using an alternating optimization framework that combines greedy antenna placement with successive convex approximation (SCA)-based power allocation. Numerical results demonstrate that the proposed RSMA-enabled PASS significantly improves achievable sum-rate, user fairness, and bit error rate (BER) performance compared with conventional non-RSMA PASS schemes.

RSMA Technique for Multi-User Downlink Single-Waveguide Multi-Pinching Antenna Systems

Jun 08, 2026

Harsh Raj, Mallena Vardhan, Keshav Singh, Sudip Biswas

Abstract:Pinching antennas have recently emerged as a promising technology for reconfigurable wireless systems due to their ability to dynamically radiate signals from flexible positions along a waveguide. This letter investigates a multi-user communication framework by integrating rate-splitting multiple access (RSMA) into a single-input single-output (SISO) single-waveguide architecture equipped with multiple pinching antennas. Multiple antennas are activated along a shared waveguide to radiate a common guided signal toward distributed users, enabling strong near-field line-of-sight (LoS) links with low hardware complexity and a single radiofrequency (RF) chain. To manage multi-user interference, RSMA is employed within the proposed architecture. Simulation results show that the proposed framework improves system sum-rate, enhances user rate fairness, and achieves lower bit error rate (BER) while preserving the low-cost and scalable characteristics of pinching antenna systems (PASS).
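
To make the rate-splitting idea concrete, the following is a minimal sketch of textbook one-layer RSMA rates for single-antenna users, not the letter's exact system model. It assumes each user sees one effective scalar channel power gain (which in a pinching-antenna system would depend on the antenna positions), that every user decodes the common stream first and removes it by successive interference cancellation, and that the function name `rsma_rates` and all numeric values are hypothetical.

```python
import math

def rsma_rates(gains, p_common, p_private, noise=1.0):
    """Return (common_rate, private_rates) in bits/s/Hz for one-layer RSMA.

    gains      -- effective channel power gain |h_k|^2 per user (assumed given)
    p_common   -- power allocated to the common stream
    p_private  -- power allocated to each user's private stream
    noise      -- receiver noise power (assumed equal for all users)
    """
    total_private = sum(p_private)
    common_rates = []
    private_rates = []
    for g, p_k in zip(gains, p_private):
        # Common stream is decoded first, treating all private streams as noise.
        sinr_c = g * p_common / (g * total_private + noise)
        # After removing the common stream, only the other users' private
        # streams remain as interference.
        sinr_p = g * p_k / (g * (total_private - p_k) + noise)
        common_rates.append(math.log2(1 + sinr_c))
        private_rates.append(math.log2(1 + sinr_p))
    # Every user must decode the common stream, so the weakest user limits it.
    return min(common_rates), private_rates

# Hypothetical two-user example with equal gains.
rc, rp = rsma_rates(gains=[1.0, 1.0], p_common=2.0, p_private=[1.0, 1.0])
sum_rate = rc + sum(rp)
```

The minimum over users in the common rate is what ties power allocation to fairness: shifting power between the common and private streams changes how much the weakest user constrains the total.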

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

May 11, 2026

Harsh Raj, Niranjan Orkat, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn, Subhabrata Majumdar

Abstract:This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.

* 33 pages, 5 figures, 2 tables
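
As a rough illustration of what an output-level $U$-statistic can look like, the sketch below averages a symmetric agreement kernel over all unordered pairs of outputs collected from perturbed versions of one task. This is an assumption-laden example, not the paper's estimator: the function names `pairwise_consistency` and `exact_match`, the exact-match kernel, and the sample outputs are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence

def pairwise_consistency(
    outputs: Sequence[str],
    agree: Callable[[str, str], float],
) -> float:
    """Order-2 U-statistic: mean of a symmetric kernel over all unordered pairs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(agree(a, b) for a, b in pairs) / len(pairs)

def exact_match(a: str, b: str) -> float:
    """Hypothetical kernel: 1.0 if the normalized strings are equal, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

# Hypothetical outputs from four semantically equivalent prompts.
outputs = ["Paris", "paris", "Lyon", "Paris "]
score = pairwise_consistency(outputs, exact_match)
print(score)  # 3 agreeing pairs out of 6 -> 0.5
```

Swapping `exact_match` for a semantic similarity function keeps the same estimator while changing what counts as agreement.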

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Jan 17, 2026

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan (+75 more)

Abstract:AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.

Improving Consistency in Large Language Models through Chain of Guidance

Feb 21, 2025

Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar

Abstract:Consistency is a fundamental dimension of trustworthiness in Large Language Models (LLMs). For humans to be able to trust LLM-based applications, their outputs should be consistent when prompted with inputs that carry the same meaning or intent. Despite this need, there is no known mechanism to control and guide LLMs to be more consistent at inference time. In this paper, we introduce a novel alignment strategy to maximize semantic consistency in LLM outputs. Our proposal is based on Chain of Guidance (CoG), a multistep prompting technique that generates highly consistent outputs from LLMs. For closed-book question-answering (Q&A) tasks, when compared to direct prompting, the outputs generated using CoG show improved consistency. While other approaches like template-based responses and majority voting may offer alternative paths to consistency, our work focuses on exploring the potential of guided prompting. We use synthetic data sets comprised of consistent input-output pairs to fine-tune LLMs to produce consistent and correct outputs. Our fine-tuned models are more than twice as consistent compared to base models and show strong generalization capabilities by producing consistent outputs over datasets not used in the fine-tuning process.

* Accepted at Transactions on Machine Learning Research (TMLR) 2025

On the Performance of IRS-Assisted SSK and RPM over Rician Fading Channels

Apr 10, 2024

Harsh Raj, Ugrasen Singh, B. R. Manoj

Abstract:This paper presents index modulation schemes, namely space-shift keying (SSK) and reflection phase modulation (RPM), for intelligent reflecting surface (IRS)-assisted wireless networks. The IRS simultaneously reflects the incoming information signal from the base station and explicitly encodes the local information bits in the reflection phase shift of the IRS elements. The phase shift of the IRS elements is selected from the RPM constellation according to the local data. A joint detection using a maximum-likelihood (ML) decoder is performed for the SSK and RPM symbols over a realistic fading scenario modeled as the Rician fading channel. The pairwise error probability over Rician fading channels is derived and utilized to determine the average bit error rate. In addition, the ergodic capacity of the presented system is derived. The derived analytical results are verified and are in exact agreement with Monte-Carlo simulations.

* 5 pages, 3 figures, to be published in proceedings of IEEE 99th Vehicular Technology Conference (VTC) 2024

Semantic Consistency for Assuring Reliability of Large Language Models

Aug 17, 2023

Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar

Abstract:Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks. However, recent research has highlighted their sensitivity to variations in input prompts. To deploy LLMs in a safe and reliable manner, it is crucial for their outputs to be consistent when prompted with expressions that carry the same meaning or intent. While some existing work has explored how state-of-the-art LLMs address this issue, their evaluations have been confined to assessing lexical equality of single- or multi-word answers, overlooking the consistency of generative text sequences. For a more comprehensive understanding of the consistency of LLMs in open-ended text generation scenarios, we introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs. Our proposal demonstrates significantly higher consistency and stronger correlation with human evaluations of output consistency than traditional metrics based on lexical consistency. Finally, we propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency. When evaluated for closed-book question answering based on answer variations from the TruthfulQA benchmark, A2C increases accuracy metrics for pretrained and finetuned LLMs by up to 47%, and semantic consistency metrics for instruction-tuned models by up to 7-fold.

Measuring Reliability of Large Language Models through Semantic Consistency

Nov 10, 2022

Harsh Raj, Domenic Rosati, Subhabrata Majumdar

Abstract:While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are very sensitive to the prompts that are fed into them. Even when prompts are semantically identical, language models may give very different answers. When considering safe and trustworthy deployments of PLMs, we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some work has looked into how state-of-the-art PLMs address this need, it has been limited to evaluating lexical equality of single- or multi-word answers and does not address consistency of generative text sequences. In order to understand consistency of PLMs under text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions in the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and also correlate with human evaluation of output consistency to a higher degree.

* Accepted and presented at the NeurIPS 2022 ML Safety Workshop, https://neurips2022.mlsafety.org

AskYourDB: An end-to-end system for querying and visualizing relational databases using natural language

Oct 16, 2022

Manu Joseph, Harsh Raj, Anubhav Yadav, Aaryamann Sharma

Abstract:Querying databases for the right information is a time-consuming and error-prone task that often requires experienced professionals. Furthermore, the user needs to have some prior knowledge about the database. There have been various efforts to develop intelligent systems that can help business users query databases directly. Although there have been some successes, very little has been done to test and deploy these systems for real-world users. In this paper, we propose a semantic parsing approach to address the challenge of converting complex natural language into SQL and build a product around it. For this purpose, we modified state-of-the-art models with various pre- and post-processing steps, which make up a significant part of the work when a model is deployed in production. To make the product serviceable to businesses, we added an automatic visualization framework over the queried results.

* 9 pages
