Elman Mansimov

DREAM: Deep Research Evaluation with Agentic Metrics

Feb 21, 2026

Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur(+1 more)

Abstract:Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.

A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Feb 17, 2025

Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba

Abstract:Recent works have demonstrated that incorporating search during inference can significantly improve the reasoning capabilities of language agents. Some approaches make use of the ground truth, while others rely on the model's own generated feedback. The search algorithm then uses this feedback to produce values that update its criterion for exploring and exploiting various reasoning paths. In this study, we investigate how search and the model's self-feedback can be leveraged for reasoning tasks. First, we explore differences between ground-truth feedback and self-feedback during search for math reasoning. Second, we observe limitations in applying search techniques to more complex tasks like tool-calling and design domain-specific approaches to address these gaps. Our experiments reveal challenges related to generalization when solely relying on self-feedback during search. For search to work effectively, either access to the ground truth is needed or feedback mechanisms need to be carefully designed for the specific task.

* Under review

Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Jan 10, 2024

Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, Yi Zhang

Abstract:Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instruction tuning, i.e. tuning models on instructions and sample responses generated by humans (Ouyang et al., 2022), has proven to be an effective method to do so, yet it requires a number of data samples that a) might not be available or b) are costly to generate. Furthermore, this cost increases when the goal is to make the LLM follow a specific workflow within a dialogue instead of single instructions. Inspired by the self-play technique in reinforcement learning and the use of LLMs to simulate human agents, we propose a more effective method for data collection through LLMs engaging in a conversation in various roles. This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning. We introduce an automated way to measure the (partial) success of a dialogue. This metric is used to filter the generated conversational data that is fed back into the LLM for training. Based on our automated and human evaluations of conversation quality, we demonstrate that such self-talk data improves results. In addition, we examine the various characteristics that showcase the quality of generated dialogues and how they can be connected to their potential utility as training data.
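The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the success metric, so everything here is an assumption made for illustration: the `Dialogue` record, the idea of scoring a dialogue by the fraction of required workflow steps it completed, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Dialogue:
    """A generated self-talk conversation (hypothetical record layout)."""
    turns: List[str]
    completed_steps: Set[str]  # workflow steps the agent carried out


def partial_success(dialogue: Dialogue, required_steps: List[str]) -> float:
    """Fraction of required workflow steps completed (assumed metric)."""
    if not required_steps:
        return 0.0
    done = sum(1 for step in required_steps if step in dialogue.completed_steps)
    return done / len(required_steps)


def filter_dialogues(dialogues: List[Dialogue],
                     required_steps: List[str],
                     threshold: float = 0.5) -> List[Dialogue]:
    """Keep only dialogues whose partial-success score reaches the threshold."""
    return [d for d in dialogues
            if partial_success(d, required_steps) >= threshold]
```

Dialogues that pass the filter would then serve as supervised fine-tuning data; a stricter threshold trades data volume for quality.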

Pre-training Intent-Aware Encoders for Zero- and Few-Shot Intent Classification

May 24, 2023

Mujeen Sung, James Gung, Elman Mansimov, Nikolaos Pappas, Raphael Shu, Salvatore Romeo, Yi Zhang, Vittorio Castelli

Abstract:Intent classification (IC) plays an important role in task-oriented dialogue systems as it identifies user intents from given utterances. However, models trained on limited annotations for IC often suffer from a lack of generalization to unseen intent classes. We propose a novel pre-training method for text encoders that uses contrastive learning with intent pseudo-labels to produce embeddings that are well-suited for IC tasks. By applying this pre-training strategy, we also introduce the pre-trained intent-aware encoder (PIE). Specifically, we first train a tagger to identify key phrases within utterances that are crucial for interpreting intents. We then use these extracted phrases to create examples for pre-training a text encoder in a contrastive manner. As a result, our PIE model achieves up to 5.4% and 4.0% higher accuracy than the previous state-of-the-art pre-trained sentence encoder for the N-way zero- and one-shot settings on four IC datasets.

Conversation Style Transfer using Few-Shot Learning

Feb 16, 2023

Shamik Roy, Raphael Shu, Nikolaos Pappas, Elman Mansimov, Yi Zhang, Saab Mansour, Dan Roth

Abstract:Conventional text style transfer approaches for natural language focus on sentence-level style transfer without considering contextual information, and the style is described with attributes (e.g., formality). When applying style transfer on conversations such as task-oriented dialogues, existing approaches suffer from these limitations as context can play an important role and the style attributes are often difficult to define in conversations. In this paper, we introduce conversation style transfer as a few-shot learning problem, where the model learns to perform style transfer by observing only the target-style dialogue examples. We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot. Human evaluation shows that by incorporating multi-turn context, the model is able to match the target style while having better appropriateness and semantic correctness compared to utterance-level style transfer. Additionally, we show that conversation style transfer can also benefit downstream tasks. Results on multi-domain intent classification tasks show improvement in F1 scores after transferring the style of training data to match the style of test data.

Improving Prediction Backward-Compatibility in NLP Model Upgrade with Gated Fusion

Feb 04, 2023

Yi-An Lai, Elman Mansimov, Yuqing Xie, Yi Zhang

Abstract:When upgrading neural models to a newer version, new errors that were not encountered in the legacy version can be introduced, known as regression errors. This inconsistent behavior during model upgrade often outweighs the benefits of accuracy gain and hinders the adoption of new models. To mitigate regression errors from model upgrade, distillation and ensemble have proven to be viable solutions without significant compromise in performance. Despite the progress, these approaches attained an incremental reduction in regression which is still far from achieving backward-compatible model upgrade. In this work, we propose a novel method, Gated Fusion, that promotes backward compatibility via learning to mix predictions between old and new models. Empirical results on two distinct model upgrade scenarios show that our method reduces the number of regression errors by 62% on average, outperforming the strongest baseline by an average of 25%.
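The idea of mixing predictions between old and new models can be sketched as a convex combination of their class probabilities. This is an illustrative sketch only: the abstract says the mixing is learned but does not describe the gate, so the scalar `gate_logit` passed through a sigmoid below is an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a real-valued gate logit to a mixing weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(p_old: List[float], p_new: List[float], gate: float) -> List[float]:
    """Mix two class-probability vectors.

    `gate` is the weight placed on the old model's prediction; the new
    model receives weight 1 - gate. In a learned setting the gate would
    come from a trained component (assumed here to be sigmoid(gate_logit)).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(p_old) != len(p_new):
        raise ValueError("probability vectors must have the same length")
    return [gate * o + (1.0 - gate) * n for o, n in zip(p_old, p_new)]


# Example: a gate logit of 0.0 gives equal weight to both models.
mixed = gated_fusion([0.9, 0.1], [0.2, 0.8], sigmoid(0.0))
```

A gate near 1 reproduces the old model's behavior on that input (avoiding a regression), while a gate near 0 defers to the new model.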

* Camera-ready for EACL 2023 Findings

Backward Compatibility During Data Updates by Weight Interpolation

Jan 25, 2023

Raphael Schumann, Elman Mansimov, Yi-An Lai, Nikolaos Pappas, Xibin Gao, Yi Zhang

Abstract:Backward compatibility of model predictions is a desired property when updating a machine-learning-driven application. It allows the underlying model to be improved seamlessly without introducing regression bugs. In classification tasks, these bugs occur in the form of negative flips: an instance that was correctly classified by the old model is classified incorrectly by the updated model. This has a direct negative impact on the user experience of such systems, e.g., a frequently used voice assistant query is suddenly misclassified. A common reason to update the model is that new training data becomes available and needs to be incorporated. Simply retraining the model with the updated data introduces the unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new model, and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straightforward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
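Both notions in the abstract, negative flips and weight interpolation, are simple enough to sketch. The code below is a minimal illustration under stated assumptions: weights are represented as a plain dictionary of float lists, and the convention that `alpha` is the weight on the old model is a choice made here, not something the abstract specifies.

```python
from typing import Dict, List, Sequence


def interpolate_weights(old: Dict[str, List[float]],
                        new: Dict[str, List[float]],
                        alpha: float) -> Dict[str, List[float]]:
    """Linear interpolation: alpha * old + (1 - alpha) * new, per parameter.

    alpha = 0 returns the new model, alpha = 1 the old one (assumed convention).
    """
    if old.keys() != new.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [alpha * o + (1.0 - alpha) * n
               for o, n in zip(old[name], new[name])]
        for name in old
    }


def negative_flips(labels: Sequence[int],
                   old_preds: Sequence[int],
                   new_preds: Sequence[int]) -> int:
    """Count instances the old model got right and the new model gets wrong."""
    return sum(1 for y, o, n in zip(labels, old_preds, new_preds)
               if o == y and n != y)
```

Because the interpolated weights form a single model of the same shape, inference cost matches that of the new model alone, which is consistent with the abstract's claim that the method does not increase inference cost.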

Dialog2API: Task-Oriented Dialogue with API Description and Example Programs

Dec 20, 2022

Raphael Shu, Elman Mansimov, Tamer Alkhouli, Nikolaos Pappas, Salvatore Romeo, Arshit Gupta, Saab Mansour, Yi Zhang, Dan Roth

Abstract:Functionality and dialogue experience are two important factors of task-oriented dialogue systems. Conventional approaches with closed schema (e.g., conversational semantic parsing) often fail as both the functionality and dialogue experience are strongly constrained by the underlying schema. We introduce a new paradigm for task-oriented dialogue - Dialog2API - to greatly expand the functionality and provide a seamless dialogue experience. The conversational model interacts with the environment by generating and executing programs that trigger a set of pre-defined APIs. The model also manages the dialogue policy and interacts with the user by generating appropriate natural language responses. By allowing the generation of free-form programs, Dialog2API supports composite goals that combine different APIs, whereas unrestricted program revision provides a natural and robust dialogue experience. To facilitate Dialog2API, the core model is provided with API documents, an execution environment and, optionally, some example dialogues annotated with programs. We propose an approach tailored for Dialog2API, where the dialogue states are represented by a stack of programs, with the most recently mentioned program on top of the stack. Dialog2API can work with many application scenarios such as software automation and customer service. In this paper, we construct a dataset for AWS S3 APIs and present evaluation results of in-context learning baselines.
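The stack-of-programs dialogue state can be pictured with a small sketch. This is an assumption-laden illustration, not the paper's implementation: programs are stored as plain strings, the `ProgramStack` class and its methods are hypothetical, and the program strings in the example are made up and do not correspond to real AWS S3 calls.

```python
from typing import List, Optional


class ProgramStack:
    """Dialogue state as a stack; the most recently mentioned program is on top."""

    def __init__(self) -> None:
        self._programs: List[str] = []

    def mention(self, program: str) -> None:
        """Push a program, moving it to the top if it is already present."""
        if program in self._programs:
            self._programs.remove(program)
        self._programs.append(program)

    def revise_top(self, revised: str) -> None:
        """Replace the top program, e.g. after the user corrects an argument."""
        if not self._programs:
            raise IndexError("no program to revise")
        self._programs[-1] = revised

    def top(self) -> Optional[str]:
        """Return the most recently mentioned program, or None if empty."""
        return self._programs[-1] if self._programs else None

    def as_list(self) -> List[str]:
        """Programs from bottom to top of the stack."""
        return list(self._programs)


# Hypothetical program strings, for illustration only.
state = ProgramStack()
state.mention("make_folder(name='reports')")
state.mention("copy_file(src='a.txt', dst='reports')")
state.mention("make_folder(name='reports')")  # mentioned again: moves to top
```

Keeping the latest program on top gives the model an obvious target when the user asks for a revision without naming the program explicitly.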

Label Semantic Aware Pre-training for Few-shot Text Classification

Apr 14, 2022

Aaron Mueller, Jason Krone, Salvatore Romeo, Saab Mansour, Elman Mansimov, Yi Zhang, Dan Roth

Abstract:In text classification tasks, useful information is encoded in the label names. Label semantic aware systems have leveraged this information for improved text classification performance during fine-tuning and prediction. However, use of label-semantics during pre-training has not been extensively explored. We therefore propose Label Semantic Aware Pre-training (LSAP) to improve the generalization and data efficiency of text classification systems. LSAP incorporates label semantics into pre-trained generative models (T5 in our case) by performing secondary pre-training on labeled sentences from a variety of domains. As domain-general pre-training requires large amounts of data, we develop a filtering and labeling pipeline to automatically create sentence-label pairs from unlabeled text. We perform experiments on intent (ATIS, Snips, TOPv2) and topic classification (AG News, Yahoo! Answers). LSAP obtains significant accuracy improvements over state-of-the-art models for few-shot text classification while maintaining performance comparable to state of the art in high-resource settings.

* Accepted at ACL 2022

Measuring and Reducing Model Update Regression in Structured Prediction for NLP

Feb 07, 2022

Deng Cai, Elman Mansimov, Yi-An Lai, Yixuan Su, Lei Shu, Yi Zhang

Abstract:Recent advances in deep learning have led to rapid adoption of machine-learning-based NLP models in a wide range of applications. Despite the continuous gain in accuracy, backward compatibility is also an important aspect for industrial applications, yet it has received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled by its predecessor. This work studies model update regression in structured prediction tasks. We choose syntactic dependency parsing and conversational semantic parsing as representative examples of structured prediction tasks in NLP. First, we measure and analyze model update regression in different model update settings. Next, we explore and benchmark existing techniques for reducing model update regression, including model ensemble and knowledge distillation. We further propose a simple and effective method, Backward-Congruent Re-ranking (BCR), by taking into account the characteristics of structured output. Experiments show that BCR can better mitigate model update regression than model ensemble and knowledge distillation approaches.
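The abstract does not spell out how Backward-Congruent Re-ranking works. One plausible reading of the name, sketched below purely as an assumption, is that the new model proposes several candidate structures and the old model's score decides which one is returned. The function name, the candidate strings, and the scoring table are all hypothetical.

```python
from typing import Callable, Sequence


def rerank_with_old_model(candidates: Sequence[str],
                          old_model_score: Callable[[str], float]) -> str:
    """Return the new model's candidate that the old model scores highest.

    Assumed procedure: `candidates` come from the new model (e.g. its
    top-k outputs) and `old_model_score` is the old model's score for a
    complete output structure.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=old_model_score)


# Hypothetical candidate parses and old-model scores, for illustration only.
old_scores = {"parse_a": -1.2, "parse_b": -0.4, "parse_c": -2.0}
chosen = rerank_with_old_model(list(old_scores), lambda c: old_scores[c])
```

Under this reading, the output is always something the new model can produce, while the choice among candidates is biased toward what the old model would have preferred.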
