Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hsin-Hsi Chen

National Taiwan University

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Feb 08, 2026

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Abstract:To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

* Accepted to Findings of IJCNLP-AACL 2025

Via

Access Paper or Ask Questions

Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations

Feb 01, 2026

Sheng-Lun Wei, Yu-Ling Liao, Yen-Hua Chang, Hen-Hsen Huang, Hsin-Hsi Chen

Abstract:This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours ($\approx$4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss' $κ$), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at https://github.com/ntunlplab/BiasInEar.

* Accepted as a long findings paper at EACL 2026

Via

Access Paper or Ask Questions

Hearing the Order: Investigating Selection Bias in Large Audio-Language Models

Oct 01, 2025

Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen, Hung-yi Lee

Abstract:Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.

* The first two authors contributed equally. Submitted to ICASSP 2026

Via

Access Paper or Ask Questions

Evaluating Large Language Models as Expert Annotators

Aug 11, 2025

Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen

Abstract:Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general domains natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate: whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators? To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others' annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.

* Accepted to COLM 2025

Via

Access Paper or Ask Questions

Are Expert-Level Language Models Expert-Level Annotators?

Oct 04, 2024

Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen

Figure 1 for Are Expert-Level Language Models Expert-Level Annotators?

Figure 2 for Are Expert-Level Language Models Expert-Level Annotators?

Figure 3 for Are Expert-Level Language Models Expert-Level Annotators?

Figure 4 for Are Expert-Level Language Models Expert-Level Annotators?

Abstract:Data annotation refers to the labeling or tagging of textual data with relevant information. A large body of works have reported positive results on leveraging LLMs as an alternative to human annotators. However, existing studies focus on classic NLP tasks, and the extent to which LLMs as data annotators perform in domains requiring expert knowledge remains underexplored. In this work, we investigate comprehensive approaches across three highly specialized domains and discuss practical suggestions from a cost-effectiveness perspective. To the best of our knowledge, we present the first systematic evaluation of LLMs as expert-level data annotators.

* Accepted to WiML @ NeurIPS 2024 (extended version)

Via

Access Paper or Ask Questions

"Why" Has the Least Side Effect on Model Editing

Sep 27, 2024

Tsung-Hsuan Pan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Figure 1 for "Why" Has the Least Side Effect on Model Editing

Figure 2 for "Why" Has the Least Side Effect on Model Editing

Abstract:Training large language models (LLMs) from scratch is an expensive endeavor, particularly as world knowledge continually evolves. To maintain relevance and accuracy of LLMs, model editing has emerged as a pivotal research area. While these methods hold promise, they can also produce unintended side effects. Their underlying factors and causes remain largely unexplored. This paper delves into a critical factor-question type-by categorizing model editing questions. Our findings reveal that the extent of performance degradation varies significantly across different question types, providing new insights for experimental design in knowledge editing. Furthermore, we investigate whether insights from smaller models can be extrapolated to larger models. Our results indicate discrepancies in findings between models of different sizes, suggesting that insights from smaller models may not necessarily apply to larger models. Additionally, we examine the impact of batch size on side effects, discovering that increasing the batch size can mitigate performance drops.

Via

Access Paper or Ask Questions

Co-Trained Retriever-Generator Framework for Question Generation in Earnings Calls

Sep 27, 2024

Yining Juan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Figure 1 for Co-Trained Retriever-Generator Framework for Question Generation in Earnings Calls

Figure 2 for Co-Trained Retriever-Generator Framework for Question Generation in Earnings Calls

Figure 3 for Co-Trained Retriever-Generator Framework for Question Generation in Earnings Calls

Abstract:In diverse professional environments, ranging from academic conferences to corporate earnings calls, the ability to anticipate audience questions stands paramount. Traditional methods, which rely on manual assessment of an audience's background, interests, and subject knowledge, often fall short - particularly when facing large or heterogeneous groups, leading to imprecision and inefficiency. While NLP has made strides in text-based question generation, its primary focus remains on academic settings, leaving the intricate challenges of professional domains, especially earnings call conferences, underserved. Addressing this gap, our paper pioneers the multi-question generation (MQG) task specifically designed for earnings call contexts. Our methodology involves an exhaustive collection of earnings call transcripts and a novel annotation technique to classify potential questions. Furthermore, we introduce a retriever-enhanced strategy to extract relevant information. With a core aim of generating a spectrum of potential questions that analysts might pose, we derive these directly from earnings call content. Empirical evaluations underscore our approach's edge, revealing notable excellence in the accuracy, consistency, and perplexity of the questions generated.

Via

Access Paper or Ask Questions

Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

Sep 25, 2024

Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao

Figure 1 for Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

Figure 2 for Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

Figure 3 for Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

Figure 4 for Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

Abstract:In the era of rapid Internet and social media platform development, individuals readily share their viewpoints online. The overwhelming quantity of these posts renders comprehensive analysis impractical. This necessitates an efficient recommendation system to filter and present significant, relevant opinions. Our research introduces a dual-pronged argument mining technique to improve recommendation system effectiveness, considering both professional and amateur investor perspectives. Our first strategy involves using the discrepancy between target and closing prices as an opinion indicator. The second strategy applies argument mining principles to score investors' opinions, subsequently ranking them by these scores. Experimental results confirm the effectiveness of our approach, demonstrating its ability to identify opinions with higher profit potential. Beyond profitability, our research extends to risk analysis, examining the relationship between recommended opinions and investor behaviors. This offers a holistic view of potential outcomes following the adoption of these recommended opinions.

Via

Access Paper or Ask Questions

Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

Sep 25, 2024

Chr-Jr Chiu, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Figure 1 for Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

Figure 2 for Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

Figure 3 for Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

Figure 4 for Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

Abstract:Understanding the duration of news events' impact on the stock market is crucial for effective time-series forecasting, yet this facet is largely overlooked in current research. This paper addresses this research gap by introducing a novel dataset, the Impact Duration Estimation Dataset (IDED), specifically designed to estimate impact duration based on investor opinions. Our research establishes that pre-finetuning language models with IDED can enhance performance in text-based stock movement predictions. In addition, we juxtapose our proposed pre-finetuning task with sentiment analysis pre-finetuning, further affirming the significance of learning impact duration. Our findings highlight the promise of this novel research direction in stock movement prediction, offering a new avenue for financial forecasting. We also provide the IDED and pre-finetuned language models under the CC BY-NC-SA 4.0 license for academic use, fostering further exploration in this field.

* NTCIR-18 FinArg-2 Dataset

Via

Access Paper or Ask Questions

Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models

Jun 05, 2024

Sheng-Lun Wei, Cheng-Kuang Wu, Hen-Hsen Huang, Hsin-Hsi Chen

Abstract:In this paper, we investigate the phenomena of "selection biases" in Large Language Models (LLMs), focusing on problems where models are tasked with choosing the optimal option from an ordered sequence. We delve into biases related to option order and token usage, which significantly impact LLMs' decision-making processes. We also quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks. Furthermore, we propose mitigation strategies to enhance model performance. Our key contributions are threefold: 1) Precisely quantifying the influence of option order and token on LLMs, 2) Developing strategies to mitigate the impact of token and order sensitivity to enhance robustness, and 3) Offering a detailed analysis of sensitivity across models and tasks, which informs the creation of more stable and reliable LLM applications for selection problems.

* Accepted as a long findings paper at ACL 2024

Via

Access Paper or Ask Questions