Jaeyoung Choe

Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents

May 26, 2025

Jaeyoung Choe, Jihoon Kim, Woohwan Jung

Abstract: Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts: it retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process then removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.

* ACL 2025 (Findings)
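
The two-stage idea in the abstract above (retrieve documents first, then pick passages within them, then filter out irrelevant passages) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token-overlap score, the function names `hierarchical_retrieve` and `curate`, the `min_score` threshold, and the toy corpus are all assumptions made for illustration. The complementary-query step is omitted here because the abstract does not specify how those queries are generated.

```python
from collections import Counter

def tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip(".,?!()").lower() for t in text.split() if t.strip(".,?!()")]

def overlap(query, text):
    """Toy relevance score (an assumption): count of shared tokens."""
    q, t = Counter(tokens(query)), Counter(tokens(text))
    return sum(min(q[w], t[w]) for w in q)

def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=2):
    """Stage 1 ranks whole documents; stage 2 ranks passages inside the
    selected documents only, so boilerplate from other filings is excluded."""
    doc_scores = {d: overlap(query, " ".join(ps)) for d, ps in corpus.items()}
    docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_docs]
    candidates = [(d, p) for d in docs for p in corpus[d]]
    candidates.sort(key=lambda dp: overlap(query, dp[1]), reverse=True)
    return candidates[:top_passages]

def curate(query, passages, min_score=2):
    """Hypothetical curation step: drop passages scoring below a threshold."""
    return [(d, p) for d, p in passages if overlap(query, p) >= min_score]

# Toy corpus (invented data): two filings sharing identical boilerplate.
corpus = {
    "ACME_10K_2023": [
        "ACME total revenue for fiscal 2023 was 120 million.",
        "Forward-looking statements involve risks and uncertainties.",
    ],
    "BETA_10K_2023": [
        "BETA total revenue for fiscal 2023 was 95 million.",
        "Forward-looking statements involve risks and uncertainties.",
    ],
}
query = "What was ACME total revenue in fiscal 2023?"

hits = hierarchical_retrieve(query, corpus)
evidence = curate(query, hits)
```

In this sketch, restricting passage selection to the top-ranked document keeps the near-identical BETA passage out of the candidate set, and the curation threshold then drops the boilerplate passage that shares no tokens with the query.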

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

Oct 20, 2023

Jaeyoung Choe, Keonwoong Noh, Nayeon Kim, Seyun Ahn, Woohwan Jung

Abstract: Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as the biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to subpar generalization performance, with the result that general-purpose PLMs, including BERT, often outperform financial PLMs on many downstream tasks. To address this issue, we collected a broad range of financial corpora and trained the Financial Language Model (FiLM) on these diverse datasets. Our experimental results confirm that FiLM outperforms not only existing financial PLMs but also general-domain PLMs. Furthermore, we provide empirical evidence that this improvement can be achieved even for unseen corpus groups.

* Accepted to EMNLP 2023 (Findings)
