Hongseok Oh

Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs' Legal Reasoning Capabilities

Dec 31, 2025

Hongseok Oh, Wonseok Hwang, Kyoung-Woon On

Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, 283 multiple-choice questions with 1,103 aligned precedents, and (2) KCL-Essay, 169 open-ended generation questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows that large gaps remain, particularly on KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.


LARGE: Legal Retrieval Augmented Generation Evaluation Tool

Apr 02, 2025

Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang

Abstract: Building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has recently become common practice. In the legal domain especially, previous judicial decisions play a significant role under the doctrine of stare decisis, which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of a RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and to investigate how changes in these five components affect overall accuracy. We validated LRAGE on multilingual legal benchmarks, including Korean (KBL), English (LegalBench), and Chinese (LawBench), by demonstrating how overall accuracy changes when the five components are varied. The source code is available at https://github.com/hoorangyee/LRAGE.

* 12 pages


Vision-Encoders (Already) Know What They See: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore

Feb 27, 2025

Hongseok Oh, Wonseok Hwang

Abstract: Large Vision-Language Models (LVLMs) have recently shown remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the primary cause of such hallucination lies in the limited representational capacity of the vision encoder. Our analysis reveals that the capacity of the vision encoder itself is already sufficient for detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun phrase level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further validate F-CLIPScore by showing that an LVLM trained on data filtered with F-CLIPScore exhibits reduced hallucination.

* 4 pages
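
The abstract states only that F-CLIPScore incorporates text embeddings at the noun phrase level; it does not give the aggregation rule. The following is a minimal sketch under a loud assumption: the image-text cosine similarity is averaged over the full caption and each of its noun phrases. The function names and the averaging rule are hypothetical, and the embeddings are assumed to be precomputed by some CLIP-style encoder.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def fine_grained_score(
    image_emb: Sequence[float],
    caption_emb: Sequence[float],
    phrase_embs: Sequence[Sequence[float]],
) -> float:
    """Average image-text similarity over the caption and its noun phrases.

    ASSUMPTION: simple averaging is an illustrative choice, not a rule
    confirmed by the abstract.
    """
    sims = [cosine(image_emb, caption_emb)]
    sims.extend(cosine(image_emb, p) for p in phrase_embs)
    return sum(sims) / len(sims)


# Toy embeddings: the second noun phrase is orthogonal to the image,
# mimicking a hallucinated object that lowers the score.
score = fine_grained_score([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the caption as a whole matches the image, but one noun phrase does not, so the phrase-level terms pull the score below the caption-only similarity.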


Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation

Oct 23, 2024

Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park

Abstract: In this study, we introduce Unified Microphone Conversion, a unified generative framework to enhance the resilience of sound event classification systems against device variability. To address the limitations of previous work, we condition the generator network on frequency response information to achieve many-to-many device mapping. This approach overcomes an inherent limitation of CycleGAN, which requires a separate model for each device pair. Our framework leverages the strengths of CycleGAN for unpaired training to simulate device characteristics in audio recordings and significantly extends its scalability by integrating frequency-response-related information via Feature-wise Linear Modulation. The experimental results show that our method outperforms the state-of-the-art method by 2.6% and reduces variability by 0.8% in macro-average F1 score.

* Currently under review for ICASSP 2025
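
Feature-wise Linear Modulation (FiLM) scales and shifts each feature channel using parameters predicted from a conditioning input. The following is a minimal sketch of that operation only, not the authors' generator: the `linear` helper, the weight values, and the two-element conditioning vector standing in for frequency response information are all hypothetical.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Tiny dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Sequence[Sequence[float]], gamma: Sequence[float],
         beta: Sequence[float]) -> List[List[float]]:
    """Apply y = gamma_c * x + beta_c to every value in channel c."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical conditioning vector (e.g. a coarse frequency response).
cond = [1.0, 2.0]
# Hypothetical learned weights mapping the condition to per-channel params.
gamma = linear(cond, [[2.0, 0.0], [0.0, 0.25]], [0.0, 0.0])  # [2.0, 0.5]
beta = linear(cond, [[1.0, 0.0], [0.0, -0.5]], [0.0, 0.0])   # [1.0, -1.0]

features = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two values each
modulated = film(features, gamma, beta)
```

Because the device-specific behavior enters only through `gamma` and `beta`, a single generator can in principle serve many device pairs by changing the conditioning input, which is the scalability argument the abstract makes.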


Does Alignment Tuning Really Break LLMs' Internal Confidence?

Aug 31, 2024

Hongseok Oh, Wonseok Hwang

Abstract: Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
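
The abstract does not name the calibration metrics it compares. As one common example, expected calibration error (ECE) with equal-width confidence bins can be sketched as follows; the choice of ECE, the bin count, and the function name are assumptions for illustration, not the paper's stated setup.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        accuracy = sum(correct[i] for i in members) / len(members)
        mean_conf = sum(confidences[i] for i in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Two predictions at 0.9 confidence (both right) and two at 0.6 (one right).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

The result depends on how confidences are extracted and binned, which is one reason the abstract calls for care when measuring confidences and calibration errors.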


On the Consideration of AI Openness: Can Good Intent Be Abused?

Mar 11, 2024

Yeeun Kim, Eunkyung Choi, Hyunjun Kim, Hongseok Oh, Hyunseo Shin, Wonseok Hwang

Abstract: Openness is critical for the advancement of science. In particular, recent rapid progress in AI has been made possible only by various open-source models, datasets, and libraries. However, this openness also means that technologies can be freely used for socially harmful purposes. Can open-source models or datasets be used for malicious purposes? If so, how easy is it to adapt technology for such goals? Here, we conduct a case study in the legal domain, a realm where individual decisions can have profound social consequences. To this end, we build EVE, a dataset consisting of 200 examples of questions and corresponding answers about criminal activities based on 200 Korean precedents. We found that a widely accepted open-source LLM, which initially refuses to answer unethical questions, can be easily tuned with EVE to provide unethical and informative answers about criminal activities. This implies that although open-source technologies contribute to scientific progress, some care must be taken to mitigate possible malicious use cases. Warning: This paper contains content that some may find unethical.

* 10 pages


Microphone Conversion: Mitigating Device Variability in Sound Event Classification

Jan 12, 2024

Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park

Abstract: In this study, we introduce a new augmentation technique that uses CycleGAN to enhance the resilience of sound event classification (SEC) systems against device variability. We also present a unique dataset to evaluate this method. As SEC systems become increasingly common, it is crucial that they work well with audio from diverse recording devices. Our method addresses limited device diversity in training data by enabling unpaired training to transform input spectrograms as if they were recorded on a different device. Our experiments show that our approach outperforms existing methods in generalization by 5.2% to 11.5% in weighted F1 score. Additionally, it surpasses current methods in adaptability across diverse recording devices, achieving a 6.5% to 12.8% improvement in weighted F1 score.

* Accepted to ICASSP 2024
