Ukyo Honda

Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

Apr 21, 2026

Sho Hoshino, Ukyo Honda, Peinan Zhang

Abstract: While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.

* ACL 2026
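
For readers unfamiliar with the technique, self-consistency samples several chain-of-thought completions for one question and keeps the answer that appears most often. The sketch below illustrates only that voting step and is not the authors' evaluation code; the function name `self_consistency` and the hard-coded answer letters are hypothetical stand-ins for answers parsed from sampled model outputs.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent answer among the sampled ones.

    `answers` holds one final answer per sampled reasoning path
    (for a multiple-choice benchmark, a letter such as "A" to "D").
    Ties are resolved in favour of the answer seen first.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers parsed from five sampled completions.
sampled = ["B", "C", "B", "A", "B"]
print(self_consistency(sampled))  # prints "B"
```

With a single sample this reduces to ordinary CoT prompting, which is why the two settings can be compared on the same split.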

Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

Apr 10, 2026

Tokio Kajitsuka, Ukyo Honda, Sho Takase

Abstract: Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.

* 19 pages, 6 figures

Exploring the Relationship Between Diversity and Quality in Ad Text Generation

May 22, 2025

Yoichi Aoki, Soichiro Murakami, Ukyo Honda, Akihiko Kato

Abstract: In natural language generation for advertising, creating diverse and engaging ad texts is crucial for capturing a broad audience and avoiding advertising fatigue. Despite the importance of diversity, the impact of diversity-enhancing methods in ad text generation -- methods that have mainly been tested on tasks such as summarization and machine translation -- has not been thoroughly explored. Ad text generation differs significantly from these tasks owing to its text style and requirements. This research explores the relationship between diversity and ad quality in ad text generation by considering multiple factors, such as diversity-enhancing methods, their hyperparameters, input-output formats, and the models.

FaithCAMERA: Construction of a Faithful Dataset for Ad Text Generation

Oct 04, 2024

Akihiko Kato, Masato Mita, Soichiro Murakami, Ukyo Honda, Sho Hoshino, Peinan Zhang

Abstract: In ad text generation (ATG), desirable ad text is both faithful and informative. That is, it should be faithful to the input document, while at the same time containing important information that appeals to potential customers. The existing evaluation data, CAMERA (arXiv:2309.12030), is suitable for evaluating informativeness, as it consists of reference ad texts created by ad creators. However, these references often include information unfaithful to the input, which is a notable obstacle in promoting ATG research. In this study, we collaborate with in-house ad creators to refine the CAMERA references and develop an alternative ATG evaluation dataset called FaithCAMERA, in which the faithfulness of references is guaranteed. Using FaithCAMERA, we can evaluate how well existing methods for improving faithfulness can generate informative ad text while maintaining faithfulness. Our experiments show that removing training data that contains unfaithful entities improves the faithfulness and informativeness at the entity level, but decreases both at the sentence level. This result suggests that for future ATG research, it is essential not only to scale the training data but also to ensure their faithfulness. Our dataset will be publicly available.

* For dataset, see https://github.com/CyberAgentAILab/FaithCAMERA

Not Eliminate but Aggregate: Post-Hoc Control over Mixture-of-Experts to Address Shortcut Shifts in Natural Language Understanding

Jun 17, 2024

Ukyo Honda, Tatsushi Oka, Peinan Zhang, Masato Mita

Abstract: Recent models for natural language understanding are inclined to exploit simple patterns in datasets, commonly known as shortcuts. These shortcuts hinge on spurious correlations between labels and latent features existing in the training data. At inference time, shortcut-dependent models are likely to generate erroneous predictions under distribution shifts, particularly when some latent features are no longer correlated with the labels. To avoid this, previous studies have trained models to eliminate the reliance on shortcuts. In this study, we explore a different direction: pessimistically aggregating the predictions of a mixture-of-experts, assuming each expert captures relatively different latent features. The experimental results demonstrate that our post-hoc control over the experts significantly enhances the model's robustness to the distribution shift in shortcuts. In addition, we show that our approach has some practical advantages. We also analyze our model and provide results to support the assumption.

* Accepted to TACL (pre-MIT Press publication version, 21 pages, 5 figures)
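
To make "pessimistically aggregating" concrete, the sketch below shows one possible reading: score each label by the lowest probability any expert assigns to it, then predict the label with the best worst-case score. This maximin rule is an assumption made for illustration and may differ from the aggregation used in the paper; `pessimistic_predict`, `average_predict`, and the probability values are all hypothetical.

```python
def pessimistic_predict(expert_probs):
    """Pick the label whose worst-case probability across experts is highest.

    `expert_probs` is a list with one probability vector per expert,
    all over the same label set. Assumed maximin rule, for illustration only.
    """
    num_labels = len(expert_probs[0])
    worst_case = [min(p[k] for p in expert_probs) for k in range(num_labels)]
    return max(range(num_labels), key=lambda k: worst_case[k])


def average_predict(expert_probs):
    """Baseline for comparison: pick the label with the highest mean probability."""
    num_labels = len(expert_probs[0])
    mean = [sum(p[k] for p in expert_probs) / len(expert_probs)
            for k in range(num_labels)]
    return max(range(num_labels), key=lambda k: mean[k])


# Expert 0 is over-confident in label 0 (as a shortcut-reliant expert might be);
# expert 1 gives label 0 no support at all.
experts = [
    [0.9, 0.1, 0.0],
    [0.0, 0.4, 0.6],
]
print(average_predict(experts))      # prints 0
print(pessimistic_predict(experts))  # prints 1
```

In this toy case the averaged prediction follows the over-confident expert, while the worst-case rule prefers the only label that every expert gives some support.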

Annotation-Efficient Preference Optimization for Language Model Alignment

May 22, 2024

Yuu Jinnai, Ukyo Honda

Abstract: Preference optimization is a standard approach to fine-tuning large language models to align with human preferences. The quality, diversity, and quantity of the preference dataset are critical to the effectiveness of preference optimization. However, obtaining a large amount of high-quality and diverse preference annotations is difficult in many applications. This raises the question of how to use the limited annotation budget to create an effective preference dataset. To this end, we propose Annotation-Efficient Preference Optimization (AEPO). Instead of exhaustively annotating preference over all available response texts, AEPO selects a subset of responses that maximizes quality and diversity from the available responses, and then annotates preference over the selected ones. In this way, AEPO focuses the annotation budget on labeling preference over a smaller subset of responses with diversity and of high quality. We evaluate the performance of Direct Preference Optimization (DPO) using AEPO and show that it outperforms models trained using a standard DPO with the same annotation budget. Our code is available at https://github.com/CyberAgentAILab/annotation-efficient-po
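
The subset-selection step can be pictured with a small greedy sketch. It is not the AEPO implementation from the linked repository: the greedy procedure, the Jaccard word-overlap distance, the `weight` parameter, and the quality scores are all assumptions chosen to show how quality and diversity might be traded off when choosing which responses to annotate.

```python
def jaccard_distance(a, b):
    """1 minus the word-set overlap of two texts (assumed diversity measure)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def select_for_annotation(responses, quality, k, weight=1.0):
    """Greedily choose k responses that are high quality and mutually dissimilar.

    `quality[i]` is a hypothetical quality score for `responses[i]`.
    Each step adds the response maximizing
    quality + weight * (distance to the closest already-selected response).
    """
    selected = []
    remaining = list(range(len(responses)))
    while remaining and len(selected) < k:
        def gain(i):
            diversity = min(
                (jaccard_distance(responses[i], responses[j]) for j in selected),
                default=0.0,  # nothing selected yet: rank by quality alone
            )
            return quality[i] + weight * diversity

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [responses[i] for i in selected]


responses = ["the cat sat", "the cat sat down", "dogs bark loudly"]
quality = [0.9, 0.85, 0.5]
print(select_for_annotation(responses, quality, k=2))
# prints ['the cat sat', 'dogs bark loudly']
```

Setting `weight=0.0` reduces the sketch to picking the top-k responses by quality, which makes the effect of the diversity term easy to inspect.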

Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation

May 02, 2024

Hao Wang, Tetsuro Morimura, Ukyo Honda, Daisuke Kawahara

Abstract: Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and the difficulty of accurately capturing dependencies between target words. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models' training.

On the True Distribution Approximation of Minimum Bayes-Risk Decoding

Mar 31, 2024

Atsumoto Ohashi, Ukyo Honda, Tetsuro Morimura, Yuu Jinnai

Abstract: Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation. MBR decoding considers texts sampled from a model as pseudo-references and selects the text with the highest similarity to the others. Therefore, sampling is one of the key elements of MBR decoding, and previous studies reported that the performance varies with the sampling method. From a theoretical standpoint, this performance variation is likely tied to how closely the samples approximate the true distribution of references. However, this approximation has not been the subject of in-depth study. In this study, we propose using anomaly detection to measure the degree of approximation. We first closely examine the performance variation and then show that previous hypotheses about samples do not correlate well with the variation, but our introduced anomaly scores do. The results are the first to empirically support the link between the performance and the core assumption of MBR decoding.

* NAACL 2024 (main conference)
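
The selection rule described in the abstract (treat sampled texts as pseudo-references and pick the one most similar to the others) can be sketched as follows. This is a minimal illustration, not the authors' code: `mbr_decode` is a hypothetical name, and the Jaccard word-overlap utility is an assumed stand-in for whatever task-appropriate similarity metric is used in practice.

```python
def jaccard_similarity(a, b):
    """Word-set overlap between two texts (assumed utility function)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mbr_decode(samples, utility=jaccard_similarity):
    """Return the sample with the highest total utility against the other samples."""
    if not samples:
        raise ValueError("at least one sample is required")

    def score(i):
        # Every other sample acts as a pseudo-reference for candidate i.
        return sum(utility(samples[i], samples[j])
                   for j in range(len(samples)) if j != i)

    best = max(range(len(samples)), key=score)
    return samples[best]


samples = ["the cat sat", "the cat sat down", "a cat sat down", "dogs bark"]
print(mbr_decode(samples))  # prints "the cat sat down"
```

Because the chosen output depends entirely on which texts are in `samples`, the sketch also shows why the sampling method matters so much for MBR decoding.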

A Single Linear Layer Yields Task-Adapted Low-Rank Matrices

Mar 22, 2024

Hwichan Kim, Shota Sasaki, Sho Hoshino, Ukyo Honda

Abstract: Low-Rank Adaptation (LoRA) is a widely used Parameter-Efficient Fine-Tuning (PEFT) method that updates an initial weight matrix $W_0$ with a delta matrix $\Delta W$ composed of two low-rank matrices $A$ and $B$. A previous study suggested that there is a correlation between $W_0$ and $\Delta W$. In this study, we aim to delve deeper into the relationships between $W_0$ and the low-rank matrices $A$ and $B$ to further comprehend the behavior of LoRA. In particular, we analyze a conversion matrix that transforms $W_0$ into low-rank matrices, which encapsulates information about the relationships. Our analysis reveals that the conversion matrices are similar across layers. Inspired by these findings, we hypothesize that a single linear layer, which takes each layer's $W_0$ as input, can yield task-adapted low-rank matrices. To confirm this hypothesis, we devise a method named Conditionally Parameterized LoRA (CondLoRA) that updates initial weight matrices with low-rank matrices derived from a single linear layer. Our empirical results show that CondLoRA maintains a performance on par with LoRA, despite the fact that the trainable parameters of CondLoRA are fewer than those of LoRA. Therefore, we conclude that "a single linear layer yields task-adapted low-rank matrices."

* Accepted at LREC-COLING 2024
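
As background for the abstract, the standard LoRA update can be written as $W = W_0 + \Delta W$ with $\Delta W = BA$, where $B$ and $A$ share a small inner dimension (the rank). The pure-Python sketch below shows only that plain LoRA update and its parameter count; it does not implement CondLoRA, and the helper names and toy matrices are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_update(w0, b, a):
    """Return W0 + B A, the adapted weight matrix."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


w0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # frozen 3x3 weight
b = [[1], [0], [2]]                     # 3x1, rank 1
a = [[0.5, 0, -1]]                      # 1x3, rank 1
print(lora_update(w0, b, a))

# For a 4096 x 4096 layer, a rank-8 adapter trains far fewer parameters
# than the full matrix.
print(lora_param_count(4096, 4096, 8), 4096 * 4096)  # prints 65536 16777216
```

The parameter count grows linearly in the layer dimensions rather than with their product, which is the saving that PEFT methods of this kind rely on.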

Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding

Jan 10, 2024

Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, Peinan Zhang

Abstract: One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse. Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed for generating diverse outputs are predominantly based on beam search or random sampling, so their output quality is capped by these underlying methods. In this paper, we investigate an alternative approach -- we develop diversity-promoting decoding algorithms by imposing diversity objectives on MBR decoding. We propose two variants of MBR, Diverse MBR (DMBR) and $k$-medoids MBR (KMBR), methods to generate a set of sentences with high quality and diversity. We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a large language model with prompting. The experimental results show that the proposed methods achieve a better trade-off than the diverse beam search and sampling algorithms.
