Makoto Morishita

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Apr 23, 2026

Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe

Abstract:Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

* Accepted to ACL 2026 Main

TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks

Jan 30, 2026

Ryo Fujii, Makoto Morishita, Kazuki Yano, Jun Suzuki

Abstract:With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine-bench, a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human-verified subset to ensure problem solvability. We evaluated agent-based baselines built on top of 11 models, including both strong open-weight and state-of-the-art LLMs on this verified subset. Our results indicated that, while LLMs show some promise for migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool-use strategies. Our dataset and implementation are available at https://github.com/tohoku-nlp/timemachine-bench.

* Accepted to EACL 2026 Main, camera-ready

Long-Tail Crisis in Nearest Neighbor Language Models

Mar 28, 2025

Yuto Nishida, Makoto Morishita, Hiroyuki Deguchi, Hidetaka Kamigaito, Taro Watanabe

Abstract:The $k$-nearest-neighbor language model ($k$NN-LM), a retrieval-augmented language model, improves perplexity on a given text by directly accessing a large datastore built from arbitrary text data during inference. A widely held hypothesis for the success of $k$NN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena. However, prior work has primarily shown its ability to retrieve long-tail contexts, leaving its performance in estimating the probabilities of long-tail target tokens during inference underexplored. In this paper, we investigate the behavior of $k$NN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, token distribution in the datastore, and the approximation error of product quantization. Our experimental results reveal that $k$NN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens, regardless of long-tail contexts in the datastore.

* Accepted to NAACL 2025 Findings
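
As a rough illustration of the mechanism the abstract refers to, the sketch below follows the commonly described $k$NN-LM formulation: retrieved neighbors are turned into a token distribution by weighting each one with the exponential of its negative distance, and that distribution is linearly mixed with the base LM's. The function names, the `temperature` parameter, the mixing weight `lam`, and all numbers are illustrative assumptions, not values or code from the paper.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (token, distance) pairs into a distribution over tokens.

    Each neighbor gets weight exp(-distance / temperature); weights of
    neighbors that share the same target token are summed, then normalized.
    """
    weights = defaultdict(float)
    for token, distance in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1.0 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }


# Toy example: made-up base LM probabilities and retrieved neighbors.
p_lm = {"the": 0.6, "cat": 0.3, "axolotl": 0.1}
neighbors = [("the", 1.0), ("the", 1.5), ("axolotl", 4.0)]
p_knn = knn_distribution(neighbors)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

In this toy setup the rare token is retrieved only at a large distance, so it receives little weight from the datastore; this is only meant to make the interpolation concrete, not to reproduce the paper's analysis.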

MQM-Chat: Multidimensional Quality Metrics for Chat Translation

Aug 29, 2024

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui

Abstract:The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through experiments on five models using MQM-Chat, we observed that all models produced certain fundamental errors, while each had its own shortcomings, such as omission, overcorrection of ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.

An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication

Aug 28, 2024

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui

* IJCNLP-AACL 2023 Student Research Workshop

Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs

Aug 08, 2024

Masashi Oshika, Makoto Morishita, Tsutomu Hirao, Ryohei Sasano, Koichi Takeda

Abstract:In recent years, neural machine translation (NMT) has been widely used in everyday life. However, current NMT lacks a mechanism to adjust the difficulty level of translations to match the user's language level. Additionally, due to bias in the NMT training data, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces words with a high Age of Acquisition (AoA) in translations with simpler words to match the translations to the user's level. We achieve this by using large language models (LLMs), providing a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.

* Findings of ACL 2024
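
The iterative replacement described in the abstract can be pictured with a minimal sketch like the one below. It assumes an AoA lookup table, a threshold, and a `replace_word` callable that stands in for the LLM call receiving the (source, translation, target word) triple. The table values, the threshold, the highest-AoA-first ordering, and the toy replacer are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict


def _key(word: str) -> str:
    """Normalize a surface word for AoA lookup (lowercase, strip punctuation)."""
    return word.lower().strip(".,")


def simplify(
    source: str,
    translation: str,
    aoa: Dict[str, float],
    threshold: float,
    replace_word: Callable[[str, str, str], str],
    max_rounds: int = 10,
) -> str:
    """Repeatedly replace the highest-AoA word until none exceeds the threshold."""
    for _ in range(max_rounds):
        hard = [w for w in translation.split() if aoa.get(_key(w), 0.0) > threshold]
        if not hard:
            break
        target = max(hard, key=lambda w: aoa[_key(w)])
        new_translation = replace_word(source, translation, target)
        if new_translation == translation:
            break  # the rewriter made no progress; stop rather than loop
        translation = new_translation
    return translation


# Made-up AoA values and a dictionary-based stand-in for the LLM rewriter.
toy_aoa = {"purchased": 9.5, "automobile": 8.0, "bought": 4.0, "car": 3.0}
toy_synonyms = {"purchased": "bought", "automobile": "car"}


def toy_replace(source: str, translation: str, target: str) -> str:
    key = _key(target)
    return translation.replace(key, toy_synonyms.get(key, key))


result = simplify(
    "(source sentence)", "She purchased the automobile.", toy_aoa, 6.0, toy_replace
)
```

With the toy inputs, two rounds are needed: one for each word above the threshold.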

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

May 15, 2024

Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

Abstract:Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

* Work in progress

WikiSplit++: Easy Data Refinement for Split and Rephrase

Apr 13, 2024

Hayato Tsukagoshi, Tsutomu Hirao, Makoto Morishita, Katsuki Chousa, Ryohei Sasano, Koichi Takeda

Abstract:The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.

* Accepted at LREC-COLING 2024
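
The two refinement steps named in the abstract (dropping instances with a non-entailed simple sentence, then reversing the reference order) might look roughly like the sketch below. The `entails` callable is a placeholder for an NLI model, and `toy_entails` is a crude word-overlap stand-in used only so the example runs; neither is the authors' implementation, and the exact filtering and reversal details are assumptions.

```python
from typing import Callable, Iterable, List, Tuple

Instance = Tuple[str, List[str]]  # (complex sentence, reference simple sentences)


def refine(
    instances: Iterable[Instance],
    entails: Callable[[str, str], bool],
) -> List[Instance]:
    """Keep instances whose simple sentences are all entailed, then reverse them."""
    refined = []
    for complex_sentence, simple_sentences in instances:
        # Drop the instance if any simple sentence is not entailed.
        if not all(entails(complex_sentence, s) for s in simple_sentences):
            continue
        # Reverse the order of the reference simple sentences.
        refined.append((complex_sentence, list(reversed(simple_sentences))))
    return refined


def _words(text: str) -> List[str]:
    return text.lower().replace(".", "").replace(",", "").split()


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Crude stand-in for an NLI model: every hypothesis word occurs in the premise."""
    premise_words = set(_words(premise))
    return all(w in premise_words for w in _words(hypothesis))


data = [
    ("Ann wrote the book, and Bob drew the cover.",
     ["Ann wrote the book.", "Bob drew the cover."]),
    ("Ann wrote the book, and Bob drew the cover.",
     ["Ann wrote the book.", "Bob sold the cover."]),
]
refined = refine(data, toy_entails)
```

Here the second instance is removed because "Bob sold the cover." is not supported by the complex sentence, and the surviving instance has its references reversed.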

Generating Diverse Translation with Perturbed kNN-MT

Feb 14, 2024

Yuto Nishida, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe

Abstract:Generating multiple translation candidates would enable users to choose the one that satisfies their needs. Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem -- the model underestimates a prediction that is largely different from the training data, even if that prediction is likely. This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT). Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem. Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation's magnitude.

* Accepted to EACL 2024 SRW

Refactoring Programs Using Large Language Models with Few-Shot Examples

Nov 20, 2023

Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, Yutaka Watanobe

Abstract:Less complex, more straightforward programs are easier to maintain and make it easier to write secure, bug-free code. However, because code refactoring is laborious and risks breaking working programs, programmers are reluctant to do it, and thus lose potential learning experiences. To mitigate this, we demonstrate the use of a large language model (LLM), GPT-3.5, to suggest less complex versions of user-written Python programs, aiming to encourage users to learn how to write better programs. We propose a method that leverages few-shot prompting of the LLM by selecting the best-suited code refactoring examples for each target programming problem, based on a prior evaluation of prompting with one-shot examples. The quantitative evaluation shows that 95.68% of programs can be refactored by generating 10 candidates each, resulting in a 17.35% reduction in the average cyclomatic complexity and a 25.84% decrease in the average number of lines after keeping only the generated programs that are semantically correct. Furthermore, the qualitative evaluation shows outstanding capability in code formatting, while unnecessary behaviors such as deleting or translating comments are also observed.

* 10 pages, 10 figures, accepted to the 30th Asia-Pacific Software Engineering Conference (APSEC 2023)
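
To make the cyclomatic-complexity figure more concrete, here is a minimal sketch that approximates the metric for Python source using the standard `ast` module, counting one plus the number of decision points. The set of counted node types is a simplifying assumption, the abstract does not say which tool the authors used, and the `before`/`after` snippets are invented for illustration.

```python
import ast

# Node types counted as one extra decision point each (a simplifying assumption).
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


before = """
def sign(x):
    if x > 0:
        return 1
    else:
        if x < 0:
            return -1
        else:
            return 0
"""

after = """
def sign(x):
    return (x > 0) - (x < 0)
"""

# Relative reduction in complexity between the two versions.
reduction = 1 - cyclomatic_complexity(after) / cyclomatic_complexity(before)
```

Averaging such per-program reductions over a set of semantically correct refactorings would give a number comparable in spirit to the one reported, under the stated assumptions.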
