Jun Suzuki

Detecting Response Generation Not Requiring Factual Judgment

Jun 14, 2024

Ryohei Kamei, Daiki Shiono, Reina Akama, Jun Suzuki

Abstract: With the remarkable development of large language models (LLMs), ensuring the factuality of output has become a challenge. However, grounding every part of a response in given knowledge or facts is not necessarily desirable in dialogue. This study aims to achieve both attractiveness and factuality in dialogue responses. To that end, we set a task of predicting sentences that do not require factual correctness judgment, such as agreement or personal opinions and feelings. We created a dataset, the dialogue dataset annotated with fact-check-needed label (DDFC), for this task via crowdsourcing, and ran the classification task on several models using this dataset. The best-performing model achieved a classification accuracy of about 88%.

A Large Collection of Model-generated Contradictory Responses for Consistency-aware Dialogue Systems

Mar 19, 2024

Shiki Sato, Reina Akama, Jun Suzuki, Kentaro Inui

Abstract: Mitigating the generation of contradictory responses poses a substantial challenge in dialogue response generation. The quality and quantity of available contradictory response data play a vital role in suppressing these contradictions, offering two significant benefits. First, access to a large amount of contradiction data enables a comprehensive examination of its characteristics. Second, data-driven methods to mitigate contradictions may be enhanced with large-scale contradiction data for training. Nevertheless, no attempt has been made to build an extensive collection of model-generated contradictory responses. In this paper, we build a large dataset of response generation models' contradictions for the first time. Then, we acquire valuable insights into the characteristics of model-generated contradictions through an extensive analysis of the collected responses. Lastly, we also demonstrate how this dataset substantially enhances the performance of data-driven contradiction suppression methods.

* 16 pages

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Jan 24, 2024

Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki

Abstract: We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.

* Accepted by AAAI2024; project page: https://github.com/nttmdlab-nlp/InstructDoc

Spike No More: Stabilizing the Pre-training of Large Language Models

Dec 28, 2023

Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

Abstract: Loss spikes often occur during the pre-training of large language models. These spikes degrade model performance and sometimes ruin the pre-training altogether. Since pre-training requires a vast computational budget, such spikes should be avoided. To investigate the cause of loss spikes, we focus on the gradients of internal layers in this study. Through theoretical analyses, we identify two causes of exploding gradients and provide requirements to prevent the explosion. In addition, we introduce a combination of an initialization method and a simple modification to embeddings as a way to satisfy the requirements. We conduct various experiments to verify our theoretical analyses empirically. Experimental results indicate that the combination is effective in preventing spikes during pre-training.

* Work in progress

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

Dec 04, 2023

Keito Kudo, Haruki Nagasawa, Jun Suzuki, Nobuyuki Shimizu

Abstract: This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly. This task aims to extract crucial scenes from the video in the form of images (keyframes) and generate corresponding captions explaining each keyframe's situation. This task is useful as a practical application and presents a highly challenging problem worthy of study. Specifically, achieving simultaneous optimization of the keyframe selection performance and caption quality necessitates careful consideration of the mutual dependence on both preceding and subsequent keyframes and captions. To facilitate subsequent research in this field, we also construct a dataset by expanding upon existing datasets and propose an evaluation framework. Furthermore, we develop two baseline systems and report their respective performance.

Refactoring Programs Using Large Language Models with Few-Shot Examples

Nov 20, 2023

Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, Yutaka Watanobe

Abstract: A less complex and more straightforward program is easier to maintain and makes it easier to write secure, bug-free code. However, because refactoring involves a heavy workload and the risk of breaking working programs, programmers are reluctant to refactor, which also costs them potential learning experiences. To mitigate this, we demonstrate the use of a large language model (LLM), GPT-3.5, to suggest less complex versions of a user-written Python program, aiming to encourage users to learn how to write better programs. We propose a method that leverages few-shot prompting of the LLM by selecting the best-suited code refactoring examples for each target programming problem, based on a prior evaluation of prompting with a one-shot example. The quantitative evaluation shows that 95.68% of programs can be refactored by generating 10 candidates each, resulting in a 17.35% reduction in the average cyclomatic complexity and a 25.84% decrease in the average number of lines after filtering to keep only generated programs that are semantically correct. Furthermore, the qualitative evaluation shows outstanding capability in code formatting, while unnecessary behaviors such as deleting or translating comments are also observed.

* 10 pages, 10 figures, accepted to the 30th Asia-Pacific Software Engineering Conference (APSEC 2023)
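The two metrics reported in the abstract above, cyclomatic complexity and line count, can be illustrated with a minimal sketch. The listing does not say which tool the authors used to measure complexity, so the counting rule below (one plus the number of branching constructs found in the syntax tree) is a simplified assumption, and the names `cyclomatic_complexity`, `line_count`, and `relative_reduction` are hypothetical helpers, not part of the paper.

```python
import ast

# Node types that each add one decision point (a simplified, assumed rule set).
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def line_count(source: str) -> int:
    """Count non-empty lines."""
    return sum(1 for line in source.splitlines() if line.strip())


def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from 'before' to 'after'."""
    return (before - after) / before * 100.0


original = """
def sign(x):
    if x > 0:
        return 1
    else:
        if x < 0:
            return -1
        else:
            return 0
"""

refactored = """
def sign(x):
    return (x > 0) - (x < 0)
"""

cc_before = cyclomatic_complexity(original)
cc_after = cyclomatic_complexity(refactored)
print(cc_before, cc_after, round(relative_reduction(cc_before, cc_after), 2))
print(line_count(original), line_count(refactored))
```

A refactoring pipeline like the one described would also need a semantic-correctness filter (for example, running each candidate against test cases) before averaging these metrics; that step is omitted here.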

Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism

Oct 23, 2023

Mengyu Ye, Tatsuki Kuribayashi, Jun Suzuki, Goro Kobayashi, Hiroaki Funayama

Abstract: Large language models (LLMs) take advantage of step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their ability to perform CoT-style reasoning robustly is of interest from a probing perspective. In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation, which is a core linguistic phenomenon that is difficult to process. In particular, we introduce several controlled settings (e.g., reasoning in case of fictional entities) to evaluate the logical reasoning abilities of the models. We observed that dozens of modern LLMs were not robust against lexical negation (e.g., plausible -> implausible) when performing CoT-style reasoning, and the results highlight unique limitations in each LLM family.

Chat Translation Error Detection for Assisting Cross-lingual Communications

Aug 02, 2023

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Ryoko Tokuhisa, Ana Brassard, Kentaro Inui

Abstract: In this paper, we describe the development of a communication support system that detects erroneous translations to facilitate cross-lingual communication, given the limitations of current machine chat translation methods. We trained an error detector as the baseline of the system and constructed a new Japanese-English bilingual chat corpus, BPersona-chat, which comprises multi-turn colloquial chats augmented with crowdsourced quality ratings. The error detector can serve as an encouraging foundation for more advanced erroneous translation detection systems.

* Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems, pages 88-95, November 2022, Online. Association for Computational Linguistics

Exploring the Robustness of Large Language Models for Solving Programming Problems

Jun 26, 2023

Atsushi Shirafuji, Yutaka Watanobe, Takumi Ito, Makoto Morishita, Yuki Nakamura, Yusuke Oda, Jun Suzuki

Abstract: Using large language models (LLMs) for source code has recently gained attention. LLMs, such as Transformer-based models like Codex and ChatGPT, have been shown to be highly capable of solving a wide range of programming problems. However, it is not yet known to what extent LLMs understand problem descriptions and generate programs accordingly, rather than simply retrieving source code from the most relevant problem in the training data based on superficial cues. To explore this research question, we conduct experiments to understand the robustness of several popular LLMs, CodeGen and GPT-3.5 series models, capable of tackling code generation tasks in introductory programming problems. Our experimental results show that CodeGen and Codex are sensitive to superficial modifications of problem descriptions, and that such modifications significantly impact code generation performance. Furthermore, we observe that Codex relies on variable names, as randomized variables decrease the solved rate significantly. However, the state-of-the-art (SOTA) models, such as InstructGPT and ChatGPT, show higher robustness to superficial modifications and have an outstanding capability for solving programming problems. This highlights the fact that slight modifications to the prompts given to the LLMs can greatly affect code generation performance, and careful formatting of prompts is essential for high-quality code generation, while the SOTA models are becoming more robust to perturbations.
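One of the superficial modifications mentioned above, randomizing variable names in a problem description, might look roughly like the following sketch. The abstract does not describe the authors' exact procedure, so everything here is an assumption for illustration: the function name `randomize_variables`, the choice of six-letter lowercase replacement names, and the sample problem text are all hypothetical.

```python
import random
import re
import string


def randomize_variables(description, variables, seed=0):
    """Replace each listed variable name in a problem description with a random one."""
    rng = random.Random(seed)
    mapping = {}
    used = set(variables)
    for name in variables:
        # Draw 6-letter lowercase names until one is unused.
        while True:
            candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
            if candidate not in used:
                break
        used.add(candidate)
        mapping[name] = candidate
    # Match whole identifiers only, trying longer names first.
    ordered = sorted(variables, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in ordered) + r")\b")
    perturbed = pattern.sub(lambda m: mapping[m.group(1)], description)
    return perturbed, mapping


# Hypothetical problem statement, not taken from the paper's benchmark.
description = "Given an integer N and a list A of N integers, print the sum of A."
perturbed, mapping = randomize_variables(description, ["N", "A"])
print(perturbed)
```

Matching is case-sensitive and whole-word, so a sentence-initial article "A" would also be replaced if "A" is listed as a variable; a real perturbation script would need to handle such cases.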

Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems

Nov 19, 2022

Shiki Sato, Yosuke Kishinami, Hiroaki Sugiyama, Reina Akama, Ryoko Tokuhisa, Jun Suzuki

Abstract: Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the automatic evaluation using the bipartite-play method mitigates these two drawbacks and correlates as strongly with human subjectivity as existing methods.

* 9 pages, Accepted to The AACL-IJCNLP 2022 Student Research Workshop (SRW)
