Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kyomin Jung

SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models

Oct 25, 2024

Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung

Abstract:Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with student-generated outputs (SGOs) being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly excelling in the generation of long sequential data.

Via

Access Paper or Ask Questions

Guaranteed Generation from Large Language Models

Oct 09, 2024

Minbeom Kim, Thibaut Thonet, Jos Rozen, Hwaran Lee, Kyomin Jung, Marc Dymetman

Figure 1 for Guaranteed Generation from Large Language Models

Figure 2 for Guaranteed Generation from Large Language Models

Figure 3 for Guaranteed Generation from Large Language Models

Figure 4 for Guaranteed Generation from Large Language Models

Abstract:As large language models (LLMs) are increasingly used across various applications, there is a growing need to control text generation to satisfy specific constraints or requirements. This raises a crucial question: Is it possible to guarantee strict constraint satisfaction in generated outputs while preserving the distribution of the original model as much as possible? We first define the ideal distribution - the one closest to the original model, which also always satisfies the expressed constraint - as the ultimate goal of guaranteed generation. We then state a fundamental limitation, namely that it is impossible to reach that goal through autoregressive training alone. This motivates the necessity of combining training-time and inference-time methods to enforce such guarantees. Based on this insight, we propose GUARD, a simple yet effective approach that combines an autoregressive proposal distribution with rejection sampling. Through GUARD's theoretical properties, we show how controlling the KL divergence between a specific proposal and the target ideal distribution simultaneously optimizes inference speed and distributional closeness. To validate these theoretical concepts, we conduct extensive experiments on two text generation settings with hard-to-satisfy constraints: a lexical constraint scenario and a sentiment reversal scenario. These experiments show that GUARD achieves perfect constraint satisfaction while almost preserving the ideal distribution with highly improved inference efficiency. GUARD provides a principled approach to enforcing strict guarantees for LLMs without compromising their generative capabilities.

* 22 pages, 11 figures

Via

Access Paper or Ask Questions

A Character-Centric Creative Story Generation via Imagination

Sep 25, 2024

Kyeongman Park, Minbeom Kim, Kyomin Jung

Figure 1 for A Character-Centric Creative Story Generation via Imagination

Figure 2 for A Character-Centric Creative Story Generation via Imagination

Figure 3 for A Character-Centric Creative Story Generation via Imagination

Figure 4 for A Character-Centric Creative Story Generation via Imagination

Abstract:Creative story generation with diverse and detailed story elements is a long-standing goal for large language models. While existing methodologies generate long and coherent stories, they fall significantly short of human capabilities in terms of diversity and character detail. To address this, we introduce a novel story generation framework called CCI (Character-centric Creative story generation via Imagination). CCI features two innovative modules for creative story generation: IG (Image-Guided Imagination) and MW (Multi-Writer model). In the IG module, we utilize DALL-E 3 to create visual representations of key story elements. The IG generates more novel and concrete characters, backgrounds, and main plots than text-only methods. The MW module uses these story elements created by IG to generate multiple description candidates for the protagonist and select the best one. This method incorporates vivid and rich character descriptions into the story. We compared the stories generated by CCI and baseline models through human evaluation and statistical analysis. The results showed significant improvements in the creativity. Furthermore, by enabling interactive multi-modal story generation with users, we have opened up possibilities for human-LLM integration in cultural development.

Via

Access Paper or Ask Questions

Persona is a Double-edged Sword: Enhancing the Zero-shot Reasoning by Ensembling the Role-playing and Neutral Prompts

Aug 16, 2024

Junseok Kim, Nakyeong Yang, Kyomin Jung

Abstract:Recent studies demonstrate that prompting an appropriate role-playing persona to an LLM improves its reasoning capability. However, assigning a proper persona is difficult since an LLM's performance is extremely sensitive to assigned prompts; therefore, personas sometimes hinder LLMs and degrade their reasoning capabilities. In this paper, we propose a novel framework, Jekyll \& Hyde, which ensembles the results of role-playing and neutral prompts to eradicate performance degradation via unilateral use of role-playing prompted LLM and enhance the robustness of an LLM's reasoning ability. Specifically, Jekyll \& Hyde collects two potential solutions from both role-playing and neutral prompts and selects a better solution after cross-checking via an LLM evaluator. However, LLM-based evaluators tend to be affected by the order of those potential solutions within the prompt when selecting the proper solution; thus, we also propose a robust LLM evaluator to mitigate the position bias. The experimental analysis demonstrates that role-playing prompts distract LLMs and degrade their reasoning abilities in 4 out of 12 datasets, even when using GPT-4. In addition, we reveal that Jekyll \& Hyde improves reasoning capabilities by selecting better choices among the potential solutions on twelve widely-used reasoning datasets. We further show that our proposed LLM evaluator outperforms other baselines, proving the LLMs' position bias is successfully mitigated.

* 13 pages, 4 figures

Via

Access Paper or Ask Questions

Fine-grained Gender Control in Machine Translation with Large Language Models

Jul 21, 2024

Minwoo Lee, Hyukhun Koh, Minsung Kim, Kyomin Jung

Figure 1 for Fine-grained Gender Control in Machine Translation with Large Language Models

Figure 2 for Fine-grained Gender Control in Machine Translation with Large Language Models

Figure 3 for Fine-grained Gender Control in Machine Translation with Large Language Models

Figure 4 for Fine-grained Gender Control in Machine Translation with Large Language Models

Abstract:In machine translation, the problem of ambiguously gendered input has been pointed out, where the gender of an entity is not available in the source sentence. To address this ambiguity issue, the task of controlled translation that takes the gender of the ambiguous entity as additional input have been proposed. However, most existing works have only considered a simplified setup of one target gender for input. In this paper, we tackle controlled translation in a more realistic setting of inputs with multiple entities and propose Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs the model with fine-grained entity-level gender information to translate with correct gender inflections. By utilizing four evaluation benchmarks, we investigate the controlled translation capability of LLMs in multiple dimensions and find that LLMs reach state-of-the-art performance in controlled translation. Furthermore, we discover an emergence of gender interference phenomenon when controlling the gender of multiple entities. Finally, we address the limitations of existing gender accuracy evaluation metrics and propose leveraging LLMs as an evaluator for gender inflection in machine translation.

* NAACL 2024 Main track long paper

Via

Access Paper or Ask Questions

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Jun 17, 2024

Kang-il Lee, Minbeom Kim, Minsung Kim, Dongryeol Lee, Hyukhun Koh, Kyomin Jung

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.

Via

Access Paper or Ask Questions

Return of EM: Entity-driven Answer Set Expansion for QA Evaluation

Apr 24, 2024

Dongryeol Lee, Minwoo Lee, Kyungmin Min, Joonsuk Park, Kyomin Jung

Figure 1 for Return of EM: Entity-driven Answer Set Expansion for QA Evaluation

Figure 2 for Return of EM: Entity-driven Answer Set Expansion for QA Evaluation

Figure 3 for Return of EM: Entity-driven Answer Set Expansion for QA Evaluation

Figure 4 for Return of EM: Entity-driven Answer Set Expansion for QA Evaluation

Abstract:Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.

* Under Review (9 pages, 3 figures)

Via

Access Paper or Ask Questions

SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up

Apr 18, 2024

Nakyeong Yang, Junseok Kim, Jiwon Moon, Yunah Jang, Kyomin Jung

Abstract:Prompt-tuning methods have shown comparable performance as parameter-efficient fine-tuning (PEFT) methods in various natural language understanding tasks. However, existing prompt tuning methods still utilize the entire model architecture; thus, they fail to accelerate inference speed in the application. In this paper, we propose a novel approach called SKIll-localized Prompt tuning (SKIP), which is extremely efficient in inference time. Our method significantly enhances inference efficiency by investigating and utilizing a skill-localized subnetwork in a language model. Surprisingly, our method improves the inference speed up to 160% while pruning 52% of the parameters. Furthermore, we demonstrate that our method is applicable across various transformer-based architectures, thereby confirming its practicality and scalability.

* 6 pages

Via

Access Paper or Ask Questions

AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Apr 18, 2024

Minbeom Kim, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung

Figure 1 for AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Figure 2 for AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Figure 3 for AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Figure 4 for AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Abstract:As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 advice per query, with 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. Therefore, we've completed a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote ranking to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.

* 19 pages, 11 figures

Via

Access Paper or Ask Questions

MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs

Mar 09, 2024

Yerin Hwang, Yongil Kim, Yunah Jang, Jeesoo Bang, Hyunkyung Bae, Kyomin Jung

Figure 1 for MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs

Figure 2 for MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs

Figure 3 for MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs

Figure 4 for MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs

Abstract:Despite advancements in on-topic dialogue systems, effectively managing topic shifts within dialogues remains a persistent challenge, largely attributed to the limited availability of training datasets. To address this issue, we propose Multi-Passage to Dialogue (MP2D), a data generation framework that automatically creates conversational question-answering datasets with natural topic transitions. By leveraging the relationships between entities in a knowledge graph, MP2D maps the flow of topics within a dialogue, effectively mirroring the dynamics of human conversation. It retrieves relevant passages corresponding to the topics and transforms them into dialogues through the passage-to-dialogue method. Through quantitative and qualitative experiments, we demonstrate MP2D's efficacy in generating dialogue with natural topic shifts. Furthermore, this study introduces a novel benchmark for topic shift dialogues, TS-WikiDialog. Utilizing the dataset, we demonstrate that even Large Language Models (LLMs) struggle to handle topic shifts in dialogue effectively, and we showcase the performance improvements of models trained on datasets generated by MP2D across diverse topic shift dialogue tasks.

* 20 pages

Via

Access Paper or Ask Questions