Lingpeng Kong

Non-myopic Generation of Language Models for Reasoning and Planning

Oct 28, 2024

Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong

Abstract:Large Language Models have demonstrated remarkable abilities in reasoning and planning by breaking down complex problems into sequential steps. Despite their success in various domains like mathematical problem-solving and coding, LLMs face challenges in ensuring reliable and optimal planning due to the inherently myopic nature of autoregressive decoding. This paper revisits LLM reasoning from an optimal-control perspective, proposing a novel method, Predictive-Decoding, that leverages Model Predictive Control to enhance planning accuracy. By re-weighting LLM distributions based on foresight trajectories, Predictive-Decoding aims to mitigate early errors and promote non-myopic planning. Our experiments show significant improvements in a wide range of tasks for math, coding, and agents. Furthermore, Predictive-Decoding demonstrates computational efficiency, outperforming search baselines with reduced computational resources. This study provides insights into optimizing LLM planning capabilities.
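
The re-weighting idea in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name, the exponential weighting rule, the temperature parameter, and the example scores are all assumptions chosen for illustration. Each candidate next step keeps its base model probability, which is then multiplied by a factor derived from the average score of a few foresight rollouts that begin with that candidate.

```python
import math

def reweight_with_foresight(base_probs, foresight_scores, temperature=1.0):
    """Re-weight a next-step distribution by the quality of foresight rollouts.

    base_probs: dict mapping candidate step -> base model probability.
    foresight_scores: dict mapping candidate step -> list of log-scores of
        rollouts that start with that step (hypothetical values here).
    The exponential weighting is an illustrative assumption, not the paper's rule.
    """
    weights = {}
    for cand, prob in base_probs.items():
        scores = foresight_scores[cand]
        avg_score = sum(scores) / len(scores)
        weights[cand] = prob * math.exp(avg_score / temperature)
    total = sum(weights.values())
    # Normalize so the re-weighted values form a probability distribution.
    return {cand: w / total for cand, w in weights.items()}

# The base model prefers step_a, but rollouts starting with step_b score better.
base = {"step_a": 0.6, "step_b": 0.4}
foresight = {"step_a": [-3.0, -2.5], "step_b": [-0.5, -1.0]}
reweighted = reweight_with_foresight(base, foresight)
```

In this toy case the myopic choice (step_a) loses to step_b once the rollout scores are taken into account, which is the kind of early-error correction the abstract describes.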

Why Does the Effective Context Length of LLMs Fall Short?

Oct 24, 2024

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong

Abstract:Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in the pretraining and post-training stages of LLMs, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama 3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Oct 23, 2024

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han (+2 more)

Abstract:Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions: https://github.com/HKUNLP/DiffuLLaMA.

* 25 pages. Code: https://github.com/HKUNLP/DiffuLLaMA

Non-myopic Generation of Language Model for Reasoning and Planning

Oct 23, 2024

Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong

Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration

Oct 22, 2024

Qintong Li, Jiahui Gao, Sheng Wang, Renjie Pi, Xueliang Zhao, Chuan Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong

Abstract:Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data, leading to impressive performance across a range of downstream applications. Current methods often rely on human-annotated data or predefined task templates to direct powerful LLMs in synthesizing task-relevant data for effective model training. However, this dependence on manually designed components may constrain the scope of generated data, potentially overlooking critical edge cases or novel scenarios that could challenge the model. In this paper, we present a novel approach, ReverseGen, designed to automatically generate effective training samples that expose the weaknesses of LLMs. Specifically, we introduce a dedicated proposer trained to produce queries that lead target models to generate unsatisfactory responses. These failure-inducing queries are then used to construct training data, helping to address the models' shortcomings and improve overall performance. Our approach is flexible and can be applied to models of various scales (3B, 7B, and 8B). We evaluate ReverseGen on three key applications (safety, honesty, and math), demonstrating that our generated data is both highly effective and diverse. Models fine-tuned with ReverseGen-generated data consistently outperform those trained on human-annotated or general model-generated data, offering a new perspective on data synthesis for task-specific LLM enhancement.

Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

Oct 18, 2024

Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, Lingpeng Kong

Abstract:Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Oct 18, 2024

Jingqi Zhou, Sheng Wang, Jingwei Dong, Lei Li, Jiahui Gao, Lingpeng Kong, Chuan Wu

Abstract:Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose the visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on the MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Oct 16, 2024

Botian Jiang, Lei Li, Xiaonan Li, Zhaowei Li, Xiachong Feng, Lingpeng Kong, Qi Liu, Xipeng Qiu

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by the development of various benchmarks to evaluate their capabilities. However, the true nature of these evaluations and the extent to which they assess multimodal reasoning versus merely leveraging the underlying Large Language Model (LLM) backbone remain unclear. This paper presents a comprehensive investigation into the role of LLM backbones in MLLM evaluation, focusing on two critical aspects: the degree to which current benchmarks truly assess multimodal reasoning and the influence of LLM prior knowledge on performance. Specifically, we introduce a modified evaluation protocol to disentangle the contributions of the LLM backbone from multimodal integration, and an automatic knowledge identification technique for diagnosing whether LLMs possess the necessary knowledge for corresponding multimodal questions. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone, indicating a heavy reliance on language capabilities. To address knowledge deficiencies, we propose a knowledge augmentation pipeline that achieves significant performance gains, with improvements of up to 60% on certain datasets, resulting in an approximately 4x increase in performance. Our work provides crucial insights into the role of the LLM backbone in MLLMs, and highlights the need for more nuanced benchmarking approaches.

QSpec: Speculative Decoding with Complementary Quantization Schemes

Oct 15, 2024

Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu

Abstract:Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models (LLMs). While activation-weight joint quantization speeds up the inference process through low-precision kernels, we demonstrate that it suffers severe performance degradation on multi-step reasoning tasks, rendering it ineffective. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding. Leveraging nearly cost-free execution switching, QSPEC drafts tokens with low-precision, fast activation-weight quantization, and verifies them with high-precision weight-only quantization, effectively combining the strengths of both quantization schemes. Compared to high-precision quantization methods, QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise, distinguishing it from other low-precision quantization approaches. This enhancement is also consistent across various serving tasks, model sizes, quantization methods, and batch sizes. Unlike existing speculative decoding techniques, our approach reuses weights and the KV cache, avoiding additional memory overhead. Furthermore, QSPEC offers a plug-and-play advantage without requiring any training. We believe that QSPEC demonstrates unique strengths for future deployment of high-fidelity quantization schemes, particularly in memory-constrained scenarios (e.g., edge devices).
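
The draft-then-verify loop described in this abstract follows the general pattern of greedy speculative decoding, sketched below with toy stand-ins. This is a minimal illustration, not QSPEC itself: `draft_next` and `verify_next` are hypothetical callables standing in for the low-precision and high-precision execution modes, and the verifier is called token by token here, whereas a real system would check a whole draft in one batched pass and share weights and the KV cache between modes.

```python
def speculative_decode(draft_next, verify_next, prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding with toy next-token functions.

    draft_next / verify_next: callables mapping a token list to the next token
    (hypothetical stand-ins for the fast and accurate execution modes).
    """
    tokens = list(prompt)
    target_len = len(prompt) + num_tokens
    while len(tokens) < target_len:
        # Draft phase: the fast mode proposes a short run of tokens.
        draft = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify phase: keep drafted tokens while the accurate mode agrees;
        # at the first disagreement, keep the accurate token and drop the rest.
        for proposed in draft:
            expected = verify_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break
    return tokens

# Toy models over digits: the verifier counts upward; the drafter errs after a 4.
verify = lambda toks: (toks[-1] + 1) % 10
draft = lambda toks: 0 if toks[-1] == 4 else (toks[-1] + 1) % 10
result = speculative_decode(draft, verify, prompt=[0], num_tokens=8)
```

Because every emitted token is the one the verifier would have chosen, the output matches plain greedy decoding with the accurate mode, which is the sense in which speedups of this kind come "without any quality compromise".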

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Oct 14, 2024

Haochuan Wang, Xiachong Feng, Lei Li, Zhanyue Qin, Dianbo Sui, Lingpeng Kong

Abstract:The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.
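
The figure of 144 game types can be reproduced by direct enumeration. The sketch below assumes the usual setting for this count, which the abstract does not spell out: each player strictly ranks the four outcomes (payoffs 1 to 4), and two games are treated as the same type when they differ only by swapping the two rows or the two columns. The function and variable names are illustrative.

```python
from itertools import permutations

def canonical(game):
    """Return a canonical form of a 2x2 game under row and column swaps.

    game: tuple of four (row_payoff, col_payoff) cells, ordered as
    (top-left, top-right, bottom-left, bottom-right).
    """
    a, b, c, d = game
    variants = [
        (a, b, c, d),  # original
        (c, d, a, b),  # rows swapped
        (b, a, d, c),  # columns swapped
        (d, c, b, a),  # rows and columns swapped
    ]
    return min(variants)

# Each player assigns the ranks 1..4 to the four cells: 24 * 24 = 576 games.
distinct_games = set()
for row_payoffs in permutations(range(1, 5)):
    for col_payoffs in permutations(range(1, 5)):
        game = tuple(zip(row_payoffs, col_payoffs))
        distinct_games.add(canonical(game))

num_games = len(distinct_games)  # 576 / 4 = 144
```

Since each player's four payoffs are distinct, every relabeling changes the payoff table, so the 576 tables fall into groups of exactly four and the count is 576 / 4 = 144.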
