Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lei Li

HyperAI Team, Xiaomi Corporation

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Oct 26, 2024

Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu

Figure 1 for AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Figure 2 for AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Figure 3 for AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Figure 4 for AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Abstract:Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB.

* 15 pages, 3 figures

Via

Access Paper or Ask Questions

Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Oct 25, 2024

Danqing Wang, Zhuorui Ye, Fei Fang, Lei Li

Figure 1 for Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Figure 2 for Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Figure 3 for Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Figure 4 for Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Abstract:Enhancing the reasoning capabilities of large language models (LLMs) is crucial for enabling them to tackle complex, multi-step problems. Multi-agent frameworks have shown great potential in enhancing LLMs' reasoning capabilities. However, the lack of effective cooperation between LLM agents hinders their performance, especially for multi-step reasoning tasks. This paper proposes a novel cooperative multi-agent reasoning framework (CoPlanner) by separating reasoning steps and assigning distinct duties to different agents. CoPlanner consists of two LLM agents: a planning agent and a reasoning agent. The planning agent provides high-level strategic hints, while the reasoning agent follows these hints and infers answers. By training the planning agent's policy through the interactive reasoning process via Proximal Policy Optimization (PPO), the LLaMA-3-8B-based CoPlanner outperforms the previous best method by 9.94\% on LogiQA and 3.09\% on BBH. Our results demonstrate that the guidance from the planning agent and the effective cooperation between the agents contribute to the superior performance of CoPlanner in tackling multi-step reasoning problems.

* Working in progress

Via

Access Paper or Ask Questions

Why Does the Effective Context Length of LLMs Fall Short?

Oct 24, 2024

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong

Figure 1 for Why Does the Effective Context Length of LLMs Fall Short?

Figure 2 for Why Does the Effective Context Length of LLMs Fall Short?

Figure 3 for Why Does the Effective Context Length of LLMs Fall Short?

Figure 4 for Why Does the Effective Context Length of LLMs Fall Short?

Abstract:Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotray position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with \method even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.

Via

Access Paper or Ask Questions

**CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation**

Oct 21, 2024

Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li

Figure 1 for CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

Figure 2 for CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

Figure 3 for CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

Figure 4 for CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

Abstract:Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.

Via

Access Paper or Ask Questions

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Oct 18, 2024

Jingqi Zhou, Sheng Wang, Jingwei Dong, Lei Li, Jiahui Gao, Lingpeng Kong, Chuan Wu

Figure 1 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Figure 2 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Figure 3 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Figure 4 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Abstract:Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

Via

Access Paper or Ask Questions

Revealing the Barriers of Language Agents in Planning

Oct 16, 2024

Jian Xie, Kexun Zhang, Jiangjie Chen, Siyu Yuan, Kai Zhang, Yikai Zhang, Lei Li, Yanghua Xiao

Figure 1 for Revealing the Barriers of Language Agents in Planning

Figure 2 for Revealing the Barriers of Language Agents in Planning

Figure 3 for Revealing the Barriers of Language Agents in Planning

Figure 4 for Revealing the Barriers of Language Agents in Planning

Abstract:Autonomous planning has been an ongoing pursuit since the inception of artificial intelligence. Based on curated problem solvers, early planning agents could deliver precise solutions for specific tasks but lacked generalization. The emergence of large language models (LLMs) and their powerful reasoning capabilities has reignited interest in autonomous planning by automatically generating reasonable solutions for given tasks. However, prior research and our experiments show that current language agents still lack human-level planning abilities. Even the state-of-the-art reasoning model, OpenAI o1, achieves only 15.6% on one of the complex real-world planning benchmarks. This highlights a critical question: What hinders language agents from achieving human-level planning? Although existing studies have highlighted weak performance in agent planning, the deeper underlying issues and the mechanisms and limitations of the strategies proposed to address them remain insufficiently understood. In this work, we apply the feature attribution study and identify two key factors that hinder agent planning: the limited role of constraints and the diminishing influence of questions. We also find that although current strategies help mitigate these challenges, they do not fully resolve them, indicating that agents still have a long way to go before reaching human-level intelligence.

* Work in Progress

Via

Access Paper or Ask Questions

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Oct 16, 2024

Botian Jiang, Lei Li, Xiaonan Li, Zhaowei Li, Xiachong Feng, Lingpeng Kong, Qi Liu, Xipeng Qiu

Figure 1 for Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Figure 2 for Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Figure 3 for Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Figure 4 for Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by the development of various benchmarks to evaluate their capabilities. However, the true nature of these evaluations and the extent to which they assess multimodal reasoning versus merely leveraging the underlying Large Language Model (LLM) backbone remain unclear. This paper presents a comprehensive investigation into the role of LLM backbones in MLLM evaluation, focusing on two critical aspects: the degree to which current benchmarks truly assess multimodal reasoning and the influence of LLM prior knowledge on performance. Specifically, we introduce a modified evaluation protocol to disentangle the contributions of the LLM backbone from multimodal integration, and an automatic knowledge identification technique for diagnosing whether LLMs equip the necessary knowledge for corresponding multimodal questions. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50\% of error rates can be attributed to insufficient world knowledge in the LLM backbone, indicating a heavy reliance on language capabilities. To address knowledge deficiencies, we propose a knowledge augmentation pipeline that achieves significant performance gains, with improvements of up to 60\% on certain datasets, resulting in a approximately 4x increase in performance. Our work provides crucial insights into the role of the LLM backbone in MLLMs, and highlights the need for more nuanced benchmarking approaches.

Via

Access Paper or Ask Questions

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Oct 15, 2024

Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Figure 1 for Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Figure 2 for Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Figure 3 for Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Figure 4 for Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Abstract:Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD, are adversely impacted by the knowledge gaps between teacher-student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.

Via

Access Paper or Ask Questions

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Oct 14, 2024

Haochuan Wang, Xiachong Feng, Lei Li, Zhanyue Qin, Dianbo Sui, Lingpeng Kong

Figure 1 for TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Figure 2 for TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Figure 3 for TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Figure 4 for TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Abstract:The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.

Via

Access Paper or Ask Questions

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Oct 12, 2024

Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu

Figure 1 for VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Figure 2 for VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Figure 3 for VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Figure 4 for VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Abstract:As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9\% and 9.5\% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.

* EMNLP 2024 Main Conference camera-ready version. This article supersedes arXiv:2312.10665

Via

Access Paper or Ask Questions