Baotian Hu

Improving Value-based Process Verifier via Structural Prior Injection

Feb 21, 2025

Zetian Sun, Dongfang Li, Baotian Hu, Jun Yu, Min Zhang

Abstract: In Large Language Model (LLM) reasoning, state values are often estimated via Monte Carlo sampling. Although Monte Carlo estimation is an elegant method with less inductive bias, the limited number of samples inevitably introduces noise and errors. To address this problem, we inject a structural prior into the value representation and recast the scalar value as the expectation of a pre-defined categorical distribution, representing the noise and errors from a distributional perspective. Specifically, by treating the result of Monte Carlo sampling as a single sample from the prior ground-truth Binomial distribution, we quantify the sampling error as the mismatch between the posterior estimated distribution and the ground-truth distribution, which is then optimized via distribution selection optimization. We evaluate value-based process verifiers on the Best-of-N task and the beam search task. Compared with the scalar value representation, we show that reasonable structural prior injection, induced by different objective functions or optimization methods, can improve the performance of value-based process verifiers by about 1 to 2 points at little to no cost. We also show that under different structural priors, the verifiers' performance varies greatly despite sharing the same optimal solution, indicating the importance of reasonable structural prior injection.

* Preprint. Under review
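The relationship between a Monte Carlo value estimate and a categorical value representation can be illustrated with a small sketch. This is not the paper's implementation: the bin layout (`k / n` for `k` successes out of `n` rollouts), the function names, and the use of cross-entropy as the mismatch measure are all assumptions made for illustration.

```python
import math

def binomial_pmf(n, p):
    """Probability of observing k successes in n rollouts, for k = 0..n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def bins(n):
    """Assumed support of the categorical distribution: the possible MC estimates k/n."""
    return [k / n for k in range(n + 1)]

def categorical_expectation(probs, support):
    """Scalar value recovered as the expectation of the categorical distribution."""
    return sum(q * b for q, b in zip(probs, support))

def cross_entropy(target, predicted, eps=1e-12):
    """Assumed mismatch measure between a target and a predicted distribution."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, predicted))

n_rollouts = 8          # hypothetical number of Monte Carlo rollouts per state
true_value = 0.25       # hypothetical ground-truth success probability of the state

target = binomial_pmf(n_rollouts, true_value)
support = bins(n_rollouts)
uniform = [1.0 / (n_rollouts + 1)] * (n_rollouts + 1)

# The expectation of the Binomial-shaped categorical distribution equals the state value.
recovered = categorical_expectation(target, support)
print(recovered, cross_entropy(target, target), cross_entropy(target, uniform))
```

A single Monte Carlo estimate `k / n` corresponds to one draw from this distribution, which is why a scalar regression target built from it is noisy when `n` is small.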

KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

Jan 03, 2025

Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu(+1 more)

Abstract: As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.

* Technical Report. 23 pages, 6 figures, 10 tables

RaSeRec: Retrieval-Augmented Sequential Recommendation

Dec 24, 2024

Xinping Zhao, Baotian Hu, Yan Zhong, Shouzheng Huang, Zihao Zheng, Meng Wang, Haofen Wang, Min Zhang

Abstract: Although prevailing supervised and self-supervised learning (SSL)-augmented sequential recommendation (SeRec) models have achieved improved performance with powerful neural network architectures, we argue that they still suffer from two limitations: (1) Preference Drift, where models trained on past data can hardly accommodate evolving user preferences; and (2) Implicit Memory, where head patterns dominate parametric learning, making it harder to recall long tails. In this work, we explore retrieval augmentation in SeRec to address these limitations. To this end, we propose a Retrieval-Augmented Sequential Recommendation framework, named RaSeRec, whose main idea is to maintain a dynamic memory bank to accommodate preference drifts and to retrieve relevant memories to augment user modeling explicitly. It consists of two stages: (i) collaborative-based pre-training, which learns to recommend and retrieve; (ii) retrieval-augmented fine-tuning, which learns to leverage retrieved memories. Extensive experiments on three datasets demonstrate the superiority and effectiveness of RaSeRec.

* 20 pages, 8 figures, 8 tables

CMT: A Memory Compression Method for Continual Knowledge Learning of Large Language Models

Dec 10, 2024

Dongfang Li, Zetian Sun, Xinshuo Hu, Baotian Hu, Min Zhang

Abstract: Large Language Models (LLMs) need to adapt to the continuous changes in data, tasks, and user preferences. Due to their massive size and the high costs associated with training, LLMs are not suitable for frequent retraining. However, updates are necessary to keep them in sync with rapidly evolving human knowledge. To address these challenges, this paper proposes the Compression Memory Training (CMT) method, an efficient and effective online adaptation framework for LLMs that features robust knowledge retention capabilities. Inspired by human memory mechanisms, CMT compresses and extracts information from new documents to be stored in a memory bank. When answering queries related to these new documents, the model aggregates the relevant document memories from the memory bank to better answer user questions. The parameters of the LLM itself do not change during training and inference, reducing the risk of catastrophic forgetting. To enhance the encoding, retrieval, and aggregation of memory, we further propose three new general and flexible techniques: a memory-aware objective, self-matching, and top-aggregation. Extensive experiments conducted on three continual learning datasets (i.e., StreamingQA, SQuAD and ArchivalQA) demonstrate that the proposed method improves model adaptability and robustness across multiple base LLMs (e.g., +4.07 EM & +4.19 F1 in StreamingQA with Llama-2-7b).

* AAAI 2025; Pre-print

SEER: Self-Aligned Evidence Extraction for Retrieval-Augmented Generation

Oct 15, 2024

Xinping Zhao, Dongfang Li, Yan Zhong, Boren Hu, Yibin Chen, Baotian Hu, Min Zhang

Abstract: Recent studies in Retrieval-Augmented Generation (RAG) have investigated extracting evidence from retrieved passages to reduce computational costs and enhance the final RAG performance, yet it remains challenging. Existing methods rely heavily on heuristic-based augmentation and encounter several issues: (1) poor generalization due to hand-crafted context filtering; (2) semantic deficiency due to rule-based context chunking; (3) skewed length due to sentence-wise filter learning. To address these issues, we propose a model-based evidence extraction learning framework, SEER, which optimizes a vanilla model as an evidence extractor with desired properties through self-aligned learning. Extensive experiments show that our method substantially improves the final RAG performance, enhances the faithfulness, helpfulness, and conciseness of the extracted evidence, and reduces the evidence length by 9.25 times. The code will be available at https://github.com/HITsz-TMG/SEER.

* 15 pages, 6 figures, 5 tables. Accepted by EMNLP 2024 (main)

Enhancing Attributed Graph Networks with Alignment and Uniformity Constraints for Session-based Recommendation

Oct 14, 2024

Xinping Zhao, Chaochao Chen, Jiajie Su, Yizhao Zhang, Baotian Hu

Abstract: Session-based Recommendation (SBR), which seeks to predict a user's next action based on an anonymous session, has drawn increasing attention for its practicability. Most SBR models rely only on the contextual transitions within a short session to learn item representations while neglecting additional valuable knowledge. As such, their model capacity is largely limited by the data sparsity issue caused by short sessions. A few studies have exploited the Modeling of Item Attributes (MIA) to enrich item representations. However, they usually involve specific model designs that can hardly transfer to existing attribute-agnostic SBR models and thus lack universality. In this paper, we propose a model-agnostic framework, named AttrGAU (Attributed Graph Networks with Alignment and Uniformity Constraints), to bring the advantages of MIA to existing attribute-agnostic models and improve their accuracy and robustness for recommendation. Specifically, we first build a bipartite attributed graph and design an attribute-aware graph convolution to exploit the rich attribute semantics hidden in the heterogeneous item-attribute relationship. We then decouple existing attribute-agnostic SBR models into the graph neural network and attention readout sub-modules to satisfy the non-intrusive requirement. Lastly, we design two representation constraints, i.e., alignment and uniformity, to optimize the distribution discrepancy in representation between the attribute semantics and collaborative semantics. Extensive experiments on three public benchmark datasets demonstrate that the proposed AttrGAU framework can significantly enhance backbone models' recommendation performance and robustness against data sparsity and data noise issues. Our implementation code will be available at https://github.com/ItsukiFujii/AttrGAU.

* 11 pages, 4 figures, 5 tables. Accepted by ICWS 2024
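Alignment and uniformity constraints are commonly written as two losses over L2-normalized embeddings: alignment pulls paired representations together, and uniformity spreads representations over the unit hypersphere. The sketch below uses that common formulation; whether AttrGAU uses exactly these forms is an assumption, and the function names and the temperature `t=2.0` are illustrative choices.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment_loss(xs, ys):
    """Mean squared distance between paired embeddings (e.g. the attribute
    view and the collaborative view of the same item)."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)

def uniformity_loss(xs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs;
    lower values mean the embeddings are spread more evenly."""
    pairs = [(i, j) for i in range(len(xs)) for j in range(i + 1, len(xs))]
    total = sum(math.exp(-t * sq_dist(xs[i], xs[j])) for i, j in pairs)
    return math.log(total / len(pairs))

# Hypothetical embeddings for two items under two views.
attr_view = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 2.0])]
collab_view = [l2_normalize(v) for v in ([3.0, 0.0], [0.0, 1.0])]

align = alignment_loss(attr_view, collab_view)
uniform = uniformity_loss(attr_view)
print(align, uniform)
```

In a training loop these two terms would typically be added to the recommendation loss with weighting coefficients, which are not specified in the abstract.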

FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG

Oct 14, 2024

Xinping Zhao, Yan Zhong, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Dongfang Li, Baotian Hu, Min Zhang

Abstract: Retrieval-Augmented Generation (RAG) prevails in Large Language Models. It mainly consists of retrieval and generation. The retrieval modules (a.k.a. retrievers) aim to find useful information to facilitate the generation modules (a.k.a. generators). As such, generators' performance largely depends on the effectiveness and efficiency of retrievers. However, the retrieval paradigm that we design and use remains flat, treating the retrieval procedure as a one-off deal with constant granularity. Despite its effectiveness, we argue that flat retrieval suffers from two limitations: (1) it places a significant burden on one retriever; (2) constant granularity limits the ceiling of retrieval performance. In this work, we propose a progressive retrieval paradigm with coarse-to-fine granularity for RAG, termed FunnelRAG, so as to balance effectiveness and efficiency. Specifically, FunnelRAG establishes a progressive retrieval pipeline by combining coarse-to-fine granularity, large-to-small quantity, and low-to-high capacity, which can relieve the burden on one retriever and also raise the ceiling of retrieval performance. Extensive experiments show that FunnelRAG achieves comparable retrieval performance while reducing the time overhead by nearly 40 percent.

* 18 pages, 6 figures, 13 tables
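The coarse-to-fine idea can be sketched as a two-stage pipeline: a cheap scorer ranks large units and keeps a few, then a costlier scorer ranks the smaller units inside them. This is only a toy illustration under stated assumptions: the two lexical scorers, the function names, and the corpus layout (document id mapped to a list of passages) are hypothetical stand-ins, not the retrievers used in FunnelRAG.

```python
def unigram_overlap(query, text):
    """Cheap scorer: number of distinct query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def bigrams(text):
    """Set of adjacent word pairs in the text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

def fine_score(query, passage):
    """Costlier stand-in scorer: unigram overlap plus matching query bigrams."""
    return unigram_overlap(query, passage) + len(bigrams(query) & bigrams(passage))

def funnel_retrieve(query, corpus, coarse_k=2, fine_k=1):
    # Stage 1 (coarse granularity, cheap scorer): rank whole documents.
    doc_scores = {doc_id: unigram_overlap(query, " ".join(passages))
                  for doc_id, passages in corpus.items()}
    kept_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:coarse_k]
    # Stage 2 (fine granularity, costlier scorer): rank passages of kept documents only.
    candidates = [(doc_id, p) for doc_id in kept_docs for p in corpus[doc_id]]
    ranked = sorted(candidates, key=lambda dp: fine_score(query, dp[1]), reverse=True)
    return ranked[:fine_k]

# Hypothetical toy corpus.
corpus = {
    "d1": ["funnel retrieval narrows candidates step by step", "unrelated cooking text"],
    "d2": ["weather report for today"],
    "d3": ["progressive retrieval with coarse to fine granularity", "retrieval"],
}
result = funnel_retrieve("coarse to fine retrieval", corpus)
print(result)
```

Because the second stage only sees passages from the documents kept by the first stage, the costlier scorer runs on far fewer candidates than it would in a flat pipeline.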

Medico: Towards Hallucination Detection and Correction with Multi-source Evidence Fusion

Oct 14, 2024

Xinping Zhao, Jindi Yu, Zhenyu Liu, Jifang Wang, Dongfang Li, Yibin Chen, Baotian Hu, Min Zhang

Abstract: Hallucinations, where the generated content is coherent but factually incorrect, prevail in Large Language Models (LLMs) and seriously hinder their widespread application. Previous studies have shown that LLMs may confidently state non-existent facts rather than answering "I don't know". Therefore, it is necessary to resort to external knowledge to detect and correct the hallucinated content. Since manual detection and correction of factual errors is labor-intensive, an automatic end-to-end hallucination-checking approach is needed. To this end, we present Medico, a Multi-source evidence fusion enhanced hallucination detection and correction framework. It fuses diverse evidence from multiple sources, detects whether the generated content contains factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content. Experimental results on evidence retrieval (0.964 HR@5, 0.908 MRR@5), hallucination detection (0.927-0.951 F1), and hallucination correction (0.973-0.979 approval rate) demonstrate the great potential of Medico. A video demo of Medico can be found at https://youtu.be/RtsO6CSesBI.

* 12 pages, 3 figures, 6 tables. Accepted by EMNLP 2024's demo track
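For readers unfamiliar with the retrieval metrics quoted above, HR@k and MRR@k are usually computed as shown in this minimal sketch. It follows the standard definitions; the exact evaluation protocol used for Medico is not given in the abstract, and the function names and toy data are hypothetical.

```python
def hit_rate_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for ranked, rel in zip(rankings, relevant)
        if any(item in rel for item in ranked[:k])
    )
    return hits / len(rankings)

def mrr_at_k(rankings, relevant, k):
    """Mean reciprocal rank of the first relevant item within the top k
    (a query contributes 0 if no relevant item appears in the top k)."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for position, item in enumerate(ranked[:k], start=1):
            if item in rel:
                total += 1.0 / position
                break
    return total / len(rankings)

# Hypothetical ranked evidence ids for two queries, with their relevant sets.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]

hr = hit_rate_at_k(rankings, relevant, k=3)
mrr = mrr_at_k(rankings, relevant, k=3)
print(hr, mrr)
```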

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Aug 19, 2024

Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang

Abstract: Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages. First, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Second, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Third, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.

* Accepted by SIGGRAPH Asia 2024. Project and code: https://github.com/HITsz-TMG/Anim-Director

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Jun 17, 2024

Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang

Abstract: Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. It also encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than that of GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.

* 38 pages, 44 figures
