Xin Lv

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Aug 28, 2023
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Although large language models (LLMs) demonstrate impressive performance on many language tasks, most of them can only handle texts a few thousand tokens long, limiting their application to longer inputs such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long-context capabilities by extending the context window and introducing more sophisticated memory mechanisms. However, comprehensive benchmarks tailored to evaluating long context understanding have been lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas, including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. From a comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) The commercial model (GPT-3.5-Turbo-16k) outperforms the open-source models, but still struggles on longer contexts. (2) Scaled position embeddings and fine-tuning on longer sequences lead to substantial improvements in long context understanding. (3) Context compression techniques such as retrieval improve models with weak long-context ability, but their performance still lags behind models with strong long context understanding. The code and datasets are available at https://github.com/THUDM/LongBench.
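
A minimal sketch of how one might iterate over LongBench's unified format, assuming the Hugging Face hub copy at THUDM/LongBench and the field names ("context", "input", "answers") described in the repository; check the repo for the exact schema and per-task prompt templates.

```python
# Sketch: run one LongBench task through an arbitrary LLM.
# Assumes the THUDM/LongBench hub dataset and its unified fields;
# verify schema and prompts against the GitHub repository.
from datasets import load_dataset

def truncate_middle(text: str, max_chars: int = 16000) -> str:
    """Keep the head and tail of an over-long context (a common long-context trick)."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + "\n...\n" + text[-half:]

dataset = load_dataset("THUDM/LongBench", "hotpotqa", split="test")
for sample in dataset:
    prompt = truncate_middle(sample["context"]) + "\n\nQuestion: " + sample["input"]
    # prediction = my_model.generate(prompt)   # plug in any LLM here (hypothetical)
    # then score the prediction against sample["answers"] with the task's metric
```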

* 18 pages, 6 figures 

VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering

Jul 06, 2023
Zijun Yao, Yuanyong Chen, Xin Lv, Shulin Cao, Amy Xin, Jifan Yu, Hailong Jin, Jianjun Xu, Peng Zhang, Lei Hou, Juanzi Li

We present the Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates humans into the loop to edit and debug knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into the knowledge oriented programming language (KoPL), but also maps KoPL programs onto graphical elements. KoPL programs can be edited with simple graphical operators, such as dragging to add knowledge operators and slot filling to designate operator arguments. Moreover, VisKoP provides auto-completion for the knowledge base schema, and users can easily debug a KoPL program by checking its intermediate results. To facilitate practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back end. Experimental results show that VisKoP is highly efficient and that user interaction can fix a large portion of incorrect KoPL programs to obtain the correct answer. The VisKoP online demo https://demoviskop.xlore.cn (stable release of this paper) and https://viskop.xlore.cn (beta release with new features), the highly efficient KoPL engine https://pypi.org/project/kopl-engine, and a screencast video https://youtu.be/zAbJtxFPTXo are publicly available.
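
For illustration, here is one way to represent a KoPL program as the kind of step list VisKoP lets users drag and edit. The operator names follow the published KoPL function inventory (Find, Relate, FilterConcept, QueryAttr); the dict layout is an assumption for readability, not the kopl-engine's actual interface.

```python
# Illustrative only: a KoPL program as a list of operator steps, where each
# step names its function, arguments, and the earlier steps it consumes.
program = [
    {"function": "Find",          "inputs": ["LeBron James"],          "dependencies": []},
    {"function": "Relate",        "inputs": ["drafted by", "forward"], "dependencies": [0]},
    {"function": "FilterConcept", "inputs": ["basketball team"],       "dependencies": [1]},
    {"function": "QueryAttr",     "inputs": ["inception"],             "dependencies": [2]},
]

def show(program):
    # Print each step and its data flow, as a user would inspect
    # intermediate results while debugging in VisKoP.
    for i, step in enumerate(program):
        deps = ", ".join(f"#{d}" for d in step["dependencies"]) or "KB"
        print(f"#{i} {step['function']}({', '.join(step['inputs'])}) <- {deps}")

show(program)
```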


KoRC: Knowledge oriented Reading Comprehension Benchmark for Deep Text Understanding

Jul 06, 2023
Zijun Yao, Yantao Liu, Xin Lv, Shulin Cao, Jifan Yu, Lei Hou, Juanzi Li

Deep text understanding, which requires connecting a given document with prior knowledge beyond its text, has been highlighted by many benchmarks in recent years. However, these benchmarks have two major limitations. On the one hand, most of them require human annotation of knowledge, which leads to limited knowledge coverage. On the other hand, they usually use choices or spans in the text as answers, which results in a narrow answer space. To overcome these limitations, we build a new challenging benchmark named KoRC in this paper. Compared with previous benchmarks, KoRC has two advantages: broad knowledge coverage and a flexible answer format. Specifically, we utilize massive knowledge bases to guide annotators or large language models (LLMs) to construct knowledgeable questions. Moreover, we use labels in knowledge bases, rather than spans or choices, as the final answers. We test state-of-the-art models on KoRC, and the experimental results show that the strongest baseline achieves only 68.3% and 30.0% F1 on the in-distribution and out-of-distribution test sets, respectively. These results indicate that deep text understanding remains an unsolved challenge. The benchmark dataset, leaderboard, and baseline methods are released at https://github.com/THU-KEG/KoRC.
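
Since the abstract reports F1 over free-form answers, a minimal sketch of the token-level F1 commonly used for reading comprehension is shown below; KoRC's official scorer may normalize text differently, so treat this as illustrative.

```python
# Token-level F1 between a predicted answer and a gold label,
# in the SQuAD style; partial overlaps earn partial credit.
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the kingdom of denmark", "denmark"))  # 0.5: partial credit
```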


KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Jun 15, 2023
Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful design is essential for thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores, for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
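
As a sketch of why "standard scores" aid cross-task comparability: raw metrics (F1, ROUGE, accuracy) live on different scales, so rescaling each task's scores across models puts them on a common footing. The snippet below shows one plausible reading using z-scores; KoLA's exact normalization formula may differ.

```python
# Per-task score standardization: map each model's raw metric on one task
# to a z-score across all evaluated models, so tasks become comparable.
import statistics

def standardize(raw_scores: dict[str, float]) -> dict[str, float]:
    mean = statistics.mean(raw_scores.values())
    std = statistics.pstdev(raw_scores.values()) or 1.0  # guard against zero spread
    return {model: (score - mean) / std for model, score in raw_scores.items()}

task_rouge = {"model_a": 0.31, "model_b": 0.44, "model_c": 0.27}  # toy numbers
print(standardize(task_rouge))
```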


Benchmarking Foundation Models with Language-Model-as-an-Examiner

Jun 07, 2023
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou

Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues in previous benchmarking pipelines: testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its own knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility, as various LMs can be adopted as the examiner and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for broad knowledge acquisition, and to raise follow-up questions for a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result that aligns closely with human annotations. (3) We additionally propose a decentralized peer-examination method to address the biases of a single examiner. Our data and benchmarking results are available at https://lmexam.com.
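
A minimal sketch of the examiner loop follows: one LM writes questions and grades the examinee's answers reference-free. The `ask_lm` function is a hypothetical stand-in for whatever LM client is used; the framework's actual prompts, scoring scale, and follow-up logic are more elaborate.

```python
# Sketch: LM-as-examiner. The examiner LM generates a question, the
# examinee answers, and the examiner grades the answer without references.
def ask_lm(prompt: str) -> str:
    raise NotImplementedError("plug in your LM API call here")  # hypothetical stand-in

def examine(examinee_answer_fn, topic: str, n_questions: int = 3) -> list[dict]:
    records = []
    for _ in range(n_questions):
        question = ask_lm(f"As an examiner, write one factual question about {topic}.")
        answer = examinee_answer_fn(question)
        verdict = ask_lm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Grade this answer from 1 (wrong) to 5 (fully correct) and explain briefly."
        )
        records.append({"question": question, "answer": answer, "verdict": verdict})
    return records
```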

* 23 pages, 8 figures 

Reasoning over Hierarchical Question Decomposition Tree for Explainable Question Answering

May 24, 2023
Jiajie Zhang, Shulin Cao, Tingjia Zhang, Xin Lv, Jiaxin Shi, Qi Tian, Juanzi Li, Lei Hou

Explainable question answering (XQA) aims to answer a given question and provide an explanation of why the answer is selected. Existing XQA methods focus on reasoning over a single knowledge source, e.g., structured knowledge bases or unstructured corpora. However, integrating information from heterogeneous knowledge sources is essential for answering complex questions. In this paper, we propose to leverage question decomposition for heterogeneous knowledge integration, breaking a complex question down into simpler ones and selecting the appropriate knowledge source for each sub-question. To facilitate reasoning, we propose a novel two-stage XQA framework, Reasoning over Hierarchical Question Decomposition Tree (RoHT). First, we build the Hierarchical Question Decomposition Tree (HQDT) to understand the semantics of a complex question; then, we conduct probabilistic reasoning over the HQDT recursively from root to leaves, aggregating heterogeneous knowledge at different tree levels and searching for the best solution according to the decomposition and answering probabilities. Experiments on the complex QA datasets KQA Pro and Musique show that our framework significantly outperforms SOTA methods, demonstrating the effectiveness of question decomposition for knowledge integration and of our RoHT framework.
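
To make the root-to-leaves idea concrete, here is a hedged sketch of recursive reasoning over a decomposition tree: each node either answers its question directly from some knowledge source or aggregates its children's answers, keeping the higher-probability route. RoHT's actual probabilistic aggregation and source selection are more elaborate; `answer_fn` is a hypothetical stand-in for the KB/corpus answering modules.

```python
# Sketch: recursive scoring over a question decomposition tree,
# comparing the direct-answer route with the decomposed route.
from dataclasses import dataclass, field

@dataclass
class Node:
    question: str
    children: list["Node"] = field(default_factory=list)

def solve(node: Node, answer_fn) -> tuple[str, float]:
    """answer_fn(question) -> (answer, probability)."""
    best_answer, best_prob = answer_fn(node.question)   # try answering directly
    if node.children:                                   # try the decomposed route
        child_answers, probs = zip(*(solve(c, answer_fn) for c in node.children))
        combined = answer_fn(node.question + " given " + "; ".join(child_answers))
        decomposed_prob = combined[1] * min(probs)      # one simple way to combine
        if decomposed_prob > best_prob:
            best_answer, best_prob = combined[0], decomposed_prob
    return best_answer, best_prob
```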

* Accepted by ACL 2023 

Answering Complex Logical Queries on Knowledge Graphs via Query Computation Tree Optimization

Dec 21, 2022
Yushi Bai, Xin Lv, Juanzi Li, Lei Hou

Answering complex logical queries on incomplete knowledge graphs is a challenging and widely studied task. Embedding-based methods require training on complex queries and cannot generalize well to out-of-distribution query structures. Recent work frames this task as an end-to-end optimization problem requiring only a pretrained link predictor. However, due to the exponentially large combinatorial search space, the optimal solution can only be approximated, limiting the final accuracy. In this work, we propose QTO (Query Computation Tree Optimization), which can efficiently find the exact optimal solution. QTO finds the optimal solution via forward-backward propagation on the tree-like computation graph, i.e., the query computation tree. In particular, QTO utilizes the independence encoded in the query computation tree to reduce the search space, so that only local computations are involved during optimization. Experiments on 3 datasets show that QTO obtains state-of-the-art performance on complex query answering, outperforming the previous best results by an average of 22%. Moreover, QTO can interpret the intermediate solution for each one-hop atom in the query with over 90% accuracy.
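
As a toy illustration of the forward pass on a chain-shaped query computation tree: relation projection becomes a max-product step over a link predictor's score matrix, so each projection is a local computation. This is a sketch under simplifying assumptions (random toy scores, a pure chain query); QTO's actual forward-backward procedure also handles intersections and negations and recovers per-atom assignments on the backward pass.

```python
# Sketch: max-product forward propagation along a chain query
# e2 --r1--> ? --r2--> answer, using a toy link predictor.
import numpy as np

n_entities = 5
rng = np.random.default_rng(0)
# link_scores[r][h, t]: predictor's confidence that (h, r, t) holds (toy values)
link_scores = {"r1": rng.random((n_entities, n_entities)),
               "r2": rng.random((n_entities, n_entities))}

state = np.zeros(n_entities)
state[2] = 1.0                      # anchor entity e2 held with certainty 1
for relation in ["r1", "r2"]:
    # projection: best score over intermediate entities (max-product)
    state = (state[:, None] * link_scores[relation]).max(axis=0)

print("answer scores:", state)      # argmax gives the top answer entity
```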

* Code is available at https://github.com/bys0318/QTO 

Reconfigurable Intelligent Surfaces for 6G -- Applications, Challenges and Solutions

Dec 03, 2022
Yajun Zhao, Xin Lv

Scholars are expected to continuously strengthen the depth and breadth of theoretical research on reconfigurable intelligent surfaces (RISs), providing a higher theoretical upper bound for RIS engineering applications. Alongside breakthroughs in academic research, RIS technology has also made rapid progress in engineering application research and industrialization. This paper provides an overview of RIS engineering applications and presents a systematic, in-depth analysis of their challenges and candidate solutions. Future trends and challenges are also discussed.

* 22 

Step out of KG: Knowledge Graph Completion via Knowledgeable Retrieval and Reading Comprehension

Oct 12, 2022
Xin Lv, Yankai Lin, Zijun Yao, Kaisheng Zeng, Jiajie Zhang, Lei Hou, Juanzi Li

Knowledge graphs, as the cornerstone of many AI applications, usually suffer from serious incompleteness. In recent years, there have been many efforts to study automatic knowledge graph completion (KGC), most of which use existing knowledge to infer new knowledge. However, in our experiments we find that not all relations can be obtained by inference, which constrains the performance of existing models. To alleviate this problem, we propose a new model based on information retrieval and reading comprehension, namely IR4KGC. Specifically, we pre-train a knowledge-based information retrieval module that retrieves documents related to the triples to be completed. The retrieved documents are then handed to a reading comprehension module, which generates the predicted answers. In experiments, we find that our model handles well the relations that cannot be inferred from existing knowledge, and achieves good results on KGC datasets.
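
A hedged sketch of the retrieve-then-read idea follows: retrieve text about an incomplete triple, then let a reader model produce the missing entity. The `retriever` and `reader` objects are hypothetical stand-ins; the paper pre-trains a knowledge-based retriever and uses its own reading comprehension module.

```python
# Sketch: complete a (head, relation, ?) triple by retrieving documents
# about the triple and asking a reader model for the missing tail entity.
def complete_triple(head: str, relation: str, retriever, reader, k: int = 5) -> str:
    query = f"{head} {relation}"
    documents = retriever.search(query, top_k=k)        # docs about the triple
    context = "\n".join(doc.text for doc in documents)
    prompt = f"{context}\n\nQuestion: What is the {relation} of {head}?"
    return reader.generate(prompt)                      # predicted tail entity
```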


Network Coexistence Analysis of RIS-Assisted Wireless Communications

Jul 27, 2022
Yajun Zhao, Xin Lv

Reconfigurable intelligent surfaces (RISs) have attracted attention from academia and industry because of their ability to control the electromagnetic characteristics of the channel environment. However, it has been found that introducing an RIS may create new and more serious network coexistence problems; if these problems cannot be effectively solved, an RIS may even further degrade network performance. In this paper, an RIS network coexistence model is proposed and discussed in detail, and these problems are analysed in depth. Two novel RIS design mechanisms, a multilayer RIS structure with an out-of-band filter and an RIS blocking mechanism, are further explored. Finally, numerical results and a discussion are given.

* IEEE ACCESS VOLUME 10, 2022  
* 17 pages, 16 figures 