Alert button
Picture for Xiao Wang

Xiao Wang

Alert button

The Rise and Potential of Large Language Model Based Agents: A Survey

Sep 19, 2023
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui

For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are artificial entities that sense their environment, make decisions, and take actions. Many efforts have been made to develop intelligent agents, but they mainly focus on advancement in algorithms or training strategies to enhance specific capabilities or performance on particular tasks. Actually, what the community lacks is a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios. Due to the versatile capabilities they demonstrate, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI), offering hope for building general AI agents. Many researchers have leveraged LLMs as the foundation to build AI agents and have achieved significant progress. In this paper, we perform a comprehensive survey on LLM-based agents. We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. Building upon this, we present a general framework for LLM-based agents, comprising three main components: brain, perception, and action, and the framework can be tailored for different applications. Subsequently, we explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation. Following this, we delve into agent societies, exploring the behavior and personality of LLM-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society. Finally, we discuss several key topics and open problems within the field. A repository for the related papers at

* 86 pages, 12 figures 
Viaarxiv icon

A stochastic block model for community detection in attributed networks

Aug 31, 2023
Xiao Wang, Fang Dai, Wenyan Guo, Junfeng Wang

Community detection is an important content in complex network analysis. The existing community detection methods in attributed networks mostly focus on only using network structure, while the methods of integrating node attributes is mainly for the traditional community structures, and cannot detect multipartite structures and mixture structures in network. In addition, the model-based community detection methods currently proposed for attributed networks do not fully consider unique topology information of nodes, such as betweenness centrality and clustering coefficient. Therefore, a stochastic block model that integrates betweenness centrality and clustering coefficient of nodes for community detection in attributed networks, named BCSBM, is proposed in this paper. Different from other generative models for attributed networks, the generation process of links and attributes in BCSBM model follows the Poisson distribution, and the probability between community is considered based on the stochastic block model. Moreover, the betweenness centrality and clustering coefficient of nodes are introduced into the process of links and attributes generation. Finally, the expectation maximization algorithm is employed to estimate the parameters of the BCSBM model, and the node-community memberships is obtained through the hard division process, so the community detection is completed. By experimenting on six real-work networks containing different network structures, and comparing with the community detection results of five algorithms, the experimental results show that the BCSBM model not only inherits the advantages of the stochastic block model and can detect various network structures, but also has good data fitting ability due to introducing the betweenness centrality and clustering coefficient of nodes. Overall, the performance of this model is superior to other five compared algorithms.

Viaarxiv icon

Unified and Dynamic Graph for Temporal Character Grouping in Long Videos

Aug 29, 2023
Xiujun Shu, Wei Wen, Liangsheng Xu, Mingbao Lin, Ruizhi Qiao, Taian Guo, Hanjun Li, Bei Gan, Xiao Wang, Xing Sun

Figure 1 for Unified and Dynamic Graph for Temporal Character Grouping in Long Videos
Figure 2 for Unified and Dynamic Graph for Temporal Character Grouping in Long Videos
Figure 3 for Unified and Dynamic Graph for Temporal Character Grouping in Long Videos
Figure 4 for Unified and Dynamic Graph for Temporal Character Grouping in Long Videos

Video temporal character grouping locates appearing moments of major characters within a video according to their identities. To this end, recent works have evolved from unsupervised clustering to graph-based supervised clustering. However, graph methods are built upon the premise of fixed affinity graphs, bringing many inexact connections. Besides, they extract multi-modal features with kinds of models, which are unfriendly to deployment. In this paper, we present a unified and dynamic graph (UniDG) framework for temporal character grouping. This is accomplished firstly by a unified representation network that learns representations of multiple modalities within the same space and still preserves the modality's uniqueness simultaneously. Secondly, we present a dynamic graph clustering where the neighbors of different quantities are dynamically constructed for each node via a cyclic matching strategy, leading to a more reliable affinity graph. Thirdly, a progressive association method is introduced to exploit spatial and temporal contexts among different modalities, allowing multi-modal clustering results to be well fused. As current datasets only provide pre-extracted features, we evaluate our UniDG method on a collected dataset named MTCG, which contains each character's appearing clips of face and body and speaking voice tracks. We also evaluate our key components on existing clustering and retrieval datasets to verify the generalization ability. Experimental results manifest that our method can achieve promising results and outperform several state-of-the-art approaches.

Viaarxiv icon

Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification

Aug 23, 2023
Chengguo Yuan, Yu Jin, Zongzhen Wu, Fanting Wei, Yangzirui Wang, Lan Chen, Xiao Wang

Recognizing target objects using an event-based camera draws more and more attention in recent years. Existing works usually represent the event streams into point-cloud, voxel, image, etc, and learn the feature representations using various deep neural networks. Their final results may be limited by the following factors: monotonous modal expressions and the design of the network structure. To address the aforementioned challenges, this paper proposes a novel dual-stream framework for event representation, extraction, and fusion. This framework simultaneously models two common representations: event images and event voxels. By utilizing Transformer and Structured Graph Neural Network (GNN) architectures, spatial information and three-dimensional stereo information can be learned separately. Additionally, a bottleneck Transformer is introduced to facilitate the fusion of the dual-stream information. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets. The source code of this work is available at: \url{}

* Accepted by PRCV-2023 
Viaarxiv icon

High-performance Data Management for Whole Slide Image Analysis in Digital Pathology

Aug 20, 2023
Haoju Leng, Ruining Deng, Shunxing Bao, Dazheng Fang, Bryan A. Millis, Yucheng Tang, Haichun Yang, Xiao Wang, Yifan Peng, Lipeng Wan, Yuankai Huo

When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of data records holds relevance during each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system. This is particularly notable as patch-level processing introduces a considerable I/O load onto the computer system. However, this data management process could be further paralleled, given the typical independence of patch-level image processes across different patches. This paper details our endeavors in tackling this data access challenge by implementing the Adaptable IO System version 2 (ADIOS2). Our focus has been constructing and releasing a digital pathology-centric pipeline using ADIOS2, which facilitates streamlined data management across WSIs. Additionally, we've developed strategies aimed at curtailing data retrieval times. The performance evaluation encompasses two key scenarios: (1) a pure CPU-based image analysis scenario ("CPU scenario"), and (2) a GPU-based deep learning framework scenario ("GPU scenario"). Our findings reveal noteworthy outcomes. Under the CPU scenario, ADIOS2 showcases an impressive two-fold speed-up compared to the brute-force approach. In the GPU scenario, its performance stands on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS). From what we know, this appears to be among the initial instances, if any, of utilizing ADIOS2 within the field of digital pathology. The source code has been made publicly available at

Viaarxiv icon

Temporal Sentence Grounding in Streaming Videos

Aug 14, 2023
Tian Gan, Xiao Wang, Yan Sun, Jianlong Wu, Qingpei Guo, Liqiang Nie

Figure 1 for Temporal Sentence Grounding in Streaming Videos
Figure 2 for Temporal Sentence Grounding in Streaming Videos
Figure 3 for Temporal Sentence Grounding in Streaming Videos
Figure 4 for Temporal Sentence Grounding in Streaming Videos

This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source, and are always desired to be processed on-the-fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and process long historical frames effectively, which is untouched in the early methods. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.

* Accepted by ACM MM 2023 
Viaarxiv icon

SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

Aug 08, 2023
Xiao Wang, Zongzhen Wu, Yao Rong, Lin Zhu, Bo Jiang, Jin Tang, Yonghong Tian

Figure 1 for SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
Figure 2 for SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
Figure 3 for SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
Figure 4 for SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

Event camera-based pattern recognition is a newly arising research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, however, their results may be still limited due to the following two issues. Firstly, they adopt spatial sparse event streams for recognition only, which may fail to capture the color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition. However, seldom of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and propose a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., memory support Transformer network for RGB frame encoding, spiking neural network for raw event stream encoding, multi-modal bottleneck fusion module for RGB-Event feature aggregation, and prediction head. Due to the scarce of RGB-Event based classification dataset, we also propose a large-scale PokerEvent dataset which contains 114 classes, and 27102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validated the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and source code of this work will be released at

* In Peer Review 
Viaarxiv icon

Generative Query Reformulation for Effective Adhoc Search

Aug 01, 2023
Xiao Wang, Sean MacAvaney, Craig Macdonald, Iadh Ounis

Performing automatic reformulations of a user's query is a popular paradigm used in information retrieval (IR) for improving effectiveness -- as exemplified by the pseudo-relevance feedback approaches, which expand the query in order to alleviate the vocabulary mismatch problem. Recent advancements in generative language models have demonstrated their ability in generating responses that are relevant to a given prompt. In light of this success, we seek to study the capacity of such models to perform query reformulation and how they compare with long-standing query reformulation methods that use pseudo-relevance feedback. In particular, we investigate two representative query reformulation frameworks, GenQR and GenPRF. GenQR directly reformulates the user's input query, while GenPRF provides additional context for the query by making use of pseudo-relevance feedback information. For each reformulation method, we leverage different techniques, including fine-tuning and direct prompting, to harness the knowledge of language models. The reformulated queries produced by the generative models are demonstrated to markedly benefit the effectiveness of a state-of-the-art retrieval pipeline on four TREC test collections (varying from TREC 2004 Robust to the TREC 2019 Deep Learning). Furthermore, our results indicate that our studied generative models can outperform various statistical query expansion approaches while remaining comparable to other existing complex neural query reformulation models, with the added benefit of being simpler to implement.

* Accepted to Gen-IR@SIGIR2023 Workshop 
Viaarxiv icon