Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiangmin Xu

University of Glasgow, United Kingdom

Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Mar 13, 2025

Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu

Figure 1 for Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Figure 2 for Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Figure 3 for Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Figure 4 for Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Abstract:Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.

* CVPR 2025. Project website: https://github.com/sssaury/HAM

Via

Access Paper or Ask Questions

HedgeAgents: A Balanced-aware Multi-agent Financial Trading System

Feb 17, 2025

Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, Xiangmin Xu

Figure 1 for HedgeAgents: A Balanced-aware Multi-agent Financial Trading System

Figure 2 for HedgeAgents: A Balanced-aware Multi-agent Financial Trading System

Figure 3 for HedgeAgents: A Balanced-aware Multi-agent Financial Trading System

Figure 4 for HedgeAgents: A Balanced-aware Multi-agent Financial Trading System

Abstract:As automated trading gains traction in the financial market, algorithmic investment strategies are increasingly prominent. While Large Language Models (LLMs) and Agent-based models exhibit promising potential in real-time market analysis and trading decisions, they still experience a significant -20% loss when confronted with rapid declines or frequent fluctuations, impeding their practical application. Hence, there is an imperative to explore a more robust and resilient framework. This paper introduces an innovative multi-agent system, HedgeAgents, aimed at bolstering system robustness via ``hedging'' strategies. In this well-balanced system, an array of hedging agents has been tailored, where HedgeAgents consist of a central fund manager and multiple hedging experts specializing in various financial asset classes. These agents leverage LLMs' cognitive capabilities to make decisions and coordinate through three types of conferences. Benefiting from the powerful understanding of LLMs, our HedgeAgents attained a 70% annualized return and a 400% total return over a period of 3 years. Moreover, we have observed with delight that HedgeAgents can even formulate investment experience comparable to those of human experts (https://hedgeagents.github.io/).

* This paper has been accepted by The Web Conference 2025 (WWW 2025) and selected for an oral presentation

Via

Access Paper or Ask Questions

PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling

Dec 18, 2024

Haojie Xie, Yirong Chen, Xiaofen Xing, Jingkai Lin, Xiangmin Xu

Figure 1 for PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling

Figure 2 for PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling

Figure 3 for PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling

Figure 4 for PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling

Abstract:Currently, large language models (LLMs) have made significant progress in the field of psychological counseling. However, existing mental health LLMs overlook a critical issue where they do not consider the fact that different psychological counselors exhibit different personal styles, including linguistic style and therapy techniques, etc. As a result, these LLMs fail to satisfy the individual needs of clients who seek different counseling styles. To help bridge this gap, we propose PsyDT, a novel framework using LLMs to construct the Digital Twin of Psychological counselor with personalized counseling style. Compared to the time-consuming and costly approach of collecting a large number of real-world counseling cases to create a specific counselor's digital twin, our framework offers a faster and more cost-effective solution. To construct PsyDT, we utilize dynamic one-shot learning by using GPT-4 to capture counselor's unique counseling style, mainly focusing on linguistic style and therapy techniques. Subsequently, using existing single-turn long-text dialogues with client's questions, GPT-4 is guided to synthesize multi-turn dialogues of specific counselor. Finally, we fine-tune the LLMs on the synthetic dataset, PsyDTCorpus, to achieve the digital twin of psychological counselor with personalized counseling style. Experimental results indicate that our proposed PsyDT framework can synthesize multi-turn dialogues that closely resemble real-world counseling cases and demonstrate better performance compared to other baselines, thereby show that our framework can effectively construct the digital twin of psychological counselor with a specific counseling style.

* 9 pages, 6 figures

Via

Access Paper or Ask Questions

Shared Attention-based Autoencoder with Hierarchical Fusion-based Graph Convolution Network for sEEG SOZ Identification

Dec 17, 2024

Huachao Yan, Kailing Guo, Shiwei Song, Yihai Dai, Xiaoqiang Wei, Xiaofen Xing, Xiangmin Xu

Figure 1 for Shared Attention-based Autoencoder with Hierarchical Fusion-based Graph Convolution Network for sEEG SOZ Identification

Figure 2 for Shared Attention-based Autoencoder with Hierarchical Fusion-based Graph Convolution Network for sEEG SOZ Identification

Figure 3 for Shared Attention-based Autoencoder with Hierarchical Fusion-based Graph Convolution Network for sEEG SOZ Identification

Figure 4 for Shared Attention-based Autoencoder with Hierarchical Fusion-based Graph Convolution Network for sEEG SOZ Identification

Abstract:Diagnosing seizure onset zone (SOZ) is a challenge in neurosurgery, where stereoelectroencephalography (sEEG) serves as a critical technique. In sEEG SOZ identification, the existing studies focus solely on the intra-patient representation of epileptic information, overlooking the general features of epilepsy across patients and feature interdependencies between feature elements in each contact site. In order to address the aforementioned challenges, we propose the shared attention-based autoencoder (sATAE). sATAE is trained by sEEG data across all patients, with attention blocks introduced to enhance the representation of interdependencies between feature elements. Considering the spatial diversity of sEEG across patients, we introduce graph-based method for identification SOZ of each patient. However, the current graph-based methods for sEEG SOZ identification rely exclusively on static graphs to model epileptic networks. Inspired by the finding of neuroscience that epileptic network is intricately characterized by the interplay of sophisticated equilibrium between fluctuating and stable states, we design the hierarchical fusion-based graph convolution network (HFGCN) to identify the SOZ. HFGCN integrates the dynamic and static characteristics of epileptic networks through hierarchical weighting across different hierarchies, facilitating a more comprehensive learning of epileptic features and enriching node information for sEEG SOZ identification. Combining sATAE and HFGCN, we perform comprehensive experiments with sATAE-HFGCN on the self-build sEEG dataset, which includes sEEG data from 17 patients with temporal lobe epilepsy. The results show that our method, sATAE-HFGCN, achieves superior performance for identifying the SOZ of each patient, effectively addressing the aforementioned challenges, providing an efficient solution for sEEG-based SOZ identification.

Via

Access Paper or Ask Questions

SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation

Oct 22, 2024

Huimin Zheng, Xiaofeng Xing, Xiangmin Xu

Figure 1 for SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation

Figure 2 for SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation

Figure 3 for SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation

Figure 4 for SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation

Abstract:We present a novel approach to personalized sleep health management using few-shot Chain-of-Thought (CoT) distillation, enabling small-scale language models (> 2B parameters) to rival the performance of large language models (LLMs) in specialized health domains. Our method simultaneously distills problem-solving strategies, long-tail expert knowledge, and personalized recommendation capabilities from larger models into more efficient, compact models. Unlike existing systems, our approach offers three key functionalities: generating personalized sleep health recommendations, supporting user-specific follow-up inquiries, and providing responses to domain-specific knowledge questions. We focus on sleep health due to its measurability via wearable devices and its impact on overall well-being. Our experimental setup, involving GPT-4o for data synthesis, Qwen-max for instruction set creation, and Qwen2.5 1.5B for model distillation, demonstrates significant improvements over baseline small-scale models in penalization, reasoning, and knowledge application. Experiments using 100 simulated sleep reports and 1,000 domain-specific questions shows our model achieves comparable performance to larger models while maintaining efficiency for real-world deployment. This research not only advances AI-driven health management but also provides a novel approach to leveraging LLM capabilities in resource-constrained environments, potentially enhancing the accessibility of personalized healthcare solutions.

Via

Access Paper or Ask Questions

Task-Oriented Edge-Assisted Cooperative Data Compression, Communications and Computing for UGV-Enhanced Warehouse Logistics

Oct 02, 2024

Jiaming Yang, Zhen Meng, Xiangmin Xu, Kan Chen, Emma Liying Li, Philip Guodong G. Zhao

Figure 1 for Task-Oriented Edge-Assisted Cooperative Data Compression, Communications and Computing for UGV-Enhanced Warehouse Logistics

Figure 2 for Task-Oriented Edge-Assisted Cooperative Data Compression, Communications and Computing for UGV-Enhanced Warehouse Logistics

Figure 3 for Task-Oriented Edge-Assisted Cooperative Data Compression, Communications and Computing for UGV-Enhanced Warehouse Logistics

Figure 4 for Task-Oriented Edge-Assisted Cooperative Data Compression, Communications and Computing for UGV-Enhanced Warehouse Logistics

Abstract:Only the chairs can edit This paper explores the growing need for task-oriented communications in warehouse logistics, where traditional communication Key Performance Indicators (KPIs)-such as latency, reliability, and throughput-often do not fully meet task requirements. As the complexity of data flow management in large-scale device networks increases, there is also a pressing need for innovative cross-system designs that balance data compression, communication, and computation. To address these challenges, we propose a task-oriented, edge-assisted framework for cooperative data compression, communication, and computing in Unmanned Ground Vehicle (UGV)-enhanced warehouse logistics. In this framework, two UGVs collaborate to transport cargo, with control functions-navigation for the front UGV and following/conveyance for the rear UGV-offloaded to the edge server to accommodate their limited on-board computing resources. We develop a Deep Reinforcement Learning (DRL)-based two-stage point cloud data compression algorithm that dynamically and collaboratively adjusts compression ratios according to task requirements, significantly reducing communication overhead. System-level simulations of our UGV logistics prototype demonstrate the framework's effectiveness and its potential for swift real-world implementation.

Via

Access Paper or Ask Questions

Multi-Scale Temporal Transformer For Speech Emotion Recognition

Oct 01, 2024

Zhipeng Li, Xiaofen Xing, Yuanbo Fang, Weibin Zhang, Hengsheng Fan, Xiangmin Xu

Figure 1 for Multi-Scale Temporal Transformer For Speech Emotion Recognition

Figure 2 for Multi-Scale Temporal Transformer For Speech Emotion Recognition

Figure 3 for Multi-Scale Temporal Transformer For Speech Emotion Recognition

Figure 4 for Multi-Scale Temporal Transformer For Speech Emotion Recognition

Abstract:Speech emotion recognition plays a crucial role in human-machine interaction systems. Recently various optimized Transformers have been successfully applied to speech emotion recognition. However, the existing Transformer architectures focus more on global information and require large computation. On the other hand, abundant speech emotional representations exist locally on different parts of the input speech. To tackle these problems, we propose a Multi-Scale TRansfomer (MSTR) for speech emotion recognition. It comprises of three main components: (1) a multi-scale temporal feature operator, (2) a fractal self-attention module, and (3) a scale mixer module. These three components can effectively enhance the transformer's ability to learn multi-scale local emotion representations. Experimental results demonstrate that the proposed MSTR model significantly outperforms a vanilla Transformer and other state-of-the-art methods across three speech emotion datasets: IEMOCAP, MELD and, CREMAD. In addition, it can greatly reduce the computational cost.

Via

Access Paper or Ask Questions

Online Multi-level Contrastive Representation Distillation for Cross-Subject fNIRS Emotion Recognition

Sep 24, 2024

Zhili Lai, Chunmei Qing, Junpeng Tan, Wanxiang Luo, Xiangmin Xu

Abstract:Utilizing functional near-infrared spectroscopy (fNIRS) signals for emotion recognition is a significant advancement in understanding human emotions. However, due to the lack of artificial intelligence data and algorithms in this field, current research faces the following challenges: 1) The portable wearable devices have higher requirements for lightweight models; 2) The objective differences of physiology and psychology among different subjects aggravate the difficulty of emotion recognition. To address these challenges, we propose a novel cross-subject fNIRS emotion recognition method, called the Online Multi-level Contrastive Representation Distillation framework (OMCRD). Specifically, OMCRD is a framework designed for mutual learning among multiple lightweight student networks. It utilizes multi-level fNIRS feature extractor for each sub-network and conducts multi-view sentimental mining using physiological signals. The proposed Inter-Subject Interaction Contrastive Representation (IS-ICR) facilitates knowledge transfer for interactions between student models, enhancing cross-subject emotion recognition performance. The optimal student network can be selected and deployed on a wearable device. Some experimental results demonstrate that OMCRD achieves state-of-the-art results in emotional perception and affective imagery tasks.

* Accepted in ACMMM-2024 Workshop BCI. Codes are available at https://github.com/Lzhili/fNIRS-OMCRD

Via

Access Paper or Ask Questions

AS-Speech: Adaptive Style For Speech Synthesis

Sep 09, 2024

Zhipeng Li, Xiaofen Xing, Jun Wang, Shuaiqi Chen, Guoqiao Yu, Guanglu Wan, Xiangmin Xu

Figure 1 for AS-Speech: Adaptive Style For Speech Synthesis

Figure 2 for AS-Speech: Adaptive Style For Speech Synthesis

Figure 3 for AS-Speech: Adaptive Style For Speech Synthesis

Figure 4 for AS-Speech: Adaptive Style For Speech Synthesis

Abstract:In recent years, there has been significant progress in Text-to-Speech (TTS) synthesis technology, enabling the high-quality synthesis of voices in common scenarios. In unseen situations, adaptive TTS requires a strong generalization capability to speaker style characteristics. However, the existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm attributes separately. In this paper, we propose AS-Speech, an adaptive style methodology that integrates the speaker timbre characteristics and rhythmic attributes into a unified framework for text-to-speech synthesis. Specifically, AS-Speech can accurately simulate style characteristics through fine-grained text-based timbre features and global rhythm information, and achieve high-fidelity speech synthesis through the diffusion model. Experiments show that the proposed model produces voices with higher naturalness and similarity in terms of timbre and rhythm compared to a series of adaptive TTS models.

* Accepted by SLT 2024

Via

Access Paper or Ask Questions

Timeliness-Fidelity Tradeoff in 3D Scene Representations

Jul 23, 2024

Xiangmin Xu, Zhen Meng, Yichi Zhang, Changyang She, Philip G. Zhao

Figure 1 for Timeliness-Fidelity Tradeoff in 3D Scene Representations

Figure 2 for Timeliness-Fidelity Tradeoff in 3D Scene Representations

Figure 3 for Timeliness-Fidelity Tradeoff in 3D Scene Representations

Figure 4 for Timeliness-Fidelity Tradeoff in 3D Scene Representations

Abstract:Real-time three-dimensional (3D) scene representations serve as one of the building blocks that bolster various innovative applications, e.g., digital manufacturing, Virtual/Augmented/Extended/Mixed Reality (VR/AR/XR/MR), and the metaverse. Despite substantial efforts that have been made to real-time communications and computing, real-time 3D scene representations remain a challenging task. This paper investigates the tradeoff between timeliness and fidelity in real-time 3D scene representations. Specifically, we establish a framework to evaluate the impact of communication delay on the tradeoff, where the real-world scenario is monitored by multiple cameras that communicate with an edge server. To improve fidelity for 3D scene representations, we propose to use a single-step Proximal Policy Optimization (PPO) method that leverages the Age of Information (AoI) to decide if the received image needs to be involved in 3D scene representations and rendering. We test our framework and the proposed approach with different well-known 3D scene representation methods. Simulation results reveal that real-time 3D scene representation can be sensitively affected by communication delay, and our proposed method can achieve optimal 3D scene representation results.

* This paper has been accepted for publication by the IEEE International Conference on Computer Communications (INFOCOM) Workshops 2024

Via

Access Paper or Ask Questions