Chen Zhang

SenseTime Research

Training Interactive Agent in Large FPS Game Map with Rule-enhanced Reinforcement Learning

Oct 07, 2024

Chen Zhang, Huan Hu, Yuan Zhou, Qiyang Cao, Ruochen Liu, Wenya Wei, Elvis S. Liu

Abstract:In the realm of competitive gaming, 3D first-person shooter (FPS) games have gained immense popularity, prompting the development of game AI systems to enhance gameplay. However, deploying game AI in practical scenarios still poses challenges, particularly in large-scale and complex FPS games. In this paper, we focus on the practical deployment of game AI in the online multiplayer competitive 3D FPS game called Arena Breakout, developed by Tencent Games. We propose a novel gaming AI system named Private Military Company Agent (PMCA), which is interactable within a large game map and engages in combat with players while utilizing tactical advantages provided by the surrounding terrain. To address the challenges of navigation and combat in modern 3D FPS games, we introduce a method that combines navigation mesh (Navmesh) and shooting-rule with deep reinforcement learning (NSRL). The integration of Navmesh enhances the agent's global navigation capabilities while shooting behavior is controlled using rule-based methods to ensure controllability. NSRL employs a DRL model to predict when to enable the navigation mesh, resulting in a diverse range of behaviors for the game AI. Customized rewards for human-like behaviors are also employed to align PMCA's behavior with that of human players.
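
The control split described in this abstract (a learned model deciding when to hand movement over to the navigation mesh, with shooting left to rules) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: `Observation`, its two fields, `policy_use_navmesh`, the 30-metre threshold, and the action labels are all hypothetical, and the trained DRL model is replaced by a fixed placeholder rule.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical features; the real agent's observation space is not given here.
    distance_to_goal: float  # distance to the current objective
    enemy_visible: bool      # whether an enemy is in line of sight


def policy_use_navmesh(obs: Observation) -> bool:
    """Placeholder for the DRL model's decision to enable Navmesh navigation.

    A trained network would produce this decision; the threshold rule below
    is an assumption used only to keep the sketch runnable.
    """
    return obs.distance_to_goal > 30.0 and not obs.enemy_visible


def act(obs: Observation) -> dict:
    """Combine the three components: gate, movement source, rule-based shooting."""
    move = "navmesh_path" if policy_use_navmesh(obs) else "learned_local_move"
    shoot = obs.enemy_visible  # shooting stays rule-based for controllability
    return {"move": move, "shoot": shoot}
```

For example, `act(Observation(100.0, False))` selects Navmesh movement without shooting, while the same distance with a visible enemy falls back to the learned local movement and triggers the shooting rule.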

EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis

Sep 27, 2024

Haoyu Wang, Chunyu Qiang, Tianrui Wang, Cheng Gong, Qiuyu Liu, Yu Jiang, Xiaobao Wang, Chenyang Wang, Chen Zhang

Abstract:Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts the output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this question, this paper proposes a two-stage prompt selection strategy EmoPro, which is specifically designed for emotionally controllable speech synthesis. This strategy focuses on selecting highly expressive and high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected using the proposed method result in more emotionally expressive and engaging synthesized speech compared to those obtained through baseline. Audio samples and codes will be available at https://whyrrrrun.github.io/EmoPro/.

Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models

Sep 27, 2024

Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T. Tan, Haizhou Li

Abstract:Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs toward the multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.

* EMNLP24 Findings

Disentangling Age and Identity with a Mutual Information Minimization Approach for Cross-Age Speaker Verification

Sep 24, 2024

Fengrun Zhang, Wangjin Zhou, Yiming Liu, Wang Geng, Yahui Shan, Chen Zhang

Abstract:There has been an increasing research interest in cross-age speaker verification (CASV). However, existing speaker verification systems perform poorly in CASV due to the great individual differences in voice caused by aging. In this paper, we propose a disentangled representation learning framework for CASV based on mutual information (MI) minimization. In our method, a backbone model is trained to disentangle the identity- and age-related embeddings from speaker information, and an MI estimator is trained to minimize the correlation between age- and identity-related embeddings via MI minimization, resulting in age-invariant speaker embeddings. Furthermore, by using the age gaps between positive and negative samples, we propose an aging-aware MI minimization loss function that allows the backbone model to focus more on the vocal changes with large age gaps. Experimental results show that the proposed method outperforms other methods on multiple Cross-Age test sets of Vox-CA.

* Interspeech 2024

Aligning Language Models Using Follow-up Likelihood as Reward Signal

Sep 20, 2024

Chen Zhang, Dading Chong, Feng Jiang, Chengguang Tang, Anningzhe Gao, Guohua Tang, Haizhou Li

Abstract:In natural human-to-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user's follow-up utterances as feedback signals to assess whether it has appropriately addressed the user's request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, "Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR's performance on reward modeling benchmarks and effectiveness in aligning the base policy model's helpfulness.

* 16 pages, reward model, LLM Alignment
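
The reward idea in this abstract can be illustrated with a small sketch: score each candidate response by how likely a follow-up utterance is after it, then keep the best and worst responses as a preference pair. Everything named here is an assumption for illustration: `flr_reward`, `mine_preference_pair`, and `toy_log_lik` are hypothetical names, a single positive follow-up is used, and `toy_log_lik` is a word-overlap stand-in for a real language model's log-likelihood.

```python
from typing import Callable, List, Tuple

# Assumed scorer signature: log p(follow_up | prompt, response) under some LM.
LogLik = Callable[[str, str, str], float]


def flr_reward(prompt: str, response: str, follow_up: str, log_lik: LogLik) -> float:
    """Reward of a response = log-likelihood of the follow-up that comes after it."""
    return log_lik(prompt, response, follow_up)


def mine_preference_pair(
    prompt: str, responses: List[str], follow_up: str, log_lik: LogLik
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-reward responses."""
    ranked = sorted(responses, key=lambda r: flr_reward(prompt, r, follow_up, log_lik))
    return ranked[-1], ranked[0]


def toy_log_lik(prompt: str, response: str, follow_up: str) -> float:
    """Stand-in scorer, NOT a language model: it ignores the follow-up text and
    rewards word overlap between prompt and response so the example is runnable."""
    shared = set(prompt.lower().split()) & set(response.lower().split())
    return float(len(shared)) - 5.0


chosen, rejected = mine_preference_pair(
    "how do I reverse a list in python",
    ["no idea", "use reversed on the list in python"],
    "Thanks, that works!",
    toy_log_lik,
)
```

In a real setup the scorer would be an actual language model, and the resulting `(chosen, rejected)` pairs would feed a preference-optimization method such as DPO.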

Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

Sep 16, 2024

Xiaoxue Gao, Chen Zhang, Yiming Chen, Huayun Zhang, Nancy F. Chen

Abstract:Current emotional text-to-speech (TTS) models predominantly conduct supervised training to learn the conversion from text and desired emotion to its emotional speech, focusing on a single emotion per text-speech pair. These models only learn the correct emotional outputs without fully comprehending other emotion characteristics, which limits their capabilities of capturing the nuances between different emotions. We propose a controllable Emo-DPO approach, which employs direct preference optimization to differentiate subtle emotional nuances between emotions through optimizing towards preferred emotions over less preferred emotional ones. Instead of relying on traditional neural architectures used in existing emotional TTS models, we propose utilizing the emotion-aware LLM-TTS neural architecture to leverage LLMs' in-context learning and instruction-following capabilities. Comprehensive experiments confirm that our proposed method outperforms the existing baselines.

* 5 pages
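
For readers unfamiliar with direct preference optimization, the standard per-pair DPO loss can be written in a few lines. This sketch shows only the generic objective; how Emo-DPO builds its pairs is not specified in the abstract, so treating the preferred-emotion output as "chosen" and the less preferred one as "rejected" is an assumption, as are the function and parameter names.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trainable policy and the
    frozen reference model. Returns -log(sigmoid(margin)), where
    margin = beta * [(policy - ref on chosen) - (policy - ref on rejected)].
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the chosen output than the reference does.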

A Compressive Memory-based Retrieval Approach for Event Argument Extraction

Sep 14, 2024

Wanlong Liu, Enqi Zhang, Li Zhou, Dingyi Zeng, Shaohuan Cheng, Chen Zhang, Malu Zhang, Wenyu Chen

Abstract:Recent works have demonstrated the effectiveness of retrieval augmentation in the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE methods have two main limitations: (1) input length constraints and (2) the gap between the retriever and the inference model. These issues limit the diversity and quality of the retrieved information. In this paper, we propose a Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the two limitations mentioned above. Our compressive memory, designed as a dynamic matrix that effectively caches retrieved information and supports continuous updates, overcomes the limitations of the input length. Additionally, after pre-loading all candidate demonstrations into the compressive memory, the model further retrieves and filters relevant information from memory based on the input query, bridging the gap between the retriever and the inference model. Extensive experiments show that our method achieves new state-of-the-art performance on three public datasets (RAMS, WikiEvents, ACE05), significantly outperforming existing retrieval-based EAE methods.

* 15 pages

Half-VAE: An Encoder-Free VAE to Bypass Explicit Inverse Mapping

Sep 06, 2024

Yuan-Hao Wei, Yan-Jie Sun, Chen Zhang

Abstract:Inference and inverse problems are closely related concepts, both fundamentally involving the deduction of unknown causes or parameters from observed data. Bayesian inference, a powerful class of methods, is often employed to solve a variety of problems, including those related to causal inference. Variational inference, a subset of Bayesian inference, is primarily used to efficiently approximate complex posterior distributions. Variational Autoencoders (VAEs), which combine variational inference with deep learning, have become widely applied across various domains. This study explores the potential of VAEs for solving inverse problems, such as Independent Component Analysis (ICA), without relying on an explicit inverse mapping process. Unlike other VAE-based ICA methods, this approach discards the encoder in the VAE architecture, directly setting the latent variables as trainable parameters. In other words, the latent variables are no longer outputs of the encoder but are instead optimized directly through the objective function to converge to appropriate values. We find that, with a suitable prior setup, the latent variables, represented by trainable parameters, can exhibit mutually independent properties as the parameters converge, all without the need for an encoding process. This approach, referred to as the Half-VAE, bypasses the inverse mapping process by eliminating the encoder. This study demonstrates the feasibility of using the Half-VAE to solve ICA without the need for an explicit inverse mapping process.
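
The core mechanism, replacing the encoder with per-sample latent variables that are trained directly, can be shown on a toy problem. The sketch below is not the paper's ICA setup: it assumes a one-dimensional linear decoder `x ≈ a*z + b` with unit noise variance, a standard normal prior, and a Gaussian posterior per sample whose mean and log standard deviation are free parameters. All names (`neg_elbo`, `fit_encoder_free`) and hyperparameters are hypothetical.

```python
import math


def neg_elbo(xs, a, b, mus, log_sigmas):
    """Negative ELBO (up to an additive constant) for the toy linear-Gaussian model."""
    total = 0.0
    for x, mu, ls in zip(xs, mus, log_sigmas):
        var = math.exp(2.0 * ls)
        # Expected reconstruction error under q(z) = N(mu, var).
        recon = 0.5 * ((x - a * mu - b) ** 2 + a * a * var)
        # KL(N(mu, var) || N(0, 1)).
        kl = 0.5 * (mu * mu + var - 1.0 - 2.0 * ls)
        total += recon + kl
    return total


def fit_encoder_free(xs, steps=500, lr=0.01):
    """Gradient descent on decoder weights AND per-sample latent parameters.

    No encoder is involved: mus and log_sigmas are trainable parameters,
    one pair per data point.
    """
    a, b = 1.0, 0.0
    mus = [0.0] * len(xs)
    log_sigmas = [0.0] * len(xs)
    initial = neg_elbo(xs, a, b, mus, log_sigmas)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        new_mus, new_ls = [], []
        for x, mu, ls in zip(xs, mus, log_sigmas):
            var = math.exp(2.0 * ls)
            resid = x - a * mu - b
            grad_a += -resid * mu + a * var
            grad_b += -resid
            new_mus.append(mu - lr * (-resid * a + mu))
            new_ls.append(ls - lr * ((a * a + 1.0) * var - 1.0))
        a -= lr * grad_a
        b -= lr * grad_b
        mus, log_sigmas = new_mus, new_ls
    return initial, neg_elbo(xs, a, b, mus, log_sigmas), a, b, mus


initial_loss, final_loss, a, b, mus = fit_encoder_free([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
```

The gradients are written out by hand to keep the example dependency-free; an automatic-differentiation library would normally take their place.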

Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models

Sep 05, 2024

Jie Ma, Zhitao Gao, Qi Chai, Wangchun Sun, Pinghui Wang, Hongbin Pei, Jing Tao, Lingyun Song, Jun Liu, Chen Zhang (+1 more)

Abstract:Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: excessively long reasoning paths distracting from the answer generation, and false-positive relations hindering the path refinement. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to perform answer trying after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at https://github.com/reml-group/DoG.

* 12 pages

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Sep 04, 2024

Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang

Abstract:Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.

* 19 pages, 7 figures, 6 tables
