Chen Zhang

SenseTime Research

A Survey on Multi-Turn Interaction Capabilities of Large Language Models

Jan 17, 2025

Chen Zhang, Xinyi Dai, Yaxiong Wu, Qu Yang, Yasheng Wang, Ruiming Tang, Yong Liu

Abstract:In dialogue system research, multi-turn interaction refers to a system's ability to maintain context across multiple dialogue turns, enabling it to generate coherent and contextually relevant responses. Recent advancements in large language models (LLMs) have significantly expanded the scope of multi-turn interaction, moving beyond chatbots to enable more dynamic agentic interactions with users or environments. In this paper, we provide a focused review of the multi-turn capabilities of LLMs, which are critical for a wide range of downstream applications, including conversational search and recommendation, consultation services, and interactive tutoring. This survey explores four key aspects: (1) the core model capabilities that contribute to effective multi-turn interaction, (2) how multi-turn interaction is evaluated in current practice, (3) the general algorithms used to enhance multi-turn interaction, and (4) potential future directions for research in this field.

* Draft Version, 14 pages, Ongoing refinement over time

Data and System Perspectives of Sustainable Artificial Intelligence

Jan 13, 2025

Tao Xie, David Harel, Dezhi Ran, Zhenwen Li, Maoliang Li, Zhi Yang, Leye Wang, Xiang Chen, Ying Zhang, Wentao Zhang(+4 more)

Abstract:Sustainable AI is a subfield of AI concerned with developing and using AI systems in ways that aim to reduce environmental impact and achieve sustainability. Sustainable AI is increasingly important given that training and inference with AI models such as large language models consume a large amount of computing power. In this article, we discuss current issues, opportunities and example solutions for addressing these issues, and future challenges to tackle, from the data and system perspectives, related to data acquisition, data processing, and AI model training and inference.

Dialogue Language Model with Large-Scale Persona Data Engineering

Dec 12, 2024

Mengze Hong, Chen Zhang, Chaotao Chen, Rongzhong Lian, Di Jiang

Abstract:Maintaining persona consistency is paramount in the application of open-domain dialogue systems, as exemplified by models like ChatGPT. Despite significant advancements, the limited scale and diversity of current persona dialogue datasets remain challenges to achieving robust persona-consistent dialogue models. In this study, drawing inspiration from the success of large-scale pre-training, we introduce PPDS, an open-domain persona dialogue system that employs extensive generative pre-training on a persona dialogue dataset to enhance persona consistency. Specifically, we present a persona extraction model designed to autonomously and precisely generate vast persona dialogue datasets. Additionally, we unveil a pioneering persona augmentation technique to address the invalid persona bias inherent in the constructed dataset. Both quantitative and human evaluations consistently highlight the superior response quality and persona consistency of our proposed model, underscoring its effectiveness.

ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

Dec 12, 2024

Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang

Abstract:Large language models (LLMs) have become a research hotspot. To accelerate the inference of LLMs, storing computed caches in memory has become the standard technique. However, as the inference length increases, growing KV caches might lead to out-of-memory issues. Many existing methods address this issue through KV cache compression, primarily by preserving key tokens throughout all layers to reduce information loss. Most of them allocate a uniform budget size for each layer to retain. However, we observe that the minimum budget sizes needed to retain essential information vary across layers and models based on the perspectives of attention and hidden state output. Building on this observation, this paper proposes a simple yet effective KV cache compression method that leverages layer uncertainty to allocate budget size for each layer. Experimental results show that the proposed method can reduce memory usage of the KV caches to only ~20% when compared to Full KV inference while achieving nearly lossless performance.
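
The idea of a non-uniform per-layer budget can be illustrated with a minimal sketch. It assumes each layer already has a scalar uncertainty score and splits a total cache budget in proportion to those scores, with a small floor per layer. The function name `allocate_layer_budgets`, the proportional rule, and the `min_per_layer` parameter are hypothetical and chosen only for illustration; the abstract does not state the actual allocation formula.

```python
def allocate_layer_budgets(uncertainty, total_budget, min_per_layer=1):
    """Split a total KV cache budget across layers.

    Assumption (not from the paper): budgets are proportional to a
    per-layer uncertainty score, with a guaranteed floor per layer.
    """
    n = len(uncertainty)
    if n == 0:
        return []
    spare = total_budget - n * min_per_layer
    if spare < 0:
        raise ValueError("total_budget is too small for the per-layer floor")

    total_u = sum(uncertainty)
    if total_u > 0:
        shares = [spare * u / total_u for u in uncertainty]
    else:
        # No signal at all: fall back to a uniform split.
        shares = [spare / n] * n

    # Round down, then hand leftover slots to the largest fractional parts.
    budgets = [min_per_layer + int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:max(leftover, 0)]:
        budgets[i] += 1
    return budgets
```

For a model with 4 layers and 1000 cached tokens per layer, a 20% overall budget corresponds to `total_budget=800`; a uniform scheme would give every layer 200 entries, whereas this sketch gives more entries to layers with higher scores.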

Technical Report for SoccerNet Challenge 2022 -- Replay Grounding Task

Oct 31, 2024

Shimin Chen, Wei Li, Jiaming Chu, Chen Chen, Chen Zhang, Yandong Guo

Abstract:To make full use of video information, we recast the replay grounding problem as a video action localization problem. We apply Faster-TAD, a unified network we proposed for temporal action detection, to obtain the replay grounding results. Finally, based on the observed distribution of the training data, we refine the model output to produce the final submission.

FreqMark: Invisible Image Watermarking via Frequency Based Optimization in Latent Space

Oct 28, 2024

Yiyang Guo, Ruizhe Li, Mude Hui, Hanzhong Guo, Chen Zhang, Chuangjian Cai, Le Wan, Shangfei Wang

Abstract:Invisible watermarking is essential for safeguarding digital content, enabling copyright protection and content authentication. However, existing watermarking methods fall short in robustness against regeneration attacks. In this paper, we propose a novel method called FreqMark that involves unconstrained optimization of the image latent frequency space obtained after VAE encoding. Specifically, FreqMark embeds the watermark by optimizing the latent frequency space of the images and then extracts the watermark through a pre-trained image encoder. This optimization allows a flexible trade-off between image quality and watermark robustness, and effectively resists regeneration attacks. Experimental results demonstrate that FreqMark offers significant advantages in image quality and robustness, permits flexible selection of the encoding bit number, and achieves a bit accuracy exceeding 90% when encoding a 48-bit hidden message under various attack scenarios.
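
The bit accuracy figure quoted above is commonly computed as the fraction of message bits recovered correctly. A minimal sketch of that metric follows; the function name `bit_accuracy` and the list-of-bits representation are assumptions for illustration, not FreqMark's code.

```python
def bit_accuracy(embedded, extracted):
    """Fraction of positions where the extracted bit equals the embedded bit."""
    if len(embedded) != len(extracted):
        raise ValueError("messages must have the same length")
    if not embedded:
        raise ValueError("messages must be non-empty")
    matches = sum(1 for a, b in zip(embedded, extracted) if a == b)
    return matches / len(embedded)
```

Under this definition, a 48-bit message stays above 90% accuracy as long as at most 4 bits are flipped (44/48 is roughly 0.917), while 5 flipped bits drop it below (43/48 is roughly 0.896).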

Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification

Oct 22, 2024

Yihong Luo, Yuhan Chen, Siya Qiu, Yiwei Wang, Chen Zhang, Yan Zhou, Xiaochun Cao, Jing Tang

Abstract:Graph Neural Networks (GNNs) have shown superior performance in node classification. However, GNNs perform poorly in the Few-Shot Node Classification (FSNC) task that requires robust generalization to make accurate predictions for unseen classes with limited labels. To tackle the challenge, we propose the integration of Sharpness-Aware Minimization (SAM)--a technique designed to enhance model generalization by finding a flat minimum of the loss landscape--into GNN training. The standard SAM approach, however, consists of two forward-backward steps in each training iteration, doubling the computational cost compared to the base optimizer (e.g., Adam). To mitigate this drawback, we introduce a novel algorithm, Fast Graph Sharpness-Aware Minimization (FGSAM), that integrates the rapid training of Multi-Layer Perceptrons (MLPs) with the superior performance of GNNs. Specifically, we utilize GNNs for parameter perturbation while employing MLPs to minimize the perturbed loss so that we can find a flat minimum with good generalization more efficiently. Moreover, our method reutilizes the gradient from the perturbation phase to incorporate graph topology into the minimization process at almost zero additional cost. To further enhance training efficiency, we develop FGSAM+ that executes exact perturbations periodically. Extensive experiments demonstrate that our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks. In particular, our FGSAM+ as a SAM variant offers a faster optimization than the base optimizer in most cases. In addition to FSNC, our proposed methods also demonstrate competitive performance in the standard node classification task for heterophilic graphs, highlighting the broad applicability. The code is available at https://github.com/draym28/FGSAM_NeurIPS24.

* NeurIPS24; The first two authors contributed equally to this work
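
To make the cost argument concrete, here is a minimal sketch of one step of standard SAM (not FGSAM) on a plain list of parameters. It shows the two gradient evaluations per iteration that the abstract refers to. The names `sam_step` and `quadratic_grad`, the toy quadratic loss, and the default values of `lr` and `rho` are assumptions for illustration.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One standard SAM update; grad_fn(w) returns the loss gradient at w."""
    # First forward-backward pass: gradient at the current weights.
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point

    # Move to the approximate worst-case point inside a ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]

    # Second forward-backward pass: gradient at the perturbed weights.
    g_adv = grad_fn(w_adv)

    # Descend from the original weights using the perturbed gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * sum(w_i ** 2)."""
    return list(w)
```

Calling `sam_step([1.0, -2.0], quadratic_grad)` invokes `quadratic_grad` twice, which is the doubled cost relative to a base optimizer that needs a single gradient per step.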

VoiceBench: Benchmarking LLM-Based Voice Assistants

Oct 22, 2024

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li

Abstract:Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress in the development of LLM-based voice assistants. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental factors, and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.

* Work in progress. Data is available at https://github.com/MatthewCYM/VoiceBench

MoDification: Mixture of Depths Made Easy

Oct 18, 2024

Chen Zhang, Meizhi Zhong, Qimeng Wang, Xuantao Lu, Zheyu Ye, Chengqiang Lu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang(+1 more)

Abstract:Long-context efficiency has recently become a trending topic in serving large language models (LLMs). Mixture of depths (MoD) has been proposed as a perfect fit to bring down both latency and memory. In this paper, however, we discover that MoD can barely transform existing LLMs without costly training over an extensive number of tokens. To enable the transformation of any LLM into an MoD one, we show that the top-k operator in MoD should be promoted to a threshold-p operator, and that refinements to architecture and data should be crafted alongside. All these designs form our method, termed MoDification. Through a comprehensive set of experiments covering model scales from 3B to 70B, we show that MoDification strikes an excellent balance between efficiency and effectiveness. MoDification can achieve up to ~1.2x speedup in latency and ~1.8x reduction in memory compared to original LLMs, especially in long-context applications.

* 12 pages, 9 figures, 5 tables, work in progress
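
The difference between the two routing operators can be sketched in a few lines. The sketch assumes each token has a scalar router score in [0, 1] and that routed tokens pass through the block while the rest skip it unchanged. The function names and the strict `>` comparison are illustrative assumptions; the abstract does not give MoDification's exact formulation.

```python
def top_k_route(scores, k):
    """Select the k highest-scoring token positions (needs the whole sequence)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def threshold_p_route(scores, p):
    """Select every position whose score exceeds p (decided per token)."""
    return [i for i, s in enumerate(scores) if s > p]


def apply_block(tokens, routed, block_fn):
    """Run block_fn on routed positions; all other tokens skip the block."""
    routed = set(routed)
    return [block_fn(t) if i in routed else t for i, t in enumerate(tokens)]
```

With scores `[0.9, 0.2, 0.6, 0.4]`, `top_k_route(scores, 2)` and `threshold_p_route(scores, 0.5)` both select positions 0 and 2, but lowering the threshold to 0.3 admits position 3 as well. The number of routed tokens therefore varies with the input under thresholding instead of being fixed in advance.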

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

Oct 09, 2024

Zhenhui Ye, Tianyun Zhong, Yi Ren, Ziyue Jiang, Jiawei Huang, Rongjie Huang, Jinglin Liu, Jinzheng He, Chen Zhang, Zehan Wang(+3 more)

Abstract:Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. Personalized TFG is a variant that emphasizes the perceptual identity similarity of the synthesized result (from the perspective of appearance and talking style). While previous works typically solve this problem by learning an individual neural radiance field (NeRF) for each identity to implicitly store its static and dynamic information, we find this approach inefficient and poorly generalizable due to the per-identity-per-training framework and the limited training data. To this end, we propose MimicTalk, the first attempt that exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG. To be specific, (1) we first come up with a person-agnostic 3D TFG model as the base model and propose to adapt it into a specific identity; (2) we propose a static-dynamic-hybrid adaptation pipeline to help the model learn the personalized static appearance and facial dynamic features; (3) to generate the facial motion of the personalized talking style, we propose an in-context stylized audio-to-motion model that mimics the implicit talking style provided in the reference video without the information loss caused by an explicit style representation. The adaptation process to an unseen identity can be performed in 15 minutes, which is 47 times faster than previous person-dependent methods. Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness. Source code and video samples are available at https://mimictalk.github.io.

* Accepted by NeurIPS 2024
