Die Zhang

StepAudio 2.5 Technical Report

May 22, 2026

Bin Lin, Bo Zhao, Boyong Wu, Chao Yan, Chen Wu, Cheng Yi, Chengyuan Yao, Daijiao Liu, Fei Tian, Feng Tian (+91 more)

Abstract: Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.

A Unified Game-Theoretic Interpretation of Adversarial Robustness

Nov 08, 2021

Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Yiting Chen, Xu Cheng, Xin Wang, Meng Zhou, Jie Shi (+1 more)

Abstract: This paper provides a unified view to explain different adversarial attacks and defense methods, i.e., the view of multi-order interactions between input variables of DNNs. Based on the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide a potential method to unify adversarial perturbations and robustness, which can explain the existing defense methods in a principled way. Our findings also revise the previous inaccurate understanding of the shape bias of adversarially learned features.

* The previous version is arXiv:2103.07364, but I mistakenly applied for a new ID for this paper.

Game-theoretic Understanding of Adversarially Learned Features

Mar 12, 2021

Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Xu Cheng, Xin Wang, Yiting Chen, Jie Shi, Quanshi Zhang

Abstract: This paper aims to understand adversarial attacks and defense from a new perspective, i.e., the signal-processing behavior of DNNs. We define a new multi-order interaction in game theory, which satisfies six properties. With the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide more insight into, and revise, the previous understanding of the shape bias of adversarially learned features. In addition, the multi-order interaction can also explain the recoverability of adversarial examples.
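
The abstract does not spell out the multi-order interaction, so the following is only a minimal sketch under an assumed form: the order-m interaction between variables i and j is taken to be the average, over all contexts S of size m that exclude i and j, of v(S ∪ {i, j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S). The function name `multi_order_interaction` and the toy game `toy_v` are hypothetical and stand in for a DNN's output on a masked input.

```python
from itertools import combinations


def multi_order_interaction(v, n, i, j, m):
    """Average interaction between i and j over all contexts of size m.

    v: set function mapping a frozenset of variable indices to a score
       (assumed stand-in for a model output on a masked input).
    n: total number of input variables, indexed 0..n-1.
    m: context size (the "order"), with 0 <= m <= n - 2.
    """
    context_pool = [p for p in range(n) if p not in (i, j)]
    total, count = 0.0, 0
    for context in combinations(context_pool, m):
        S = frozenset(context)
        # Marginal effect of adding i and j together, minus their separate effects.
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
        count += 1
    return total / count


def toy_v(S):
    # Hypothetical game: the output fires only when variables 0, 1 and 2 co-occur.
    return 1.0 if {0, 1, 2} <= S else 0.0


# Interaction between variables 0 and 1 at each order, with n = 4 variables.
orders = [multi_order_interaction(toy_v, 4, 0, 1, m) for m in range(3)]
print(orders)
```

In this toy game the interaction between variables 0 and 1 is zero with an empty context and grows as the context gets larger, which is the sense in which an interaction can be "high-order" rather than "low-order".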

Interpreting Multivariate Interactions in DNNs

Oct 15, 2020

Hao Zhang, Yichen Xie, Longjie Zheng, Die Zhang, Quanshi Zhang

Abstract: This paper aims to explain deep neural networks (DNNs) from the perspective of multivariate interactions. We define and quantify the significance of interactions among multiple input variables of the DNN. Input variables with strong interactions usually form a coalition and reflect prototype features, which are memorized and used by the DNN for inference. We define the significance of interactions based on the Shapley value, which is designed to assign the attribution value of each input variable to the inference. We have conducted experiments with various DNNs. Experimental results have demonstrated the effectiveness of the proposed method.
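
As a rough illustration of the Shapley value that the interaction metric builds on, the sketch below computes exact Shapley values by enumerating all coalitions of a small game. It is not the paper's method for measuring interaction significance; the names `shapley_values` and `toy_model` are hypothetical, and the toy set function stands in for a DNN's output when only a subset of input variables is present.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, v):
    """Exact Shapley values for players 0..n-1 under set function v.

    v maps a frozenset of player indices to a real-valued score.
    Enumeration is exponential in n, so this is only viable for small n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for k in range(n):
            # Probability weight of a coalition of size k preceding player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                S = frozenset(coalition)
                phi[i] += weight * (v(S | {i}) - v(S))
    return phi


def toy_model(S):
    # Hypothetical game: variable 0 contributes 1 on its own,
    # while variables 1 and 2 contribute 2 only when both are present.
    score = 0.0
    if 0 in S:
        score += 1.0
    if 1 in S and 2 in S:
        score += 2.0
    return score


phi = shapley_values(3, toy_model)
print(phi)
```

Variables 1 and 2 only matter as a pair here, so the Shapley value splits their joint contribution evenly between them; the attributions sum to the difference between the full-input score and the empty-input score.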

Interpreting Hierarchical Linguistic Interactions in DNNs

Jun 29, 2020

Die Zhang, Huilin Zhou, Xiaoyi Bao, Da Huo, Ruizhao Chen, Xu Cheng, Hao Zhang, Mengyue Wu, Quanshi Zhang

Abstract: This paper proposes a method to disentangle and quantify interactions among words that are encoded inside a DNN for natural language processing. We construct a tree to encode salient interactions extracted by the DNN. Six metrics are proposed to analyze properties of interactions between constituents in a sentence. The interaction is defined based on Shapley values of words, which are considered an unbiased estimate of word contributions to the network prediction. Our method is used to quantify word interactions encoded inside BERT, ELMo, LSTM, CNN, and Transformer networks. Experimental results provide a new perspective for understanding these DNNs and demonstrate the effectiveness of our method.
