Kaidi Wang

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

Jun 08, 2026

Wenhao Guan, Yifan Duan, Junxi Liu, Yu Gu, Feng Dang, Kaidi Wang, Qingyang Hong, Lin Li, Xie Chen

Abstract:Video dubbing is a cornerstone of multimedia content creation, aiming to synthesize synchronized acoustic sequences for visual streams. While Text-to-Speech (TTS) and Text-to-Audio (TTA) generation have each achieved remarkable progress, existing dubbing systems remain confined to isolated speech synthesis without incorporating sound effects and ambient audio, forcing practitioners to rely on fragmented workflows and laborious manual post-mixing. To address this limitation, we present HoliDubber, a holistic video dubbing framework that moves beyond speech-only generation by enabling the joint synthesis of speech and sound effects from a single text prompt. Specifically, HoliDubber adopts a patch-based autoregressive diffusion transformer architecture, where a causal language model autoregressively models aggregated patch embeddings to capture global temporal structure, and a Diffusion Transformer decoder generates high-fidelity continuous tokens within each patch, following a divide-and-conquer strategy. To achieve cross-modal alignment, visual features are encoded into patch-level representations and fused with audio patches via cross-attention, enabling the model to ground speech generation in the speaker's visual articulation dynamics. In addition, we introduce HoliDub-Bench, a benchmark curated from established datasets with synchronized video-text-audio triplets designed for holistic dubbing evaluation. Extensive experiments demonstrate that HoliDubber significantly outperforms existing methods across multiple benchmarks in speech quality, synchronization, and speaker similarity. Furthermore, results on HoliDub-Bench validate the effectiveness of joint speech-and-sound generation, establishing a new paradigm for holistic video dubbing in complex acoustic scenes. The demo page of the project is https://holidubber.github.io

Energy Efficiency Maximization for Discrete Activation based NOMA-assisted Pinching-Antenna Systems

Apr 27, 2026

Yishi Zhang, Aditya Powari, Kaidi Wang, Yaru Fu, Daniel K. C. So

Abstract:Pinching-antenna systems (PASS) have recently attracted significant attention as a promising architecture for flexible and reconfigurable wireless communications. Despite notable advancements, research on energy efficiency (EE) maximization for PASS remains limited, as existing studies mainly focus on transmit power minimization or use a simplified power consumption model. This paper evaluates the impact of pinching antenna (PA) activation power on EE maximization in a downlink NOMA-assisted PASS by jointly optimizing PA activation and user power allocation under quality-of-service and transmit power constraints. To tackle the resulting mixed-integer nonlinear programming problem, we develop a two-layer iterative algorithm, where the outer layer performs matching-based PA selection and the inner layer computes a closed-form optimal power allocation solution. Numerical results demonstrate that the proposed solution achieves substantial EE gains over conventional fixed-antenna systems and the considered benchmark schemes, approaches the exhaustive-search upper bound with significantly reduced complexity, and converges quickly. The results also demonstrate the importance of accounting for PA activation power in the EE maximization problem.

* 5 pages, 4 figures. Submitted to IEEE Wireless Communications Letters
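To make the role of activation power in the abstract above concrete, the following is a minimal sketch of an energy-efficiency ratio. The additive power model (transmit power plus a per-antenna activation cost plus a fixed circuit term) and the names `energy_efficiency`, `p_activation`, and `p_circuit` are assumptions for illustration only, not the model used in the paper.

```python
def energy_efficiency(rates, tx_powers, n_active, p_activation, p_circuit=0.0):
    """Sum rate divided by total consumed power.

    Assumed (hypothetical) power model:
        total = sum(tx_powers) + n_active * p_activation + p_circuit
    """
    total_power = sum(tx_powers) + n_active * p_activation + p_circuit
    if total_power <= 0:
        raise ValueError("total power must be positive")
    return sum(rates) / total_power


# Same rates and transmit powers, different numbers of active antennas:
# under this model every extra activated antenna lowers the ratio.
ee_two_active = energy_efficiency([2.0, 3.0], [1.0, 1.0], n_active=2, p_activation=0.5)
ee_four_active = energy_efficiency([2.0, 3.0], [1.0, 1.0], n_active=4, p_activation=0.5)
```

Under such a model, activating an additional antenna only pays off if the rate gain outweighs its activation cost, which is the trade-off that a joint activation and power-allocation design has to resolve.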

Leaky-Coaxial Pinching-Antenna System with Adjustable Slot Apertures

Apr 25, 2026

Kaidi Wang, Daniel K. C. So, Zhiguo Ding, George K. Karagiannidis

Abstract:As a practical physical implementation of pinching-antenna systems, leaky coaxial cable (LCX) enables distributed radiation in more general wireless environments, particularly for lower-frequency applications. In this paper, a leaky-coaxial pinching-antenna system, referred to as the LCX pinching-antenna system, is investigated, and adjustable slot apertures are introduced, such that the slot size can be continuously adjusted rather than being restricted to binary activation. Specifically, the aperture adjustment is modeled as amplitude scaling of the channels induced by the corresponding slots, or equivalently, as power coefficients associated with different slots. Accordingly, analytical results are derived to quantify the performance gain of continuous aperture adjustment over binary slot activation and to reveal the impact of channel coherence on the achievable data rate improvement. Furthermore, static and dynamic time-division multiple access (TDMA) schemes are considered, and the corresponding sum rate maximization problems are formulated and efficiently solved by quadratic transform based optimization, combined with successive convex approximation and alternating updates. Simulation results demonstrate that the proposed design can significantly outperform conventional fixed-antenna systems, traditional LCX schemes, and binary slot activation in terms of both achievable sum rate and outage probability.

Leaky Coaxial Cable based Generalized Pinching-Antenna Systems with Dual-Port Feeding

Feb 25, 2026

Kaidi Wang, Zhiguo Ding, Daniel K. C. So

Abstract:By leveraging the distributed leakage radiation of leaky coaxial cables (LCXs), the concept of pinching antennas can be generalized from the conventional high-frequency waveguide based architectures to cable based structures in lower-frequency scenarios. This paper investigates an LCX based generalized pinching-antenna system with dual-port feeding. By enabling bidirectional excitation along each cable, the proposed design significantly enhances spatial degrees of freedom. A comprehensive channel model is developed to characterize intra-cable attenuation, bidirectional phase progression, slot based radiation, and wireless propagation. Based on this model, both analog and hybrid beamforming frameworks are studied with the objective of maximizing the minimum achievable data rate. For analog transmission, slot activation, port selection, and power allocation are jointly optimized using matching theory, coalitional games, and bisection based power control. For hybrid transmission, zero-forcing (ZF) digital precoding is incorporated to eliminate inter-user interference, thereby simplifying slot activation and enabling closed-form optimal power allocation. Simulation results demonstrate that dual-port feeding provides notable performance gains over single-port LCX systems and fixed-antenna benchmarks, validating the effectiveness of the proposed beamforming and resource allocation designs under various transmit power levels and cable parameters.
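The zero-forcing step mentioned in the abstract above can be illustrated with a generic textbook precoder. This is a sketch under stated assumptions: `H` stands for a hypothetical effective channel matrix (users by feed ports) with full row rank, and `zf_precoder` is an illustrative name; the paper's own channel model and power allocation are not reproduced here.

```python
import numpy as np


def zf_precoder(H):
    """Zero-forcing precoder for a (K users x N ports) channel with K <= N.

    Uses the right pseudo-inverse H^H (H H^H)^{-1}, then scales each
    column to unit norm so that power can be allocated per user afterwards.
    """
    H = np.asarray(H, dtype=complex)
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
    return W / np.linalg.norm(W, axis=0, keepdims=True)


# Random example channel: 3 users, 5 ports (purely illustrative values).
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))
W = zf_precoder(H)
effective = H @ W  # off-diagonal entries are (numerically) zero
```

Because `H @ W` is diagonal, each user sees only its own stream, so the per-user rates decouple. That decoupling is what makes a closed-form power allocation plausible in the hybrid case.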

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Oct 06, 2025

Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

Abstract:Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can match or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation.
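The mask switching described in the abstract above can be sketched as follows. This is only an illustration of causal versus bidirectional attention masks in general; the function names `attention_mask` and `masked_softmax` and the `mode` strings are assumptions, and the actual UniVoice implementation may differ.

```python
import numpy as np


def attention_mask(seq_len, mode):
    """Boolean mask where entry (i, j) is True if query i may attend to key j."""
    if mode == "causal":  # recognition: each position sees only the past and itself
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if mode == "bidirectional":  # synthesis: every position sees the whole sequence
        return np.ones((seq_len, seq_len), dtype=bool)
    raise ValueError(f"unknown mode: {mode}")


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, with masked-out keys given zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


scores = np.zeros((3, 3))  # uniform scores, to make the effect of the mask visible
causal_weights = masked_softmax(scores, attention_mask(3, "causal"))
full_weights = masked_softmax(scores, attention_mask(3, "bidirectional"))
```

With uniform scores, the causal weights spread evenly over the visible prefix of each row, while the bidirectional weights spread evenly over all positions. The same attention layer can therefore serve both tasks by changing only the mask.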

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

May 30, 2025

Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li

Abstract:Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.

DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec

May 30, 2025

Peijie Chen, Wenhao Guan, Kaidi Wang, Weijie Wu, Hukai Huang, Qingyang Hong, Lin Li

Abstract:Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec featuring a dual-stage training framework that switches between mirror and non-mirror architectures, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and the training strategy balances the advantages of mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.

* Accepted to Interspeech 2025

Antenna Activation and Resource Allocation in Multi-Waveguide Pinching-Antenna Systems

May 03, 2025

Kaidi Wang, Zhiguo Ding, George K. Karagiannidis

Abstract:Pinching antennas, as a novel flexible-antenna technology capable of establishing line of sight (LoS) connections and effectively mitigating large-scale path loss, have recently attracted considerable research interest. However, the implementation of ideal pinching-antenna systems involves determining and adjusting pinching antennas to an arbitrary position on waveguides, which presents challenges to both practical deployment and related optimization. This paper investigates a practical pinching-antenna system in multi-waveguide scenarios, where pinching antennas are installed at pre-configured discrete positions to serve downlink users with non-orthogonal multiple access (NOMA). To improve system throughput, a sophisticated optimization problem is formulated by jointly considering waveguide assignment, antenna activation, successive interference cancellation (SIC) decoding order design, and power allocation. By treating waveguide assignment and antenna activation as two coalition-formation games, a novel game-theoretic algorithm is developed, in which the optimal decoding order is derived and incorporated. For power allocation, monotonic optimization and successive convex approximation (SCA) are employed to construct globally optimal and low-complexity solutions, respectively. Simulation results demonstrate that the NOMA-based pinching-antenna system exhibits superior performance compared to the considered benchmark systems, and the proposed solutions provide significant improvement in terms of sum rate and outage probability.
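As a rough illustration of how SIC decoding order and power allocation interact in the downlink NOMA setting described above, the sketch below computes per-user rates under the textbook ordering by channel gain. That ordering, the single-cluster setup, and the name `noma_downlink_rates` are assumptions for illustration; the decoding order derived in the paper may differ.

```python
import numpy as np


def noma_downlink_rates(gains, powers, noise_power):
    """Per-user rates (bit/s/Hz) for single-cluster downlink NOMA with SIC.

    Assumes each user cancels the signals of all users with weaker channel
    gains and treats the signals of users with stronger gains as interference.
    """
    gains = np.asarray(gains, dtype=float)
    powers = np.asarray(powers, dtype=float)
    rates = np.empty_like(gains)
    for k in range(len(gains)):
        stronger = gains > gains[k]  # signals that user k cannot cancel
        interference = gains[k] * powers[stronger].sum()
        sinr = powers[k] * gains[k] / (interference + noise_power)
        rates[k] = np.log2(1.0 + sinr)
    return rates


# Two users: the weaker user (gain 1.0) gets more power than the stronger one (gain 4.0).
rates = noma_downlink_rates(gains=[1.0, 4.0], powers=[3.0, 1.0], noise_power=1.0)
```

In this example the weak user has an SINR of 1.5 because it is interfered with by the strong user's signal, while the strong user removes the weak user's signal and has an interference-free SINR of 4.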

SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow

Apr 10, 2025

Kaidi Wang, Wenhao Guan, Shenghui Lu, Jianglong Yao, Lin Li, Qingyang Hong

Abstract:Recently, flow matching based speech synthesis has significantly enhanced the quality of synthesized speech while reducing the number of inference steps. In this paper, we introduce SlimSpeech, a lightweight and efficient speech synthesis system based on rectified flow. We build upon an existing rectified-flow-based speech synthesis method, modifying its structure to reduce parameters and using the result as a teacher model. By refining the reflow operation, we directly derive a smaller model with a straighter sampling trajectory from the larger model, while utilizing distillation techniques to further enhance the model performance. Experimental results demonstrate that our proposed method, with significantly reduced model parameters, achieves comparable performance to larger models through one-step sampling.
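The link between a straight sampling trajectory and one-step sampling can be seen in a toy Euler sampler. This is a minimal sketch, assuming a hypothetical `velocity(x, t)` function standing in for a trained model; it is not the SlimSpeech architecture or its reflow procedure.

```python
import numpy as np


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


# Toy case: a perfectly straight flow from noise `start` to data `target`,
# whose velocity is the constant displacement between the two.
start = np.array([0.0, 1.0, -2.0])
target = np.array([3.0, -1.0, 0.5])


def straight_velocity(x, t):
    return target - start


one_step = euler_sample(straight_velocity, start, num_steps=1)
ten_steps = euler_sample(straight_velocity, start, num_steps=10)
```

For a perfectly straight flow the one-step and ten-step results coincide, since Euler integration is exact for a constant velocity. The straighter a learned trajectory is, the less accuracy is lost by cutting the number of steps.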

Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

Jan 16, 2025

Yushen Lin, Ruichen Zhang, Wenqi Huang, Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Dusit Niyato

Abstract:In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) specifically for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data, with performance boosts of 2.24% and 1.31%, respectively, for different models compared to baselines. To demonstrate the effectiveness of the fine-tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non-orthogonal multiple access (NOMA), which are generated by using the proposed multi-agent framework. Simulation results show a significant performance gain of 20.9% in the ROUGE-L metric on summarization tasks. We also study the scaling laws of fine-tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. This dataset and fine-tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications.

* 13 pages, 13 figures, journal
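For reference, pointwise V-information is commonly defined in the dataset-difficulty literature as shown below. The notation here is an assumption carried over from that general definition: $g$ is a model fine-tuned on input-label pairs, $g'$ is the same model fine-tuned with an empty input $\varnothing$, and $g[x](y)$ is the probability assigned to the gold label $y$ given input $x$. The exact variant used in the paper above may differ.

```latex
\mathrm{PVI}(x \rightarrow y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y)
```

A high PVI means the input makes the correct label much easier to predict than the label prior alone, which is why the quantity can serve as a per-example measure of how informative a training instance is.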
