Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pengcheng Zhu

E1 TTS: Simple and Fast Non-Autoregressive TTS

Sep 14, 2024

Zhijun Liu, Shuai Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

Figure 1 for E1 TTS: Simple and Fast Non-Autoregressive TTS

Figure 2 for E1 TTS: Simple and Fast Non-Autoregressive TTS

Figure 3 for E1 TTS: Simple and Fast Non-Autoregressive TTS

Figure 4 for E1 TTS: Simple and Fast Non-Autoregressive TTS

Abstract:This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. The training of E1 TTS is straightforward; it does not require explicit monotonic alignment between the text and audio pairs. The inference of E1 TTS is efficient, requiring only one neural network evaluation for each utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at http://e1tts.github.io/ .

Via

Access Paper or Ask Questions

Mini-Slot-Assisted Short Packet URLLC:Differential or Coherent Detection?

Aug 26, 2024

Canjian Zheng, Fu-Chun Zheng, Jingjing Luo, Pengcheng Zhu, Xiaohu You, Daquan Feng

Figure 1 for Mini-Slot-Assisted Short Packet URLLC:Differential or Coherent Detection?

Figure 2 for Mini-Slot-Assisted Short Packet URLLC:Differential or Coherent Detection?

Figure 3 for Mini-Slot-Assisted Short Packet URLLC:Differential or Coherent Detection?

Figure 4 for Mini-Slot-Assisted Short Packet URLLC:Differential or Coherent Detection?

Abstract:One of the primary challenges in short packet ultra-reliable and low-latency communications (URLLC) is to achieve reliable channel estimation and data detection while minimizing the impact on latency performance. Given the small packet size in mini-slot-assisted URLLC, relying solely on pilot-based coherent detection is almost impossible to meet the seemingly contradictory requirements of high channel estimation accuracy, high reliability, low training overhead, and low latency. In this paper, we explore differential modulation both in the frequency domain and in the time domain, and propose adopting an adaptive approach that integrates both differential and coherent detection to achieve mini-slot-assisted short packet URLLC, striking a balance among training overhead, system performance, and computational complexity. Specifically, differential (especially in the frequency domain) and coherent detection schemes can be dynamically activated based on application scenarios, channel statistics, information payloads, mini-slot deployment options, and service requirements. Furthermore, we derive the block error rate (BLER) for pilot-based, frequency domain, and time domain differential OFDM using non-asymptotic information-theoretic bounds. Simulation results validate the feasibility and effectiveness of adaptive differential and coherent detection.

* 14 pages, 8 figures, journal

Via

Access Paper or Ask Questions

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Jun 12, 2024

Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Figure 1 for DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Figure 2 for DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Figure 3 for DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Figure 4 for DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Abstract:Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of automatic speech recognition (ASR) model with short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens to guide the training of the content encoder, the dependency on ASR is removed and the model can operate under extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves comparable performance to DualVC 2 in subjective and objective metrics, with a latency of only 50 ms.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization

May 10, 2024

Pengcheng Zhu, Yaoming Zhuang, Baoquan Chen, Li Li, Chengdong Wu, Zhanlin Liu

Figure 1 for MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization

Figure 2 for MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization

Figure 3 for MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization

Figure 4 for MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization

Abstract:This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently Gaussian Splatting-based SLAM has yielded promising results, but rely on RGB-D input and is weak in tracking. To address these limitations, we uniquely integrates advanced sparse visual odometry with a dense Gaussian Splatting scene representation for the first time, thereby eliminating the dependency on depth maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking robustness. Here, the sparse visual odometry tracks camera poses in RGB stream, while Gaussian Splatting handles map reconstruction. These components are interconnected through a Multi-View Stereo (MVS) depth estimation network. And we propose a depth smooth loss to reduce the negative effect of estimated depth maps. Furthermore, the consistency in scale between the sparse visual odometry and the dense Gaussian map is preserved by Sparse-Dense Adjustment Ring (SDAR). We have evaluated our system across various synthetic and real-world datasets. The accuracy of our pose estimation surpasses existing methods and achieves state-of-the-art performance. Additionally, it outperforms previous monocular methods in terms of novel view synthesis fidelity, matching the results of neural SLAM systems that utilize RGB-D input.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Trainable Joint Channel Estimation, Detection and Decoding for MIMO URLLC Systems

Apr 11, 2024

Yi Sun, Hong Shen, Bingqing Li, Wei Xu, Pengcheng Zhu, Nan Hu, Chunming Zhao

Abstract:The receiver design for multi-input multi-output (MIMO) ultra-reliable and low-latency communication (URLLC) systems can be a tough task due to the use of short channel codes and few pilot symbols. Consequently, error propagation can occur in traditional turbo receivers, leading to performance degradation. Moreover, the processing delay induced by information exchange between different modules may also be undesirable for URLLC. To address the issues, we advocate to perform joint channel estimation, detection, and decoding (JCDD) for MIMO URLLC systems encoded by short low-density parity-check (LDPC) codes. Specifically, we develop two novel JCDD problem formulations based on the maximum a posteriori (MAP) criterion for Gaussian MIMO channels and sparse mmWave MIMO channels, respectively, which integrate the pilots, the bit-to-symbol mapping, the LDPC code constraints, as well as the channel statistical information. Both the challenging large-scale non-convex problems are then solved based on the alternating direction method of multipliers (ADMM) algorithms, where closed-form solutions are achieved in each ADMM iteration. Furthermore, two JCDD neural networks, called JCDDNet-G and JCDDNet-S, are built by unfolding the derived ADMM algorithms and introducing trainable parameters. It is interesting to find via simulations that the proposed trainable JCDD receivers can outperform the turbo receivers with affordable computational complexities.

* 17 pages, 12 figures, accepted by IEEE Transactions on Wireless Communications

Via

Access Paper or Ask Questions

Accent-VITS:accent transfer for end-to-end TTS

Dec 29, 2023

Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie

Figure 1 for Accent-VITS:accent transfer for end-to-end TTS

Figure 2 for Accent-VITS:accent transfer for end-to-end TTS

Figure 3 for Accent-VITS:accent transfer for end-to-end TTS

Abstract:Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS.Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer.We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints.Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes be more stable and effective.Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness as compared with a strong baseline.

* Accepted by NCMMSC2023

Via

Access Paper or Ask Questions

Passive Integrated Sensing and Communication Scheme based on RF Fingerprint Information Extraction for Cell-Free RAN

Nov 10, 2023

Jingxuan Yu, Fan Zeng, Jiamin Li, Feiyang Liu, Pengcheng Zhu, Dongming Wang, Xiaohu You

Abstract:This paper investigates how to achieve integrated sensing and communication (ISAC) based on a cell-free radio access network (CF-RAN) architecture with a minimum footprint of communication resources. We propose a new passive sensing scheme. The scheme is based on the radio frequency (RF) fingerprint learning of the RF radio unit (RRU) to build an RF fingerprint library of RRUs. The source RRU is identified by comparing the RF fingerprints carried by the signal at the receiver side. The receiver extracts the channel parameters from the signal and estimates the channel environment, thus locating the reflectors in the environment. The proposed scheme can effectively solve the problem of interference between signals in the same time-frequency domain but in different spatial domains when multiple RRUs jointly serve users in CF-RAN architecture. Simulation results show that the proposed passive ISAC scheme can effectively detect reflector location information in the environment without degrading the communication performance.

* 11 pages, 6 figures, submitted on 28-Feb-2023, China Communication, Accepted on 14-Sep-2023

Via

Access Paper or Ask Questions

DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

Sep 27, 2023

Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Shuai Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Figure 1 for DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

Figure 2 for DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

Figure 3 for DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

Figure 4 for DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

Abstract:Voice conversion is becoming increasingly popular, and a growing number of application scenarios require models with streaming inference capabilities. The recently proposed DualVC attempts to achieve this objective through streaming model architecture design and intra-model knowledge distillation along with hybrid predictive coding to compensate for the lack of future information. However, DualVC encounters several problems that limit its performance. First, the autoregressive decoder has error accumulation in its nature and limits the inference speed as well. Second, the causal convolution enables streaming capability but cannot sufficiently use future information within chunks. Third, the model is unable to effectively address the noise in the unvoiced segments, lowering the sound quality. In this paper, we propose DualVC 2 to address these issues. Specifically, the model backbone is migrated to a Conformer-based architecture, empowering parallel inference. Causal convolution is replaced by non-causal convolution with dynamic chunk mask to make better use of within-chunk future information. Also, quiet attention is introduced to enhance the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC and other baseline systems in both subjective and objective metrics, with only 186.4 ms latency. Our audio samples are made publicly available.

Via

Access Paper or Ask Questions

Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

Aug 31, 2023

Heyang Xue, Shuai Guo, Pengcheng Zhu, Mengxiao Bi

Figure 1 for Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

Figure 2 for Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

Abstract:Despite imperfect score-matching causing drift in training and sampling distributions of diffusion models, recent advances in diffusion-based acoustic models have revolutionized data-sufficient single-speaker Text-to-Speech (TTS) approaches, with Grad-TTS being a prime example. However, the sampling drift problem leads to these approaches struggling in multi-speaker scenarios in practice due to more complex target data distribution compared to single-speaker scenarios. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic models which introduces the Consistent Diffusion Model (CDM) as a generative modeling approach. We enforce the consistency property of CDM during the training process to alleviate the sampling drift problem in the inference stage, resulting in significant improvements in multi-speaker TTS performance. Our experimental results corroborate that our proposed approach can improve the performance of different speakers involved in multi-speaker TTS compared to Grad-TTS, even outperforming the fine-tuning approach. Audio samples are available at https://welkinyang.github.io/multi-gradspeech/

Via

Access Paper or Ask Questions

Grouping Method for mmWave Massive MIMO System: Exploitation of Angular Multiplexing Gain

May 25, 2023

Peng Jiang, Pengcheng Zhu, Jiamin Li, Dongming Wang

Figure 1 for Grouping Method for mmWave Massive MIMO System: Exploitation of Angular Multiplexing Gain

Figure 2 for Grouping Method for mmWave Massive MIMO System: Exploitation of Angular Multiplexing Gain

Figure 3 for Grouping Method for mmWave Massive MIMO System: Exploitation of Angular Multiplexing Gain

Figure 4 for Grouping Method for mmWave Massive MIMO System: Exploitation of Angular Multiplexing Gain

Abstract:A future millimeter-wave (mmWave) massive multiple-input and multiple-output (MIMO) system may serve hundreds or thousands of users at the same time; thus, research on multiple access technology is particularly important.Moreover, due to the short-wavelength nature of a mmWave, large-scale arrays are easier to implement than microwaves, while their directivity and sparseness make the physical beamforming effect of precoding more prominent.In consideration of the mmWave angle division multiple access (ADMA) system based on precoding, this paper investigates the influence of the angle distribution on system performance, which is denoted as the angular multiplexing gain.Furthermore, inspired by the above research, we transform the ADMA user grouping problem to maximize the system sum-rate into the inter-user angular spacing equalization problem.Then, the form of the optimal solution for the approximate problem is derived, and the corresponding grouping algorithm is proposed.The simulation results demonstrate that the proposed algorithm performs better than the comparison methods.Finally, a complexity analysis also shows that the proposed algorithm has extremely low complexity.

* 12 pages,16 figures

Via

Access Paper or Ask Questions