Sheng Li

University of Pittsburgh

Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation

Apr 11, 2025

Haowei Lou, Hye-young Paik, Sheng Li, Wen Hu, Lina Yao

Abstract:Text-to-Speech (TTS) models can generate natural, human-like speech across multiple languages by transforming phonemes into waveforms. However, multilingual TTS remains challenging due to discrepancies in phoneme vocabularies and variations in prosody and speaking style across languages. Existing approaches either train separate models for each language, which achieve high performance at the cost of increased computational resources, or use a unified model for multiple languages that struggles to capture fine-grained, language-specific style variations. In this work, we propose LanStyleTTS, a non-autoregressive, language-aware style adaptive TTS framework that standardizes phoneme representations and enables fine-grained, phoneme-level style control across languages. This design supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models. We evaluate LanStyleTTS by integrating it with several state-of-the-art non-autoregressive TTS architectures. Results show consistent performance improvements across different model backbones. Furthermore, we investigate a range of acoustic feature representations, including mel-spectrograms and autoencoder-derived latent features. Our experiments demonstrate that latent encodings can significantly reduce model size and computational cost while preserving high-quality speech generation.

Bridging Knowledge Gap Between Image Inpainting and Large-Area Visible Watermark Removal

Apr 07, 2025

Yicheng Leng, Chaowei Fang, Junye Chen, Yixiang Fang, Sheng Li, Guanbin Li

Abstract:Visible watermark removal, which involves watermark cleaning and background content restoration, is pivotal for evaluating the resilience of watermarks. Existing deep neural network (DNN)-based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information of the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, to relieve the dependence on high-quality watermark masks, we introduce a new training paradigm that uses coarse watermark masks to guide the inference process. This yields a visible watermark removal model that is insensitive to the quality of watermark masks during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.

* To be published in AAAI 2025

Hypothesis Testing for Progressive Kernel Estimation and VCM Framework

Apr 06, 2025

Zehui Lin, Chenxiao Hu, Jinzhu Jia, Sheng Li

Abstract:Identifying an appropriate radius for unbiased kernel estimation is crucial for the efficiency of radiance estimation. However, determining both the radius and unbiasedness remains challenging. In this paper, we first propose a statistical model of photon samples and associated contributions for progressive kernel estimation, under which the kernel estimation is unbiased if the null hypothesis of this statistical model holds. Then, we present a method to decide whether to reject the null hypothesis about the statistical population (i.e., photon samples) by the F-test in the Analysis of Variance. Based on this, we implement a progressive photon mapping (PPM) algorithm, wherein the kernel radius is determined by this hypothesis test for unbiased radiance estimation. Second, we propose VCM+, a reinforcement of Vertex Connection and Merging (VCM), and derive its theoretically unbiased formulation. VCM+ combines hypothesis testing-based PPM with bidirectional path tracing (BDPT) via multiple importance sampling (MIS), wherein our kernel radius can leverage the contributions from PPM and BDPT. We test our new algorithms, improved PPM and VCM+, on diverse scenarios with different lighting settings. The experimental results demonstrate that our method can alleviate light leaks and visual blur artifacts of prior radiance estimation algorithms. We also evaluate the asymptotic performance of our approach and observe an overall improvement over the baseline in all testing scenarios.

* This paper has been published in IEEE Transactions on Visualization and Computer Graphics. This version is a preprint.
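
As a rough illustration of the F-test named in the abstract above, the following minimal sketch computes the textbook one-way ANOVA F statistic for a few groups of samples. The function name `anova_f_statistic`, the grouping of samples, and the numbers are assumptions made for illustration only; the paper's statistical model of photon samples and its radius update rule are not reproduced here.

```python
def anova_f_statistic(groups):
    """Textbook one-way ANOVA F statistic for a list of sample groups."""
    k = len(groups)                                  # number of groups
    n_total = sum(len(g) for g in groups)            # total sample count
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    # Ratio of mean squares with (k - 1) and (n_total - k) degrees of freedom.
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical sample groups, chosen only to make the arithmetic easy to follow.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = anova_f_statistic(groups)
print(f_value)  # about 13 for these groups
```

In a standard F-test, the null hypothesis is rejected when the statistic exceeds the critical value of the F distribution for the chosen significance level and degrees of freedom. How that decision maps to a kernel radius is specific to the paper and is not sketched here.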

Visual Acuity Consistent Foveated Rendering towards Retinal Resolution

Mar 30, 2025

Zhi Zhang, Meng Gai, Sheng Li

Abstract:Prior foveated rendering methods often suffer from a limitation where the shading load escalates with increasing display resolution, leading to decreased efficiency, particularly at retinal-level resolutions. To tackle this challenge, we start from how the human visual system (HVS) perceives detail and present visual acuity-consistent foveated rendering (VaFR), aiming to achieve exceptional rendering performance at retinal-level resolutions. Specifically, we propose a method with a novel log-polar mapping function derived from the human visual acuity model, which accommodates the natural bandwidth of the visual system. This mapping function and its associated shading rate guarantee a consistent output of rendering information, regardless of variations in the display resolution of the VR HMD. Consequently, our VaFR outperforms alternative methods, improving rendering speed while preserving perceptual visual quality, particularly when operating at retinal resolutions. We validate our approach using both the rasterization and ray-casting rendering pipelines, as well as different binocular rendering strategies for HMD devices. In diverse testing scenarios, our approach delivers better perceptual visual quality than prior foveated rendering while achieving a speedup of 6.5× to 9.29× for deferred rendering of 3D scenarios and an even larger speedup of 10.4× to 16.4× for ray-casting at retinal resolution. Additionally, our approach significantly enhances the rendering performance of binocular 8K path tracing, achieving smooth frame rates.

Unified Dense Prediction of Video Diffusion

Mar 12, 2025

Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang

Abstract:We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We use a colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves video generation's consistency and motion smoothness without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, or depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state-of-the-art in terms of video quality, consistency, and motion smoothness.

* Accepted by CVPR2025

Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English

Mar 06, 2025

Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.

* ARR Under Review, First two authors contribute equally

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

Jan 29, 2025

Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu

Abstract:We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
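
To make the idea of a hierarchical Softmax concrete, here is a minimal two-level sketch in which a token's probability is the product of a cluster probability and a within-cluster probability. The cluster assignments, weights, and the function name `two_level_softmax` are hypothetical; the sketch does not implement the paper's cross-lingual embedding clustering or its decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_level_softmax(hidden, cluster_weights, token_weights, clusters):
    """Return {token_id: P(token)} with P(token) = P(cluster) * P(token | cluster).

    clusters is a list of token-id lists; tokens in the same cluster share
    the cluster-level score and compete only with each other at level two.
    """
    p_cluster = softmax([dot(w, hidden) for w in cluster_weights])
    probs = {}
    for c, members in enumerate(clusters):
        p_within = softmax([dot(token_weights[t], hidden) for t in members])
        for t, p in zip(members, p_within):
            probs[t] = p_cluster[c] * p
    return probs


# Hypothetical toy setup: 6 tokens grouped into 3 clusters, 2-dim hidden state.
clusters = [[0, 1, 2], [3, 4], [5]]
hidden = [0.5, -1.0]
cluster_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
token_weights = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4],
                 [0.0, 0.5], [0.7, 0.1], [0.2, 0.2]]

probs = two_level_softmax(hidden, cluster_weights, token_weights, clusters)
print(sum(probs.values()))  # the token probabilities sum to 1 up to rounding
```

Because each level is a proper distribution, the product over all tokens still sums to one, and only one cluster's tokens need to be scored when evaluating a single target token.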

Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Jan 13, 2025

Jiliang Hu, Zuchao Li, Mengjia Shen, Haojun Ai, Sheng Li, Jun Zhang

Abstract:Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works on SLU that treat it as a sequence-to-sequence task have achieved great success. However, this method is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end, span-based SLU model that can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.

* 5 pages, 2 figures, accepted by ICASSP 2025

Exploring Depth Information for Detecting Manipulated Face Videos

Nov 27, 2024

Haoyue Wang, Sheng Li, Ji He, Zhenxing Qian, Xinpeng Zhang, Shaolin Fan

Abstract:Face manipulation detection has been receiving a lot of attention because of its importance for the reliability and security of face images and videos. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which has been shown to be promising. The face depth map, an important face feature that has been shown to be effective in other areas such as face recognition or face detection, has unfortunately received little attention in the literature on face manipulation detection. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information for robust face manipulation detection. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomalies created by manipulation. The estimated face depth map is then treated as auxiliary information and integrated with the backbone features using a newly designed Multi-head Depth Attention (MDA) mechanism. We also propose an RGB-Depth Inconsistency Attention (RDIA) module to effectively capture the inter-frame inconsistency for multi-frame input. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.

* 12 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:2212.14230

A Survey of Deep Graph Learning under Distribution Shifts: from Graph Out-of-Distribution Generalization to Adaptation

Oct 25, 2024

Kexin Zhang, Shuhan Liu, Song Wang, Weili Shi, Chen Chen, Pan Li, Sheng Li, Jundong Li, Kaize Ding

Abstract:Distribution shifts on graphs -- the discrepancies in data distribution between training and deploying a graph machine learning model -- are ubiquitous and often unavoidable in real-world scenarios. These shifts may severely deteriorate model performance, posing significant challenges for reliable graph machine learning. Consequently, there has been a surge in research on graph machine learning under distribution shifts, aiming to train models to achieve satisfactory performance on out-of-distribution (OOD) test data. In our survey, we provide an up-to-date and forward-looking review of deep graph learning under distribution shifts. Specifically, we cover three primary scenarios: graph OOD generalization, training-time graph OOD adaptation, and test-time graph OOD adaptation. We begin by formally defining the problems and discussing various types of distribution shifts that can affect graph learning, such as covariate shifts and concept shifts. To provide a better understanding of the literature, we systematically categorize the existing models based on our proposed taxonomy and examine the techniques they adopt. We also summarize commonly used datasets in this research area to facilitate further investigation. Finally, we point out promising research directions and the corresponding challenges to encourage further study in this vital domain. Additionally, we provide a continuously updated reading list at https://github.com/kaize0409/Awesome-Graph-OOD.

* 18 pages, 2 figures. arXiv admin note: text overlap with arXiv:2402.11153
