Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rui Liu

Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement

Feb 24, 2025

Rui Liu

Figure 1 for Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement

Figure 2 for Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement

Figure 3 for Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement

Figure 4 for Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement

Abstract:Real-world image matting is essential for applications in content creation and augmented reality. However, it remains challenging due to the complex nature of scenes and the scarcity of high-quality datasets. To address these limitations, we introduce Mask2Alpha, an iterative refinement framework designed to enhance semantic comprehension, instance awareness, and fine-detail recovery in image matting. Our framework leverages self-supervised Vision Transformer features as semantic priors, strengthening contextual understanding in complex scenarios. To further improve instance differentiation, we implement a mask-guided feature selection module, enabling precise targeting of objects in multi-instance settings. Additionally, a sparse convolution-based optimization scheme allows Mask2Alpha to recover high-resolution details through progressive refinement,from low-resolution semantic passes to high-resolution sparse reconstructions. Benchmarking across various real-world datasets, Mask2Alpha consistently achieves state-of-the-art results, showcasing its effectiveness in accurate and efficient image matting.

* 9pages

Via

Access Paper or Ask Questions

Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Jan 30, 2025

Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao

Figure 1 for Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Figure 2 for Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Figure 3 for Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Figure 4 for Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Abstract:Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model still can learn the harmful knowledge. To this end, we do further experiment and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model, can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up-to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://github.com/w-yibo/Panacea

Via

Access Paper or Ask Questions

Return of the Encoder: Maximizing Parameter Efficiency for SLMs

Jan 27, 2025

Mohamed Elfeki, Rui Liu, Chad Voegele

Abstract:The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to encoder-decoder's one-time input processing and efficient separation of understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers while preserving their architectural advantages, achieving up to 6 average performance points improvement across diverse tasks, with significant gains in asymmetric sequence tasks where input and output distributions can benefit from different processing approaches. When combined with modern advances like Rotary Positional Embeddings (RoPE) and Vision encoders, our systematic investigation demonstrates that encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.

* 13 pages, 5 figures. LLMs/SLMs, encoder-decoder and decoder-only

Via

Access Paper or Ask Questions

Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Jan 11, 2025

Rui Liu, Zhenqi Jia, Feilong Bao, Haizhou Li

Figure 1 for Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Figure 2 for Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Figure 3 for Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Figure 4 for Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Abstract:Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. Note that this knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in terms of both semantic and style. First, we build a stored dialogue semantic-style database (SDSSD) which includes the text and audio samples. Then, we design a multi-attribute retrieval scheme to match the dialogue semantic and style vectors of the CD with the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we propose adopting the multi-granularity graph structure to encode the dialogue and introducing a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge are fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted a comprehensive and in-depth experiment based on the DailyTalk dataset, which is a benchmarking dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: https://github.com/Coder-jzq/RADKA-CSS.

* Accepted by Information Fusion 2025

Via

Access Paper or Ask Questions

Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Jan 03, 2025

Rui Liu, Hongyu Yuan, Haizhou Li

Figure 1 for Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Figure 2 for Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Figure 3 for Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Figure 4 for Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Abstract:Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to simultaneously understand audio and visual, making the GER approach challenging to apply in AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of ``listening and seeing again''. Specifically, we first use the powerful AVSR system to read the audio and visual signals to get the N-Best hypotheses, and then use the Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into an audio and video compression representation respectively that can be understood by LLM. Afterward, the audio-visual compression representation and the N-Best hypothesis together constitute a Cross-modal Prompt to guide the LLM in producing the best transcription. In addition, we also proposed a Multi-Level Consistency Constraint training criterion, including logits-level, utterance-level and representations-level, to improve the correction accuracy while enhancing the interpretability of audio and visual compression representations. The experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems. The proposed AVGER can reduce the Word Error Rate (WER) by 24% compared to them. Code and models can be found at: https://github.com/CircleRedRain/AVGER.

Via

Access Paper or Ask Questions

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

Dec 25, 2024

Yuan Zhao, Rui Liu, Gaoxiang Cong

Abstract:Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at \textcolor[rgb]{0.93,0.0,0.47}{https://github.com/AI-S2-Lab/M2CI-Dubber}.

* Accepted by ICSSP 2025

Via

Access Paper or Ask Questions

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Dec 25, 2024

Zhenqi Jia, Rui Liu

Abstract:Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that text and speech modalities in MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take MDH and adopt trained interaction modules to fully infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.

* Accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Dec 17, 2024

Rui Liu, Shuwei He, Yifan Hu, Haizhou Li

Abstract:Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of an spatial image. However, local and depth image information are crucial for understanding the spatial environment, which previous works have ignored. To address the issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal aims to take both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale seeks to model the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation. The code and audio samples are available at: https://github.com/AI-S2-Lab/M2SE-VTTS.

* 9 pages,2 figures, Accepted by AAAI'2025

Via

Access Paper or Ask Questions

Analyst Reports and Stock Performance: Evidence from the Chinese Market

Nov 13, 2024

Rui Liu, Jiayou Liang, Haolong Chen, Yujia Hu

Figure 1 for Analyst Reports and Stock Performance: Evidence from the Chinese Market

Figure 2 for Analyst Reports and Stock Performance: Evidence from the Chinese Market

Figure 3 for Analyst Reports and Stock Performance: Evidence from the Chinese Market

Figure 4 for Analyst Reports and Stock Performance: Evidence from the Chinese Market

Abstract:This article applies natural language processing (NLP) to extract and quantify textual information to predict stock performance. Using an extensive dataset of Chinese analyst reports and employing a customized BERT deep learning model for Chinese text, this study categorizes the sentiment of the reports as positive, neutral, or negative. The findings underscore the predictive capacity of this sentiment indicator for stock volatility, excess returns, and trading volume. Specifically, analyst reports with strong positive sentiment will increase excess return and intraday volatility, and vice versa, reports with strong negative sentiment also increase volatility and trading volume, but decrease future excess return. The magnitude of this effect is greater for positive sentiment reports than for negative sentiment reports. This article contributes to the empirical literature on sentiment analysis and the response of the stock market to news in the Chinese stock market.

Via

Access Paper or Ask Questions

Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Nov 03, 2024

Shuo Tan, Rui Liu, XianLei Long, Kai Wan, Linqi Song, Yong Li

Figure 1 for Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Figure 2 for Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Figure 3 for Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Figure 4 for Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Abstract:Deploying Convolutional Neural Networks (CNNs) on resource-constrained devices necessitates efficient management of computational resources, often via distributed systems susceptible to latency from straggler nodes. This paper introduces the Flexible Coded Distributed Convolution Computing (FCDCC) framework to enhance fault tolerance and numerical stability in distributed CNNs. We extend Coded Distributed Computing (CDC) with Circulant and Rotation Matrix Embedding (CRME) which was originally proposed for matrix multiplication to high-dimensional tensor convolution. For the proposed scheme, referred to as Numerically Stable Coded Tensor Convolution (NSCTC) scheme, we also propose two new coded partitioning schemes: Adaptive-Padding Coded Partitioning (APCP) for input tensor and Kernel-Channel Coded Partitioning (KCCP) for filter tensor. These strategies enable linear decomposition of tensor convolutions and encoding them into CDC sub-tasks, combining model parallelism with coded redundancy for robust and efficient execution. Theoretical analysis identifies an optimal trade-off between communication and storage costs. Empirical results validate the framework's effectiveness in computational efficiency, fault tolerance, and scalability across various CNN architectures.

* 14 pages, 6 figures

Via

Access Paper or Ask Questions