Liang Li

Key Lab of Intell. Info. Process., Inst. of Comput. Tech., Chinese Academy of Sciences

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Dec 12, 2024

Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, Qingming Huang

Abstract: Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to maintain audio-visual sync and achieve clear pronunciation at the same time; (2) they lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while maintaining high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which learns the inherent consistency between lip motion and prosody variation through duration-level contrastive learning to achieve reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses the video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, the speaker identity adapting module decodes the acoustic prior and injects the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) synthesizes the waveform with a flow-matching prediction network conditioned on the acoustic prior. In this process, FUEC determines the gradient direction and guidance scale from the user's emotion instructions through a positive and negative guidance mechanism, which amplifies the desired emotion while suppressing others. Extensive experimental results on three benchmark datasets demonstrate favorable performance compared to several state-of-the-art methods.

* Under review
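
The positive and negative guidance idea in the EmoDubber abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `guided_velocity`, the scales `alpha` and `beta`, and the additive combination rule are all assumptions made for illustration, and the gradient vectors are taken as given inputs rather than computed by an emotion classifier.

```python
def guided_velocity(base, pos_grad, neg_grads, alpha, beta):
    """Combine a base velocity with emotion guidance terms (illustrative only).

    base      -- velocity predicted by a flow-matching network (list of floats)
    pos_grad  -- gradient pointing toward the desired emotion (assumed given)
    neg_grads -- list of gradients pointing toward emotions to suppress
    alpha     -- hypothetical scale for the positive guidance
    beta      -- hypothetical scale for the negative guidance
    """
    guided = []
    for i, v in enumerate(base):
        push = alpha * pos_grad[i]                   # amplify the desired emotion
        pull = beta * sum(g[i] for g in neg_grads)   # suppress the other emotions
        guided.append(v + push - pull)
    return guided


velocity = guided_velocity(
    base=[0.0, 1.0],
    pos_grad=[1.0, 0.0],
    neg_grads=[[0.0, 1.0], [0.0, 1.0]],
    alpha=2.0,
    beta=0.5,
)
```

In this sketch, raising `alpha` strengthens the target emotion and raising `beta` strengthens the suppression of the others, which is one plausible reading of "emotional intensity" as a user-controlled scale.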

Multi-robot autonomous 3D reconstruction using Gaussian splatting with Semantic guidance

Dec 03, 2024

Jing Zeng, Qi Ye, Tianle Liu, Yang Xu, Jin Li, Jinming Xu, Liang Li, Jiming Chen

Abstract: Implicit neural representations and 3D Gaussian splatting (3DGS) have shown great potential for scene reconstruction. Recent studies have expanded their applications in autonomous reconstruction through task assignment methods. However, these methods are mainly limited to a single robot, and rapid reconstruction of large-scale scenes remains challenging. Additionally, task-driven planning based on surface uncertainty is prone to being trapped in local optima. To this end, we propose the first 3DGS-based centralized multi-robot autonomous 3D reconstruction framework. To further reduce the time cost of task generation and improve reconstruction quality, we integrate online open-vocabulary semantic segmentation with the surface uncertainty of 3DGS, focusing view sampling on regions with high instance uncertainty. Finally, we develop a multi-robot collaboration strategy with mode and task assignments that improves reconstruction quality while ensuring planning efficiency. Our method demonstrates the highest reconstruction quality among all planning methods and superior planning efficiency compared to existing multi-robot methods. We deploy our method on multiple robots, and results show that it can effectively plan view paths and reconstruct scenes with high quality.

Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks

Nov 27, 2024

Zuguang Li, Shaohua Wu, Liang Li, Songge Zhang

Abstract: In this letter, we propose an energy-efficient split learning (SL) framework for fine-tuning large language models (LLMs) using geo-distributed personal data at the network edge, where LLMs are split and trained alternately across massive mobile devices and an edge server. Considering the device heterogeneity and channel dynamics in edge networks, a Cut lAyer and computing Resource Decision (CARD) algorithm is developed to minimize training delay and energy consumption. Simulation results demonstrate that, compared to the benchmarks, the proposed approach reduces the average training delay and the server's energy consumption by 70.8% and 53.1%, respectively.

* 5 pages, 6 figures

Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation

Nov 25, 2024

Peihua Deng, Jiehua Zhang, Xichun Sheng, Chenggang Yan, Yaoqi Sun, Ying Fu, Liang Li

Abstract: This paper explores the Class-Incremental Source-Free Unsupervised Domain Adaptation (CI-SFUDA) problem, where the unlabeled target data arrive incrementally without access to labeled source instances. This problem poses two challenges: the disturbance of similar source-class knowledge to target-class representation learning, and the disturbance of new target knowledge to old target knowledge. To address them, we propose the Multi-Granularity Class Prototype Topology Distillation (GROTO) algorithm, which effectively transfers the source knowledge to the unlabeled class-incremental target domain. Concretely, we design a multi-granularity class prototype self-organization module and a prototype topology distillation module. In the first module, the positive classes are mined by modeling two accumulation distributions; we then generate reliable pseudo-labels by introducing multi-granularity class prototypes, and use them to promote the positive-class target feature self-organization. In the second module, the positive-class prototypes are leveraged to construct the topological structures of the source and target feature spaces; we then perform topology distillation to continually mitigate the interference of new target knowledge with old knowledge. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on three public datasets.

* 10 pages, 6 figures

Chanel-Orderer: A Channel-Ordering Predictor for Tri-Channel Natural Images

Nov 20, 2024

Shen Li, Lei Jiang, Wei Wang, Hongwei Hu, Liang Li

Abstract: This paper presents a proof of concept that, given typical 3-channel images in a randomly permuted channel order, a model (termed Chanel-Orderer) with ad-hoc inductive biases in both architecture and loss functions can accurately predict the channel ordering and correct it. Specifically, Chanel-Orderer learns to score each of the three channels using priors of object semantics and uses the resulting scores to predict the channel ordering. This is useful in a typical scenario where an RGB image is mis-displayed in the BGR format and needs to be corrected into the right order. Furthermore, as a byproduct, Chanel-Orderer is able to tell whether a given image is near-gray-scale (near-monochromatic) or not (polychromatic). Our research suggests that Chanel-Orderer mimics human visual coloring of our physical natural world.
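
The score-then-order step described in the Chanel-Orderer abstract can be sketched as follows. This is only an illustration under stated assumptions: the per-channel scores are supplied by hand here (in the paper they come from a learned model), the function names `predict_order` and `restore_rgb` are hypothetical, and the convention that the highest score maps to R and the lowest to B is assumed for the example rather than taken from the paper.

```python
def predict_order(scores):
    """Return channel indices sorted by descending score.

    Assumption for this sketch: the highest-scoring channel is treated
    as R, the middle one as G, and the lowest-scoring one as B.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def restore_rgb(pixel, scores):
    """Reorder one permuted 3-channel pixel into the predicted R, G, B order."""
    order = predict_order(scores)
    return tuple(pixel[i] for i in order)


# A pixel stored in an unknown channel order, with made-up scores.
permuted_pixel = (30, 10, 20)
channel_scores = [0.1, 0.9, 0.5]
restored = restore_rgb(permuted_pixel, channel_scores)
```

Because the ordering is derived from three scalar scores rather than from a 6-way classification over permutations, the same scores could in principle also signal near-monochromatic images when they are nearly equal, though how the paper does this is not stated in the abstract.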

TexPro: Text-guided PBR Texturing with Procedural Material Modeling

Oct 21, 2024

Ziqiang Dang, Wenqi Dong, Zesong Yang, Bangbang Yang, Liang Li, Yuewen Ma, Zhaopeng Cui

Abstract: In this paper, we present TexPro, a novel method for high-fidelity material generation for input 3D meshes given text prompts. Unlike existing text-conditioned texture generation methods that typically generate RGB textures with baked lighting, TexPro is able to produce diverse texture maps via procedural material modeling, which enables physically based rendering, relighting, and additional benefits inherent to procedural materials. Specifically, we first generate multi-view reference images from the input text prompt using a recent text-to-image model. We then derive texture maps through a rendering-based optimization with recent differentiable procedural materials. To this end, we design several techniques to handle the misalignment between the generated multi-view images and 3D meshes, and introduce a novel material agent that enhances material classification and matching by exploring both part-level understanding and object-aware material reasoning. Experiments demonstrate the superiority of the proposed method over existing state-of-the-art methods and its capability of relighting.

* In submission. Supplementary material is included at the end of the main paper (5 pages, 2 figures)

A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration

Oct 14, 2024

Renlang Huang, Yufan Tang, Jiming Chen, Liang Li

Abstract: Deep learning-based feature matching has shown great superiority for point cloud registration in the absence of pose priors. Although coarse-to-fine matching approaches are prevalent, the coarse matching of existing methods is typically sparse and loose without consideration of geometric consistency, which makes the subsequent fine matching rely on ineffective optimal transport and hypothesis-and-selection methods for consistency. Therefore, these methods are neither efficient nor scalable for real-time applications such as odometry in robotics. To address these issues, we design a consistency-aware spot-guided Transformer (CAST), which incorporates a spot-guided cross-attention module to avoid interfering with irrelevant areas, and a consistency-aware self-attention module to enhance matching capabilities with geometrically consistent correspondences. Furthermore, a lightweight fine matching module for both sparse keypoints and dense features can estimate the transformation accurately. Extensive experiments on both outdoor LiDAR point cloud datasets and indoor RGBD point cloud datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness.

* Accepted by NeurIPS 2024 as poster

CalliffusionV2: Personalized Natural Calligraphy Generation with Flexible Multi-modal Control

Oct 03, 2024

Qisheng Liao, Liang Li, Yulang Fei, Gus Xia

Abstract: In this paper, we introduce CalliffusionV2, a novel system designed to produce natural Chinese calligraphy with flexible multi-modal control. Unlike previous approaches that rely solely on image or text inputs and lack fine-grained control, our system leverages both images to guide generations at fine-grained levels and natural language texts to describe the features of generations. CalliffusionV2 excels at creating a broad range of characters and can quickly learn new styles through a few-shot learning approach. It is also capable of generating non-Chinese characters without prior training. Comprehensive tests confirm that our system produces calligraphy that is both stylistically accurate and recognizable by neural network classifiers and human evaluators.

* 11 pages, 7 figures

Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Sep 17, 2024

Zeheng Wang, Fangzhou Wang, Liang Li, Zirui Wang, Timothy van der Laan, Ross C. C. Leon, Jing-Kai Huang, Muhammad Usman

Abstract: This paper pioneers the use of quantum machine learning (QML) for modeling the Ohmic contact process in GaN high-electron-mobility transistors (HEMTs). Using data from 159 devices and variational auto-encoder-based augmentation, we developed a quantum kernel-based regressor (QKR) with a 2-level ZZ-feature map. Benchmarked against six classical machine learning (CML) models, our QKR consistently demonstrated the lowest mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). Repeated statistical analysis confirmed its robustness. Additionally, experiments verified an MAE of 0.314 ohm-mm, underscoring the QKR's superior performance over traditional CML methods and its potential for semiconductor applications.

* This is the manuscript in the conference version. An expanded version for the journal will be released later and more information will be added. The author list, content, conclusion, and figures may change due to further research
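
For readers unfamiliar with the three error metrics named in the abstract above, the following minimal sketch shows how MAE, MSE, and RMSE are computed from the same residuals. The function name `regression_errors` and the sample values are illustrative assumptions and are not data from the paper.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error
    return mae, mse, rmse


# Made-up measurements and predictions, for illustration only.
mae, mse, rmse = regression_errors([0.0, 0.0], [3.0, -3.0])
```

MAE is reported in the same unit as the target (ohm-mm in the abstract), as is RMSE, while MSE is in squared units and penalizes large residuals more heavily.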

Generating High-quality Symbolic Music Using Fine-grained Discriminators

Aug 03, 2024

Zhedong Zhang, Liang Li, Jiehua Zhang, Zhenghui Hu, Hongkui Wang, Chenggang Yan, Jian Yang, Yuankai Qi

Abstract: Existing symbolic music generation methods usually use a discriminator to improve the quality of generated music via global perception of music. However, considering the complexity of information in music, such as rhythm and melody, a single discriminator cannot fully reflect the differences in these two primary dimensions of music. In this work, we propose to decouple the melody and rhythm from music, and design corresponding fine-grained discriminators to tackle the aforementioned issues. Specifically, equipped with a pitch augmentation strategy, the melody discriminator discerns the melody variations presented by the generated samples. By contrast, the rhythm discriminator, enhanced with bar-level relative positional encoding, focuses on the velocity of generated notes. Such a design allows the generator to be more explicitly aware of which aspects should be adjusted in the generated music, making it easier to mimic human-composed music. Experimental results on the POP909 benchmark demonstrate the favorable performance of the proposed method compared to several state-of-the-art methods in terms of both objective and subjective metrics.

* Accepted by ICPR2024
