Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shuguang Cui

Sherman

GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Dec 12, 2023

Haiming Zhang, Zhihao Yuan, Chaoda Zheng, Xu Yan, Baoyuan Wang, Guanbin Li, Song Wu, Shuguang Cui, Zhen Li

Figure 1 for GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Figure 2 for GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Figure 3 for GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Figure 4 for GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Abstract:Although existing speech-driven talking face generation methods achieve significant progress, they are far from real-world application due to the avatar-specific training demand and unstable lip movements. To address the above issues, we propose the GSmoothFace, a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model, which can synthesize smooth lip dynamics while preserving the speaker's identity. Our proposed GSmoothFace model mainly consists of the Audio to Expression Prediction (A2EP) module and the Target Adaptive Face Translation (TAFT) module. Specifically, we first develop the A2EP module to predict expression parameters synchronized with the driven speech. It uses a transformer to capture the long-term audio context and learns the parameters from the fine-grained 3D facial vertices, resulting in accurate and smooth lip-synchronization performance. Afterward, the well-designed TAFT module, empowered by Morphology Augmented Face Blending (MAFB), takes the predicted expression parameters and target video as inputs to modify the facial region of the target video without distorting the background content. The TAFT effectively exploits the identity appearance and background context in the target video, which makes it possible to generalize to different speakers without retraining. Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality. See the project page for code, data, and request pre-trained models: https://zhanghm1995.github.io/GSmoothFace.

Via

Access Paper or Ask Questions

MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Dec 05, 2023

Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang(+2 more)

Figure 1 for MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Figure 2 for MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Figure 3 for MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Figure 4 for MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Abstract:In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.

* Project page: https://x-zhangyang.github.io/MVHumanNet/

Via

Access Paper or Ask Questions

Learning for Semantic Knowledge Base-Guided Online Feature Transmission in Dynamic Channels

Nov 30, 2023

Xiangyu Gao, Yaping Sun, Dongyu Wei, Xiaodong Xu, Hao Chen, Hao Yin, Shuguang Cui

Figure 1 for Learning for Semantic Knowledge Base-Guided Online Feature Transmission in Dynamic Channels

Figure 2 for Learning for Semantic Knowledge Base-Guided Online Feature Transmission in Dynamic Channels

Figure 3 for Learning for Semantic Knowledge Base-Guided Online Feature Transmission in Dynamic Channels

Abstract:With the proliferation of edge computing, efficient AI inference on edge devices has become essential for intelligent applications such as autonomous vehicles and VR/AR. In this context, we address the problem of efficient remote object recognition by optimizing feature transmission between mobile devices and edge servers. We propose an online optimization framework to address the challenge of dynamic channel conditions and device mobility in an end-to-end communication system. Our approach builds upon existing methods by leveraging a semantic knowledge base to drive multi-level feature transmission, accounting for temporal factors and dynamic elements throughout the transmission process. To solve the online optimization problem, we design a novel soft actor-critic-based deep reinforcement learning system with a carefully designed reward function for real-time decision-making, overcoming the optimization difficulty of the NP-hard problem and achieving the minimization of semantic loss while respecting latency constraints. Numerical results showcase the superiority of our approach compared to traditional greedy methods under various system setups.

* 6 pages

Via

Access Paper or Ask Questions

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Nov 26, 2023

Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li

Figure 1 for Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Figure 2 for Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Figure 3 for Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Figure 4 for Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Abstract:3D Visual Grounding (3DVG) aims at localizing 3D object based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.

* Under review, project website: https://curryyuan.github.io/ZSVG3D/

Via

Access Paper or Ask Questions

Knowledge Base Enabled Semantic Communication: A Generative Perspective

Nov 21, 2023

Jinke Ren, Zezhong Zhang, Jie Xu, Guanying Chen, Yaping Sun, Ping Zhang, Shuguang Cui

Figure 1 for Knowledge Base Enabled Semantic Communication: A Generative Perspective

Figure 2 for Knowledge Base Enabled Semantic Communication: A Generative Perspective

Figure 3 for Knowledge Base Enabled Semantic Communication: A Generative Perspective

Figure 4 for Knowledge Base Enabled Semantic Communication: A Generative Perspective

Abstract:Semantic communication is widely touted as a key technology for propelling the sixth-generation (6G) wireless networks. However, providing effective semantic representation is quite challenging in practice. To address this issue, this article takes a crack at exploiting semantic knowledge base (KB) to usher in a new era of generative semantic communication. Via semantic KB, source messages can be characterized in low-dimensional subspaces without compromising their desired meaning, thus significantly enhancing the communication efficiency. The fundamental principle of semantic KB is first introduced, and a generative semantic communication architecture is developed by presenting three sub-KBs, namely source, task, and channel KBs. Then, the detailed construction approaches for each sub-KB are described, followed by their utilization in terms of semantic coding and transmission. A case study is also provided to showcase the superiority of generative semantic communication over conventional syntactic communication and classical semantic communication. In a nutshell, this article establishes a scientific foundation for the exciting uncharted frontier of generative semantic communication.

* Submitted to IEEE for possible publication

Via

Access Paper or Ask Questions

Integrating Sensing, Communication, and Power Transfer: Multiuser Beamforming Design

Nov 15, 2023

Ziqin Zhou, Xiaoyang Li, Guangxu Zhu, Jie Xu, Kaibin Huang, Shuguang Cui

Figure 1 for Integrating Sensing, Communication, and Power Transfer: Multiuser Beamforming Design

Figure 2 for Integrating Sensing, Communication, and Power Transfer: Multiuser Beamforming Design

Figure 3 for Integrating Sensing, Communication, and Power Transfer: Multiuser Beamforming Design

Figure 4 for Integrating Sensing, Communication, and Power Transfer: Multiuser Beamforming Design

Abstract:In the sixth-generation (6G) networks, massive low-power devices are expected to sense environment and deliver tremendous data. To enhance the radio resource efficiency, the integrated sensing and communication (ISAC) technique exploits the sensing and communication functionalities of signals, while the simultaneous wireless information and power transfer (SWIPT) techniques utilizes the same signals as the carriers for both information and power delivery. The further combination of ISAC and SWIPT leads to the advanced technology namely integrated sensing, communication, and power transfer (ISCPT). In this paper, a multi-user multiple-input multiple-output (MIMO) ISCPT system is considered, where a base station equipped with multiple antennas transmits messages to multiple information receivers (IRs), transfers power to multiple energy receivers (ERs), and senses a target simultaneously. The sensing target can be regarded as a point or an extended surface. When the locations of IRs and ERs are separated, the MIMO beamforming designs are optimized to improve the sensing performance while meeting the communication and power transfer requirements. The resultant non-convex optimization problems are solved based on a series of techniques including Schur complement transformation and rank reduction. Moreover, when the IRs and ERs are co-located, the power splitting factors are jointly optimized together with the beamformers to balance the performance of communication and power transfer. To better understand the performance of ISCPT, the target positioning problem is further investigated. Simulations are conducted to verify the effectiveness of our proposed designs, which also reveal a performance tradeoff among sensing, communication, and power transfer.

* This paper has been submitted to IEEE for possible publication

Via

Access Paper or Ask Questions

Generative AI for Integrated Sensing and Communication: Insights from the Physical Layer Perspective

Oct 02, 2023

Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Shuguang Cui, Xuemin, Shen, Ping Zhang

Abstract:As generative artificial intelligence (GAI) models continue to evolve, their generative capabilities are increasingly enhanced and being used extensively in content generation. Beyond this, GAI also excels in data modeling and analysis, benefitting wireless communication systems. In this article, we investigate applications of GAI in the physical layer and analyze its support for integrated sensing and communications (ISAC) systems. Specifically, we first provide an overview of GAI and ISAC, touching on GAI's potential support across multiple layers of ISAC. We then concentrate on the physical layer, investigating GAI's applications from various perspectives thoroughly, such as channel estimation, and demonstrate the value of these GAI-enhanced physical layer technologies for ISAC systems. In the case study, the proposed diffusion model-based method effectively estimates the signal direction of arrival under the near-field condition based on the uniform linear array, when antenna spacing surpassing half the wavelength. With a mean square error of 1.03 degrees, it confirms GAI's support for the physical layer in near-field sensing and communications.

Via

Access Paper or Ask Questions

A Tutorial on Environment-Aware Communications via Channel Knowledge Map for 6G

Sep 14, 2023

Yong Zeng, Junting Chen, Jie Xu, Di Wu, Xiaoli Xu, Shi Jin, Xiqi Gao, David Gesbert, Shuguang Cui, Rui Zhang

Abstract:Sixth-generation (6G) mobile communication networks are expected to have dense infrastructures, large-dimensional channels, cost-effective hardware, diversified positioning methods, and enhanced intelligence. Such trends bring both new challenges and opportunities for the practical design of 6G. On one hand, acquiring channel state information (CSI) in real time for all wireless links becomes quite challenging in 6G. On the other hand, there would be numerous data sources in 6G containing high-quality location-tagged channel data, making it possible to better learn the local wireless environment. By exploiting such new opportunities and for tackling the CSI acquisition challenge, there is a promising paradigm shift from the conventional environment-unaware communications to the new environment-aware communications based on the novel approach of channel knowledge map (CKM). This article aims to provide a comprehensive tutorial overview on environment-aware communications enabled by CKM to fully harness its benefits for 6G. First, the basic concept of CKM is presented, and a comparison of CKM with various existing channel inference techniques is discussed. Next, the main techniques for CKM construction are discussed, including both the model-free and model-assisted approaches. Furthermore, a general framework is presented for the utilization of CKM to achieve environment-aware communications, followed by some typical CKM-aided communication scenarios. Finally, important open problems in CKM research are highlighted and potential solutions are discussed to inspire future work.

Via

Access Paper or Ask Questions

Efficient View Synthesis with Neural Radiance Distribution Field

Aug 22, 2023

Yushuang Wu, Xiao Li, Jinglu Wang, Xiaoguang Han, Shuguang Cui, Yan Lu

Figure 1 for Efficient View Synthesis with Neural Radiance Distribution Field

Figure 2 for Efficient View Synthesis with Neural Radiance Distribution Field

Figure 3 for Efficient View Synthesis with Neural Radiance Distribution Field

Figure 4 for Efficient View Synthesis with Neural Radiance Distribution Field

Abstract:Recent work on Neural Radiance Fields (NeRF) has demonstrated significant advances in high-quality view synthesis. A major limitation of NeRF is its low rendering efficiency due to the need for multiple network forwardings to render a single pixel. Existing methods to improve NeRF either reduce the number of required samples or optimize the implementation to accelerate the network forwarding. Despite these efforts, the problem of multiple sampling persists due to the intrinsic representation of radiance fields. In contrast, Neural Light Fields (NeLF) reduce the computation cost of NeRF by querying only one single network forwarding per pixel. To achieve a close visual quality to NeRF, existing NeLF methods require significantly larger network capacities which limits their rendering efficiency in practice. In this work, we propose a new representation called Neural Radiance Distribution Field (NeRDF) that targets efficient view synthesis in real-time. Specifically, we use a small network similar to NeRF while preserving the rendering speed with a single network forwarding per pixel as in NeLF. The key is to model the radiance distribution along each ray with frequency basis and predict frequency weights using the network. Pixel values are then computed via volume rendering on radiance distributions. Experiments show that our proposed method offers a better trade-off among speed, quality, and network size than existing methods: we achieve a ~254x speed-up over NeRF with similar network size, with only a marginal performance decline. Our project page is at yushuang-wu.github.io/NeRDF.

* Accepted by ICCV2023

Via

Access Paper or Ask Questions

LATR: 3D Lane Detection from Monocular Images with Transformer

Aug 20, 2023

Yueru Luo, Chaoda Zheng, Xu Yan, Tang Kun, Chao Zheng, Shuguang Cui, Zhen Li

Figure 1 for LATR: 3D Lane Detection from Monocular Images with Transformer

Figure 2 for LATR: 3D Lane Detection from Monocular Images with Transformer

Figure 3 for LATR: 3D Lane Detection from Monocular Images with Transformer

Figure 4 for LATR: 3D Lane Detection from Monocular Images with Transformer

Abstract:3D lane detection from monocular images is a fundamental yet challenging task in autonomous driving. Recent advances primarily rely on structural 3D surrogates (e.g., bird's eye view) built from front-view image features and camera parameters. However, the depth ambiguity in monocular images inevitably causes misalignment between the constructed surrogate feature map and the original image, posing a great challenge for accurate lane detection. To address the above issue, we present a novel LATR model, an end-to-end 3D lane detector that uses 3D-aware front-view features without transformed view representation. Specifically, LATR detects 3D lanes via cross-attention based on query and key-value pairs, constructed using our lane-aware query generator and dynamic 3D ground positional embedding. On the one hand, each query is generated based on 2D lane-aware features and adopts a hybrid embedding to enhance lane information. On the other hand, 3D space information is injected as positional embedding from an iteratively-updated 3D ground plane. LATR outperforms previous state-of-the-art methods on both synthetic Apollo, realistic OpenLane and ONCE-3DLanes by large margins (e.g., 11.4 gain in terms of F1 score on OpenLane). Code will be released at https://github.com/JMoonr/LATR .

* Accepted by ICCV2023 (Oral)

Via

Access Paper or Ask Questions