Ziwei Liu

Nanyang Technological University

Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Sep 25, 2024

Hui En Pang, Shuai Liu, Zhongang Cai, Lei Yang, Tianwei Zhang, Ziwei Liu

Abstract: We present Disco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. Unlike existing methods, Disco4D disentangles clothing (with Gaussian models) from the human body (with the SMPL-X model), significantly enhancing generation detail and flexibility. It has the following technical innovations. 1) Disco4D learns to efficiently fit the clothing Gaussians over the SMPL-X Gaussians. 2) It adopts diffusion models to enhance the 3D generation process, e.g., modeling occluded parts not visible in the input image. 3) It learns an identity encoding for each clothing Gaussian to facilitate the separation and extraction of clothing assets. Furthermore, Disco4D naturally supports 4D human animation with vivid dynamics. Extensive experiments demonstrate the superiority of Disco4D on 4D human generation and animation tasks. Our visualizations can be found at https://disco-4d.github.io/.

GroupDiff: Diffusion-based Group Portrait Editing

Sep 22, 2024

Yuming Jiang, Nanxuan Zhao, Qing Liu, Krishna Kumar Singh, Shuai Yang, Chen Change Loy, Ziwei Liu

Abstract: Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data engine covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep the appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the location of each person are used to reweight the attention matrix so that the features of each person can be injected into the correct places. This inter-person guidance provides a flexible means of manipulation. Extensive experiments demonstrate that GroupDiff exhibits state-of-the-art performance compared to existing methods. GroupDiff offers controllability for editing and maintains the fidelity of the original photos.

* ECCV 2024

Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Sep 17, 2024

Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau

Abstract: In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.

* Project page: https://RAG-3D.github.io/

SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality

Sep 12, 2024

Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang

Abstract: Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework SimMAT to study an open problem: the transferability from vision foundation models trained on natural RGB images to other image modalities of different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate the transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. Specifically, SimMAT can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields for better results with vision foundation models.

* Github link: https://github.com/mt-cly/SimMAT
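
The abstract above reports segmentation quality as mIoU. As a reminder of what that metric measures, here is a minimal sketch of mean intersection-over-union on flat label lists. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, not taken from the SimMAT code or benchmark.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length lists of integer labels.

    Assumption: a class that appears in neither list is skipped rather than
    counted as IoU 0 or 1; evaluation protocols differ on this point.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labeled c in both lists (intersection) or in either (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

For the toy labels the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5 up to floating-point rounding.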

Transmissive RIS Enabled Transceiver Systems: Architecture, Design Issues and Opportunities

Aug 24, 2024

Zhendong Li, Wen Chen, Qingqing Wu, Ziwei Liu, Chong He, Xudong Bai, Jun Li

Abstract: Reconfigurable intelligent surface (RIS) is anticipated to augment the performance of beyond fifth-generation (B5G) and sixth-generation (6G) networks by intelligently manipulating the state of its components. Rather than employing reflective RIS for aided communications, this paper proposes an innovative transmissive RIS-enabled transceiver (TRTC) architecture that can accomplish the functions of traditional multi-antenna systems in a cost-effective and energy-efficient manner. First, the proposed network architecture and its corresponding transmission scheme are elaborated from the perspectives of downlink (DL) and uplink (UL) transmissions. Then, we illustrate several significant advantages and differences of TRTC compared to other multi-antenna systems. Furthermore, the downlink modulation and extraction principle based on time-modulation array (TMA) is introduced in detail to handle multi-stream communications. Moreover, a near-far field channel model appropriate for this architecture is proposed. Based on the channel model, we summarize some state-of-the-art channel estimation schemes, and the channel estimation scheme of TRTC is also provided. Considering the optimization for DL and UL communications, we present numerical simulations that confirm the superiority of the proposed optimization algorithm. Lastly, numerous prospective research avenues for TRTC systems are delineated to inspire further exploration.

* IEEE VTM, 2024

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Aug 23, 2024

Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Yixuan Li, Gordon Wetzstein, Ziwei Liu, Dahua Lin

Abstract: 3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for free exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representation to represent large FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) We introduce a novel text-guided anchor view synthesis pipeline for high-quality, consistent panorama generation. 2) We pioneer the Layered 3D Panorama as the underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scenes in terms of both full-view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications.

* Project page: https://ys-imtech.github.io/projects/LayerPano3D/

Bidirectional Gated Mamba for Sequential Recommendation

Aug 21, 2024

Ziwei Liu, Qidong Liu, Yejing Wang, Wanyu Wang, Pengyue Jia, Maolin Wang, Zitao Liu, Yi Chang, Xiangyu Zhao

Abstract: In various domains, Sequential Recommender Systems (SRS) have become essential due to their superior capability to discern intricate user preferences. Typically, SRS utilize transformer-based architectures to forecast the subsequent item within a sequence. Nevertheless, the quadratic computational complexity inherent in these models often leads to inefficiencies, hindering the achievement of real-time recommendations. Mamba, a recent advancement, has exhibited exceptional performance in time series prediction, significantly enhancing both efficiency and accuracy. However, integrating Mamba directly into SRS poses several challenges. Its inherently unidirectional nature may constrain the model's capacity to capture the full context of user-item interactions, while its instability in state estimation can compromise its ability to detect short-term patterns within interaction sequences. To overcome these issues, we introduce a new framework named SelectIve Gated MAmba (SIGMA). This framework leverages a Partially Flipped Mamba (PF-Mamba) to construct a bidirectional architecture specifically tailored to improve contextual modeling. Additionally, an input-sensitive Dense Selective Gate (DS Gate) is employed to optimize directional weights and enhance the processing of sequential information in PF-Mamba. For short sequence modeling, we have also developed a Feature Extract GRU (FE-GRU) to efficiently capture short-term dependencies. Empirical results indicate that SIGMA outperforms current models on five real-world datasets. Our implementation code is available at https://github.com/ziwliu-cityu/SIMGA to ease reproducibility.

Movable Antenna Enabled Symbiotic Radio Systems: An Opportunity for Mutualism

Aug 11, 2024

Chao Zhou, Bin Lyu, Changsheng You, Ziwei Liu

Abstract: In this letter, we propose a new movable antenna (MA) enabled symbiotic radio (SR) system that leverages the movement of MAs to maximize both the primary and secondary rates, thereby promoting their mutualism. Specifically, the primary transmitter (PT) equipped with MAs utilizes a maximum ratio transmission (MRT) beamforming scheme to ensure the highest primary rate at the primary user (PU). Concurrently, the backscatter device (BD) establishes the secondary transmission by overlaying onto the primary signal. The utilization of MAs aims to enhance the secondary rate by optimizing the positions of MAs to improve the beam gain at the BD. Accordingly, the beam gains for both MA and fixed-position antenna (FPA) scenarios are analyzed, confirming the effectiveness of the MA scheme in achieving the highest primary and secondary rates. Numerical results verify the superiority of our proposed MA enabled scheme.

* 5 pages, 5 figures. Accepted to IEEE Wireless Communications Letters
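
The abstract above relies on maximum ratio transmission (MRT) at the primary transmitter. The following is a minimal sketch of textbook MRT for a single-user multi-antenna link, not the letter's system model: it assumes the received signal is proportional to the sum of h_i * w_i, so the unit-norm MRT weights are w = conj(h) / ||h||. The function names and the example channel vector are hypothetical.

```python
import math


def mrt_weights(h):
    """Unit-norm MRT beamforming weights w = conj(h) / ||h||.

    Assumes the convention that the received signal is sum(h_i * w_i) * s.
    """
    norm = math.sqrt(sum(abs(x) ** 2 for x in h))
    if norm == 0:
        raise ValueError("channel vector must be non-zero")
    return [complex(x).conjugate() / norm for x in h]


def beam_gain(h, w):
    """Power gain |sum(h_i * w_i)|^2 seen by the receiver."""
    return abs(sum(hi * wi for hi, wi in zip(h, w))) ** 2


# Hypothetical three-antenna channel.
h = [1 + 1j, 0.5 - 2j, -1.5 + 0.5j]
w = mrt_weights(h)
gain = beam_gain(h, w)
channel_power = sum(abs(x) ** 2 for x in h)
print(f"beam gain = {gain:.4f}, ||h||^2 = {channel_power:.4f}")
```

With unit transmit power, MRT aligns the phases of all antenna contributions, so the beam gain equals the total channel power ||h||^2, which is the largest gain any unit-norm weight vector can achieve by the Cauchy-Schwarz inequality.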

LLaVA-OneVision: Easy Visual Task Transfer

Aug 06, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

* Project Homepage: https://llava-vl.github.io/blog/2024-08-05-llava-onevision/

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Aug 06, 2024

Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu (+3 more)

Abstract: Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose ReSyncer, a unified and effective framework that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.

* Accepted to European Conference on Computer Vision (ECCV), 2024. Project page: https://guanjz20.github.io/projects/ReSyncer
