Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yao Yao

Dennis

MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation

Dec 18, 2024

Shenhao Zhu, Lingteng Qiu, Xiaodong Gu, Zhengyi Zhao, Chao Xu, Yuxiao He, Zhe Li, Xiaoguang Han, Yao Yao, Xun Cao(+4 more)

Figure 1 for MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation

Figure 2 for MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation

Figure 3 for MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation

Figure 4 for MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation

Abstract:Existing 2D methods utilize UNet-based diffusion models to generate multi-view physically-based rendering (PBR) maps but struggle with multi-view inconsistency, while some 3D methods directly generate UV maps, encountering generalization issues due to the limited 3D data. To address these problems, we propose a two-stage approach, including multi-view generation and UV materials refinement. In the generation stage, we adopt a Diffusion Transformer (DiT) model to generate PBR materials, where both the specially designed multi-branch DiT and reference-based DiT blocks adopt a global attention mechanism to promote feature interaction and fusion between different views, thereby improving multi-view consistency. In addition, we adopt a PBR-based diffusion loss to ensure that the generated materials align with realistic physical principles. In the refinement stage, we propose a material-refined DiT that performs inpainting in empty areas and enhances details in UV space. Except for the normal condition, this refinement also takes the material map from the generation stage as an additional condition to reduce the learning difficulty and improve generalization. Extensive experiments show that our method achieves state-of-the-art performance in texturing 3D objects with PBR materials and provides significant advantages for graphics relighting applications. Project Page: https://lingtengqiu.github.io/2024/MCMat/

* Project Page: https://lingtengqiu.github.io/2024/MCMat/

Via

Access Paper or Ask Questions

4D SlingBAG: spatial-temporal coupled Gaussian ball for large-scale dynamic 3D photoacoustic iterative reconstruction

Dec 05, 2024

Shuang Li, Yibing Wang, Jian Gao, Chulhong Kim, Seongwook Choi, Yu Zhang, Qian Chen, Yao Yao, Changhui Li

Abstract:Large-scale dynamic three-dimensional (3D) photoacoustic imaging (PAI) is significantly important in clinical applications. In practical implementations, large-scale 3D real-time PAI systems typically utilize sparse two-dimensional (2D) sensor arrays with certain angular deficiencies, necessitating advanced iterative reconstruction (IR) algorithms to achieve quantitative PAI and reduce reconstruction artifacts. However, for existing IR algorithms, multi-frame 3D reconstruction leads to extremely high memory consumption and prolonged computation time, with limited consideration of the spatial-temporal continuity between data frames. Here, we propose a novel method, named the 4D sliding Gaussian ball adaptive growth (4D SlingBAG) algorithm, based on the current point cloud-based IR algorithm sliding Gaussian ball adaptive growth (SlingBAG), which has minimal memory consumption among IR methods. Our 4D SlingBAG method applies spatial-temporal coupled deformation functions to each Gaussian sphere in point cloud, thus explicitly learning the deformations features of the dynamic 3D PA scene. This allows for the efficient representation of various physiological processes (such as pulsation) or external pressures (e.g., blood perfusion experiments) contributing to changes in vessel morphology and blood flow during dynamic 3D PAI, enabling highly efficient IR for dynamic 3D PAI. Simulation experiments demonstrate that 4D SlingBAG achieves high-quality dynamic 3D PA reconstruction. Compared to performing reconstructions by using SlingBAG algorithm individually for each frame, our method significantly reduces computational time and keeps a extremely low memory consumption. The project for 4D SlingBAG can be found in the following GitHub repository: \href{https://github.com/JaegerCQ/4D-SlingBAG}{https://github.com/JaegerCQ/4D-SlingBAG}.

Via

Access Paper or Ask Questions

FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video

Nov 23, 2024

Jiawei Zhang, Zijian Wu, Zhiyang Liang, Yicheng Gong, Dongfang Hu, Yao Yao, Xun Cao, Hao Zhu

Abstract:Reconstructing high-fidelity, animatable 3D head avatars from effortlessly captured monocular videos is a pivotal yet formidable challenge. Although significant progress has been made in rendering performance and manipulation capabilities, notable challenges remain, including incomplete reconstruction and inefficient Gaussian representation. To address these challenges, we introduce FATE, a novel method for reconstructing an editable full-head avatar from a single monocular video. FATE integrates a sampling-based densification strategy to ensure optimal positional distribution of points, improving rendering efficiency. A neural baking technique is introduced to convert discrete Gaussian representations into continuous attribute maps, facilitating intuitive appearance editing. Furthermore, we propose a universal completion framework to recover non-frontal appearance, culminating in a 360$^\circ$-renderable 3D head avatar. FATE outperforms previous approaches in both qualitative and quantitative evaluations, achieving state-of-the-art performance. To the best of our knowledge, FATE is the first animatable and 360$^\circ$ full-head monocular reconstruction method for a 3D head avatar. The code will be publicly released upon publication.

* project page: https://zjwfufu.github.io/FATE-page/

Via

Access Paper or Ask Questions

Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Oct 13, 2024

Gang Li, Wendi Yu, Yao Yao, Wei Tong, Yingbin Liang, Qihang Lin, Tianbao Yang

Figure 1 for Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Figure 2 for Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Figure 3 for Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Figure 4 for Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Abstract:In the real world, a learning-enabled system usually undergoes multiple cycles of model development to enhance the system's ability to handle difficult or emerging tasks. This continual model development process raises a significant issue that the model development for acquiring new or improving existing capabilities may inadvertently lose capabilities of the old model, also known as catastrophic forgetting. Existing continual learning studies focus on mitigating catastrophic forgetting by trading off performance on previous tasks and new tasks to ensure good average performance. However, they are inadequate for many applications especially in safety-critical domains, as failure to strictly preserve the performance of the old model not only introduces safety risks and uncertainties but also imposes substantial expenses in the re-improving and re-validation of existing properties. To address this issue, we introduce model developmental safety as a guarantee of a learning system such that in the model development process the new model should strictly preserve the existing protected capabilities of the old model while improving its performance on target tasks. To ensure the model developmental safety, we present a safety-centric framework by formulating the model developmental safety as data-dependent constraints. Under this framework, we study how to develop a pretrained vision-language model (aka the CLIP model) for acquiring new capabilities or improving existing capabilities of image classification. We propose an efficient constrained optimization algorithm with theoretical guarantee and use its insights to finetune a CLIP model with task-dependent heads for promoting the model developmental safety. Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.

* 40 pages, 7 figures

Via

Access Paper or Ask Questions

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Oct 10, 2024

Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang

Figure 1 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Figure 2 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Figure 3 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Figure 4 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Abstract:Recent advances in latent diffusion-based generative models for portrait image animation, such as Hallo, have achieved impressive results in short-duration video synthesis. In this paper, we present updates to Hallo, introducing several design enhancements to extend its capabilities. First, we extend the method to produce long-duration videos. To address substantial challenges such as appearance drift and temporal artifacts, we investigate augmentation strategies within the image space of conditional motion frames. Specifically, we introduce a patch-drop technique augmented with Gaussian noise to enhance visual consistency and temporal coherence over long duration. Second, we achieve 4K resolution portrait video generation. To accomplish this, we implement vector quantization of latent codes and apply temporal alignment techniques to maintain coherence across the temporal dimension. By integrating a high-quality decoder, we realize visual synthesis at 4K resolution. Third, we incorporate adjustable semantic textual labels for portrait expressions as conditional inputs. This extends beyond traditional audio cues to improve controllability and increase the diversity of the generated content. To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts. We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced "Wild" dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for duration extending up to tens of minutes. Project page https://fudan-generative-vision.github.io/hallo2

Via

Access Paper or Ask Questions

Towards Native Generative Model for 3D Head Avatar

Oct 02, 2024

Yiyu Zhuang, Yuxiao He, Jiawei Zhang, Yanwen Wang, Jiahe Zhu, Yao Yao, Siyu Zhu, Xun Cao, Hao Zhu

Abstract:Creating 3D head avatars is a significant yet challenging task for many applicated scenarios. Previous studies have set out to learn 3D human head generative models using massive 2D image data. Although these models are highly generalizable for human appearance, their result models are not 360$^\circ$-renderable, and the predicted 3D geometry is unreliable. Therefore, such results cannot be used in VR, game modeling, and other scenarios that require 360$^\circ$-renderable 3D head models. An intuitive idea is that 3D head models with limited amount but high 3D accuracy are more reliable training data for a high-quality 3D generative model. In this vein, we delve into how to learn a native generative model for 360$^\circ$ full head from a limited 3D head dataset. Specifically, three major problems are studied: 1) how to effectively utilize various representations for generating the 360$^\circ$-renderable human head; 2) how to disentangle the appearance, shape, and motion of human faces to generate a 3D head model that can be edited by appearance and driven by motion; 3) and how to extend the generalization capability of the generative model to support downstream tasks. Comprehensive experiments are conducted to verify the effectiveness of the proposed model. We hope the proposed models and artist-designed dataset can inspire future research on learning native generative 3D head models from limited 3D datasets.

Via

Access Paper or Ask Questions

Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

Sep 30, 2024

Luohe Shi, Yao Yao, Zuchao Li, Lefei Zhang, Hai Zhao

Figure 1 for Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

Figure 2 for Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

Figure 3 for Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

Figure 4 for Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

Abstract:Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks. ICL typically constructs a few-shot learning scenario, either manually or by setting up a Retrieval-Augmented Generation (RAG) system, helping models quickly grasp domain knowledge or question-answering patterns without changing model parameters. However, this approach involves trade-offs, such as slower inference speed and increased space occupancy. PEFT assists the model in adapting to tasks through minimal parameter modifications, but the training process still demands high hardware requirements, even with a small number of parameters involved. To address these challenges, we propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning, maintaining low inference costs. RTD constructs a reference datastore from the provided training examples and optimizes the LLM's final vocabulary distribution by flexibly selecting suitable references based on the input, resulting in more trustable responses and enabling the model to adapt to downstream tasks at a low cost. Experimental evaluations on various LLMs using different benchmarks demonstrate that RTD establishes a new paradigm for augmenting models to downstream tasks. Furthermore, our method exhibits strong orthogonality with traditional methods, allowing for concurrent usage.

Via

Access Paper or Ask Questions

4D Diffusion for Dynamic Protein Structure Prediction with Reference Guided Motion Alignment

Aug 22, 2024

Kaihui Cheng, Ce Liu, Qingkun Su, Jun Wang, Liwei Zhang, Yining Tang, Yao Yao, Siyu Zhu, Yuan Qi

Figure 1 for 4D Diffusion for Dynamic Protein Structure Prediction with Reference Guided Motion Alignment

Figure 2 for 4D Diffusion for Dynamic Protein Structure Prediction with Reference Guided Motion Alignment

Figure 3 for 4D Diffusion for Dynamic Protein Structure Prediction with Reference Guided Motion Alignment

Figure 4 for 4D Diffusion for Dynamic Protein Structure Prediction with Reference Guided Motion Alignment

Abstract:Protein structure prediction is pivotal for understanding the structure-function relationship of proteins, advancing biological research, and facilitating pharmaceutical development and experimental design. While deep learning methods and the expanded availability of experimental 3D protein structures have accelerated structure prediction, the dynamic nature of protein structures has received limited attention. This study introduces an innovative 4D diffusion model incorporating molecular dynamics (MD) simulation data to learn dynamic protein structures. Our approach is distinguished by the following components: (1) a unified diffusion model capable of generating dynamic protein structures, including both the backbone and side chains, utilizing atomic grouping and side-chain dihedral angle predictions; (2) a reference network that enhances structural consistency by integrating the latent embeddings of the initial 3D protein structures; and (3) a motion alignment module aimed at improving temporal structural coherence across multiple time steps. To our knowledge, this is the first diffusion-based model aimed at predicting protein trajectories across multiple time steps simultaneously. Validation on benchmark datasets demonstrates that our model exhibits high accuracy in predicting dynamic 3D structures of proteins containing up to 256 amino acids over 32 time steps, effectively capturing both local flexibility in stable states and significant conformational changes.

Via

Access Paper or Ask Questions

Deep Joint Denoising and Detection for Enhanced Intracellular Particle Analysis

Aug 15, 2024

Yao Yao, Ihor Smal, Ilya Grigoriev, Anna Akhmanova, Erik Meijering

Figure 1 for Deep Joint Denoising and Detection for Enhanced Intracellular Particle Analysis

Figure 2 for Deep Joint Denoising and Detection for Enhanced Intracellular Particle Analysis

Figure 3 for Deep Joint Denoising and Detection for Enhanced Intracellular Particle Analysis

Figure 4 for Deep Joint Denoising and Detection for Enhanced Intracellular Particle Analysis

Abstract:Reliable analysis of intracellular dynamic processes in time-lapse fluorescence microscopy images requires complete and accurate tracking of all small particles in all time frames of the image sequences. A fundamental first step towards this goal is particle detection. Given the small size of the particles, their detection is greatly affected by image noise. Recent studies have shown that applying image denoising as a preprocessing step indeed improves particle detection and their subsequent tracking. Deep learning based particle detection methods have shown superior results compared to traditional detection methods. However, they do not explicitly aim to remove noise from the images to facilitate detection. Thus we hypothesize that their performance could be further improved. In this paper, we propose a new deep neural network, called DENODET (denoising-detection network), which performs image denoising and particle detection simultaneously. We show that integrative denoising and detection yields more accurate detection results. Our method achieves superior results compared to state-of-the-art particle detection methods on the particle tracking challenge dataset and our own real fluorescence microscopy image data.

* 11 pages, 4 figures, 4 tables

Via

Access Paper or Ask Questions

Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Aug 05, 2024

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao

Figure 1 for Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Figure 2 for Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Figure 3 for Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Figure 4 for Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Abstract:This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general setting is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using our simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness (i.e., action accuracy) of multimodal agents, our findings indicate that these agents are prone to environmental distractions, resulting in unfaithful behaviors. Furthermore, we switch to the adversarial perspective and implement environment injection, demonstrating that such unfaithfulness can be exploited, leading to unexpected risks.

Via

Access Paper or Ask Questions