Topic: photo

Ingredients: Blending Custom Photos with Video Diffusion Transformers

Jan 03, 2025

Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan

Abstract: This paper presents a powerful framework, referred to as Ingredients, for customizing video creation by incorporating multiple specific identity (ID) photos with video diffusion Transformers. Our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image query in video diffusion Transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing methods, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients

DreamDrive: Generative 4D Scene Modeling from Street View Images

Jan 03, 2025

Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang

Abstract: Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

* Project page: https://pointscoder.github.io/DreamDrive/

A Novel Approach using CapsNet and Deep Belief Network for Detection and Identification of Oral Leukopenia

Jan 01, 2025

Hirthik Mathesh GV, Kavin Chakravarthy M, Sentil Pandi S

Abstract: Oral cancer constitutes a significant global health concern, resulting in 277,484 fatalities in 2023, with the highest prevalence observed in low- and middle-income nations. Automating the detection of potentially malignant and malignant lesions in the oral cavity could enable cost-effective and early disease diagnosis. Establishing an extensive repository of meticulously annotated oral lesions is essential. In this research, photos are collected from global clinical experts, who have been equipped with an annotation tool to generate comprehensive labelling. This research presents a novel approach for integrating bounding box annotations from various doctors. Additionally, a Deep Belief Network combined with CapsNet is employed to develop automated systems that extract intricate patterns to address this challenging problem. This study evaluated two deep learning-based computer vision methodologies for the automated detection and classification of oral lesions to facilitate the early detection of oral cancer: image classification utilizing CapsNet, and object detection. Image classification attained an F1 score of 94.23% for detecting photos with lesions and 93.46% for identifying images necessitating referral. Object detection attained an F1 score of 89.34% for identifying lesions for referral. Further results are reported for classification by type of referral decision. Our preliminary findings indicate that deep learning is capable of addressing this complex problem.

* Accepted to the IEEE International Conference on Advancement in Communication and Computing Technology (INOACC), to be held at Sai Vidya Institute of Technology, Bengaluru, Karnataka, India. (Preprint)
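
The F1 scores quoted above combine precision and recall into a single number. As a minimal sketch, assuming binary labels (1 = lesion present, 0 = no lesion), the metric can be computed as follows. The function name and the toy labels are illustrative and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))
```

Because F1 ignores true negatives, it is often preferred over accuracy when images with lesions are much rarer than images without them.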

UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI

Dec 30, 2024

Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, Yizhou Wang

Abstract: We introduce UnrealZoo, a rich collection of photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open worlds. Additionally, we offer a variety of playable entities for embodied AI agents. Based on UnrealCV, we provide a suite of easy-to-use Python APIs and tools for various potential applications, such as data collection, environment augmentation, distributed training, and benchmarking. We optimize the rendering and communication efficiency of UnrealCV to support advanced applications, such as multi-agent interaction. Our experiments benchmark agents in various complex scenes, focusing on visual navigation and tracking, which are fundamental capabilities for embodied visual intelligence. The results yield valuable insights into the advantages of diverse training environments for reinforcement learning (RL) agents and the challenges faced by current embodied vision agents, including those based on RL and large vision-language models (VLMs), in open worlds. These challenges involve latency in closed-loop control in dynamic scenes and reasoning about 3D spatial structures in unstructured terrain.

* Project page: http://unrealzoo.site/

Protégé: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)

Dec 29, 2024

Jia Wei Sii, Chee Seng Chan

Abstract: Makeup is no longer confined to physical application; people now use mobile apps to digitally apply makeup to their photos, which they then share on social media. However, while this shift has made makeup more accessible, designing diverse makeup styles tailored to individual faces remains a challenge, and this design work must currently be done manually. Existing systems, such as makeup recommendation engines and makeup transfer techniques, are limited in creating innovative makeup for different individuals "intuitively": they require significant user effort and knowledge, and the makeup options available in an app are limited. Our motivation is to address this challenge by proposing Protégé, a new makeup application that leverages a recent class of generative models, GANs, to learn and automatically generate makeup styles. This is a task that existing makeup applications (i.e., makeup recommendation systems using expert systems, and makeup transfer methods) are unable to perform. Extensive experiments have been conducted to demonstrate the capability of Protégé in learning and creating diverse makeup in a convenient and intuitive way, marking a significant leap in digital makeup technology!

* 8 pages, 5 figures

Dust to Tower: Coarse-to-Fine Photo-Realistic Scene Reconstruction from Sparse Uncalibrated Images

Dec 27, 2024

Xudong Cai, Yongcai Wang, Zhaoxin Fan, Deng Haoran, Shuo Wang, Wanting Li, Deying Li, Lun Luo, Minhang Wang, Jintao Xu

Abstract: Photo-realistic scene reconstruction from sparse-view, uncalibrated images is in high demand in practice. Although some progress has been made, existing methods are either sparse-view but require accurate camera parameters (i.e., intrinsic and extrinsic), or SfM-free but need densely captured images. To combine the advantages of both methods while addressing their respective weaknesses, we propose Dust to Tower (D2T), an accurate and efficient coarse-to-fine framework to optimize 3DGS and image poses simultaneously from sparse and uncalibrated images. Our key idea is to first construct a coarse model efficiently and subsequently refine it using warped and inpainted images at novel viewpoints. To do this, we first introduce a Coarse Construction Module (CCM) which exploits a fast Multi-View Stereo model to initialize a 3D Gaussian Splatting (3DGS) model and recover initial camera poses. To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module to refine the coarse depth maps by aligning their confident parts with depths estimated by a Mono-depth model. Then, a Warped Image-Guided Inpainting (WIGI) module is proposed to warp the training images to novel viewpoints using the refined depth maps, and inpainting is applied to fill the "holes" in the warped images caused by view-direction changes, providing high-quality supervision to further optimize the 3D model and the camera poses. Extensive experiments and ablation studies demonstrate the validity of D2T and its design choices, achieving state-of-the-art performance in both novel view synthesis and pose estimation while keeping high efficiency. Code will be publicly available.

DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images

Dec 25, 2024

Enbo Huang, Yuan Zhang, Faliang Huang, Guangyu Zhang, Yang Liu

Abstract: Person image synthesis with controllable body poses and appearances is an essential task owing to practical needs in virtual try-on, image editing, and video production. However, existing methods face significant challenges with missing details, limb distortion, and garment style deviation. To address these issues, we propose a Disentangled Representations Diffusion Model (DRDM) to generate photo-realistic images from source portraits in specific desired poses and appearances. First, a pose encoder is responsible for encoding pose features into a high-dimensional space to guide the generation of person images. Second, a body-part subspace decoupling block (BSDB) disentangles features from the different body parts of a source figure and feeds them to the various layers of the noise prediction block, thereby supplying the network with rich disentangled features for generating a realistic target image. Moreover, during inference, we develop a parsing map-based disentangled classifier-free guided sampling method, which amplifies the conditional signals of texture and pose. Extensive experimental results on the DeepFashion dataset demonstrate the effectiveness of our approach in achieving pose transfer and appearance control.
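
Classifier-free guidance with more than one condition is commonly written as a weighted sum of differences between noise predictions. The sketch below shows that generic form for two conditions (pose first, then texture) on plain Python lists. The function name, the guidance weights, and the order in which the conditions are added are assumptions for illustration; the paper's parsing map-based variant may differ.

```python
def guided_noise(eps_uncond, eps_pose, eps_full, w_pose=2.0, w_texture=2.0):
    """Generic two-condition classifier-free guidance (illustrative only).

    eps_uncond: noise predicted with no conditions
    eps_pose:   noise predicted with the pose condition only
    eps_full:   noise predicted with pose and texture conditions
    """
    return [
        # Start from the unconditional prediction, then push along the
        # pose direction and the texture direction, each scaled separately.
        u + w_pose * (p - u) + w_texture * (f - p)
        for u, p, f in zip(eps_uncond, eps_pose, eps_full)
    ]


# Hypothetical one-element predictions, for illustration only.
print(guided_noise([0.0], [1.0], [1.5]))
```

With both weights set to 1 the expression reduces to the fully conditioned prediction; weights above 1 amplify the corresponding conditional signal.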

Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction

Dec 25, 2024

Dapeng Zhao, Yue Qi

Abstract: Single-view 3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the input is an unobstructed face, which makes them unsuitable for in-the-wild conditions. We present a 3D face reconstruction method that removes eyeglasses from a single image. Existing facial reconstruction methods fail to remove eyeglasses automatically when generating a photo-realistic 3D face "in-the-wild". The innovation of our method lies in a process for robustly identifying the eyeglasses area and removing it intelligently. In this work, we estimate the 2D face structure at a reasonable position of the eyeglasses area, which is used for the construction of 3D texture. An excellent anti-eyeglasses face reconstruction method should ensure the authenticity of the output, including the topological structure between the eyes, nose, and mouth. We achieve this via a deep learning architecture that performs direct regression of a 3DMM representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related face parsing task can be incorporated into the proposed framework to help improve reconstruction quality. We conduct extensive experiments on existing 3D face reconstruction tasks as concrete examples to demonstrate the method's superior regulation ability in cases where existing methods often break down.

* arXiv admin note: text overlap with arXiv:2412.18920

LatentCRF: Continuous CRF for Efficient Latent Diffusion

Dec 24, 2024

Kanchana Ranasinghe, Sadeep Jayasumana, Andreas Veit, Ayan Chakrabarti, Daniel Glasner, Michael S Ryoo, Srikumar Ramalingam, Sanjiv Kumar

Abstract: Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images; however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally intensive LDM inference iterations with our lightweight LatentCRF, we achieve a superior balance between quality, speed and diversity. We increase inference efficiency by 33% with no loss in image quality or diversity compared to the full LDM. LatentCRF is an easy add-on, which does not require modifying the LDM.

Adversarial Attack Against Images Classification based on Generative Adversarial Networks

Dec 24, 2024

Yahe Yang

Abstract: Adversarial attacks on image classification systems have long been an important problem in machine learning, and generative adversarial networks (GANs), as popular models for image generation, have been widely used in various novel scenarios due to their powerful generative capabilities. However, with the popularity of generative adversarial networks, the misuse of fake image technology has raised a series of security problems, such as malicious tampering with other people's photos and videos, and invasion of personal privacy. Inspired by generative adversarial networks, this work proposes a novel adversarial attack method, aiming to gain insight into the weaknesses of image classification systems and improve their resistance to attack. Specifically, generative adversarial networks are used to generate adversarial samples whose perturbations are small but sufficient to affect the decision-making of the classifier, and the adversarial samples are generated through the adversarial training of the generator and the classifier. Through extensive experimental analysis, we evaluate the effectiveness of the method on a classical image classification dataset, and the results show that our model successfully deceives a variety of advanced classifiers while maintaining the naturalness of adversarial samples.

* 7 pages, 6 figures
