Abstract:Self-supervised monocular depth estimation has gathered notable interest since it can liberate training from dependency on depth annotations. In monocular video training case, recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance. To tackle this, we try to synthesize more virtual camera views by flow-based video frame interpolation (VFI), termed as temporal augmentation. For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm and design a VFI-assisted multi-frame fusion module to align and aggregate multi-frame features, using motion and occlusion information obtained by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. In this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, along with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method can bring significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI.
Abstract:Although recent years have witnessed significant advancements in image editing thanks to the remarkable progress of text-to-image diffusion models, the problem of non-rigid image editing still presents its complexities and challenges. Existing methods often fail to achieve consistent results due to the absence of unique identity characteristics. Thus, learning a personalized identity prior might help with consistency in the edited results. In this paper, we explore a novel task: learning the personalized identity prior for text-based non-rigid image editing. To address the problems in jointly learning prior and editing the image, we present LIPE, a two-stage framework designed to customize the generative model utilizing a limited set of images of the same subject, and subsequently employ the model with learned prior for non-rigid image editing. Experimental results demonstrate the advantages of our approach in various editing scenarios over past related leading methods in qualitative and quantitative ways.
Abstract:Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
Abstract:In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation~(ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention~(SiCA) to decouple the single-track cross-attention to a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information from a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
Abstract:In recent years, instruction-based image editing methods have garnered significant attention in image editing. However, despite encompassing a wide range of editing priors, these methods are helpless when handling editing tasks that are challenging to accurately describe through language. We propose InstructBrush, an inversion method for instruction-based image editing methods to bridge this gap. It extracts editing effects from exemplar image pairs as editing instructions, which are further applied for image editing. Two key techniques are introduced into InstructBrush, Attention-based Instruction Optimization and Transformation-oriented Instruction Initialization, to address the limitations of the previous method in terms of inversion effects and instruction generalization. To explore the ability of instruction inversion methods to guide image editing in open scenarios, we establish a TransformationOriented Paired Benchmark (TOP-Bench), which contains a rich set of scenes and editing types. The creation of this benchmark paves the way for further exploration of instruction inversion. Quantitatively and qualitatively, our approach achieves superior performance in editing and is more semantically consistent with the target editing effects.
Abstract:Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photograph. In spite of the great progress in this area, current researches generally meet three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome above challenges, this paper presents DiffFAE, a one-stage and highly-efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity query attributes transfer, we adopt Space-sensitive Physical Customization (SPC), which ensures the fidelity and generalization ability by utilizing rendering texture derived from 3D Morphable Model (3DMM). In order to preserve source attributes, we introduce the Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-regarding features, thereby better preserving the identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.
Abstract:Maxwell's equations are a collection of coupled partial differential equations (PDEs) that, together with the Lorentz force law, constitute the basis of classical electromagnetism and electric circuits. Effectively solving Maxwell's equations is crucial in various fields, like electromagnetic scattering and antenna design optimization. Physics-informed neural networks (PINNs) have shown powerful ability in solving PDEs. However, PINNs still struggle to solve Maxwell's equations in heterogeneous media. To this end, we propose a domain-adaptive PINN (da-PINN) to solve inverse problems of Maxwell's equations in heterogeneous media. First, we propose a location parameter of media interface to decompose the whole domain into several sub-domains. Furthermore, the electromagnetic interface conditions are incorporated into a loss function to improve the prediction performance near the interface. Then, we propose a domain-adaptive training strategy for da-PINN. Finally, the effectiveness of da-PINN is verified with two case studies.
Abstract:Adversarial examples bring a considerable security threat to support vector machines (SVMs), especially those used in safety-critical applications. Thus, robustness verification is an essential issue for SVMs, which can provide provable robustness against various kinds of adversary attacks. The evaluation results obtained through the robustness verification can provide a safe guarantee for the use of SVMs. The existing verification method does not often perform well in verifying SVMs with nonlinear kernels. To this end, we propose a method to improve the verification performance for SVMs with nonlinear kernels. We first formalize the adversarial robustness evaluation of SVMs as an optimization problem. Then a lower bound of the original problem is obtained by solving the Lagrangian dual problem of the original problem. Finally, the adversarial robustness of SVMs is evaluated concerning the lower bound. We evaluate the adversarial robustness of SVMs with linear and nonlinear kernels on the MNIST and Fashion-MNIST datasets. The experimental results show that the percentage of provable robustness obtained by our method on the test set is better than that of the state-of-the-art.
Abstract:Accurate mapping of forests is critical for forest management and carbon stocks monitoring. Deep learning is becoming more popular in Earth Observation (EO), however, the availability of reference data limits its potential in wide-area forest mapping. To overcome those limitations, here we introduce contrastive regression into EO based forest mapping and develop a novel semisupervised regression framework for wall-to-wall mapping of continuous forest variables. It combines supervised contrastive regression loss and semi-supervised Cross-Pseudo Regression loss. The framework is demonstrated over a boreal forest site using Copernicus Sentinel-1 and Sentinel-2 imagery for mapping forest tree height. Achieved prediction accuracies are strongly better compared to using vanilla UNet or traditional regression models, with relative RMSE of 15.1% on stand level. We expect that developed framework can be used for modeling other forest variables and EO datasets.