Recent advances in single-image 3D face reconstruction have shown remarkable progress in various applications. Nevertheless, prevailing techniques tend to prioritize the global facial contour and expression, often neglecting the nuanced dynamics of the eye region. In response, we introduce an Eye Landmark Adjustment Module, complemented by a Local Dynamic Loss, designed to capture the dynamic features of the eyes area. Our module allows for flexible adjustment of landmarks, resulting in accurate recreation of various eye states. In this paper, we present a comprehensive evaluation of our approach, conducting extensive experiments on two datasets. The results underscore the superior performance of our approach, highlighting its significant contributions in addressing this particular challenge.
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space. However, most existing studies were conducted under a supervised setup which cannot scale well while handling various new data. Unsupervised domain adaptive BEV, which effective learning from various unlabelled target data, is far under-explored. In this work, we design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features. DA-BEV introduces the idea of query into the domain adaptation framework to derive useful information from image-view and BEV features. It consists of two query-based designs, namely, query-based adversarial learning (QAL) and query-based self-training (QST), which exploits image-view features or BEV features to regularize the adaptation of the other. Extensive experiments show that DA-BEV achieves superior domain adaptive BEV perception performance consistently across multiple datasets and tasks such as 3D object detection and 3D scene segmentation.
Datasets encountered in scientific and engineering applications appear in complex formats (e.g., images, multivariate time series, molecules, video, text strings, networks). Graph theory provides a unifying framework to model such datasets and enables the use of powerful tools that can help analyze, visualize, and extract value from data. In this work, we present PlasmoData.jl, an open-source, Julia framework that uses concepts of graph theory to facilitate the modeling and analysis of complex datasets. The core of our framework is a general data modeling abstraction, which we call a DataGraph. We show how the abstraction and software implementation can be used to represent diverse data objects as graphs and to enable the use of tools from topology, graph theory, and machine learning (e.g., graph neural networks) to conduct a variety of tasks. We illustrate the versatility of the framework by using real datasets: i) an image classification problem using topological data analysis to extract features from the graph model to train machine learning models; ii) a disease outbreak problem where we model multivariate time series as graphs to detect abnormal events; and iii) a technology pathway analysis problem where we highlight how we can use graphs to navigate connectivity. Our discussion also highlights how PlasmoData.jl leverages native Julia capabilities to enable compact syntax, scalable computations, and interfaces with diverse packages.
Black-box query-based attacks constitute significant threats to Machine Learning as a Service (MLaaS) systems since they can generate adversarial examples without accessing the target model's architecture and parameters. Traditional defense mechanisms, such as adversarial training, gradient masking, and input transformations, either impose substantial computational costs or compromise the test accuracy of non-adversarial inputs. To address these challenges, we propose an efficient defense mechanism, PuriDefense, that employs random patch-wise purifications with an ensemble of lightweight purification models at a low level of inference cost. These models leverage the local implicit function and rebuild the natural image manifold. Our theoretical analysis suggests that this approach slows down the convergence of query-based attacks by incorporating randomness into purifications. Extensive experiments on CIFAR-10 and ImageNet validate the effectiveness of our proposed purifier-based defense mechanism, demonstrating significant improvements in robustness against query-based attacks.
Visual place recognition (VPR) is an essential component of robot navigation and localization systems that allows them to identify a place using only image data. VPR is challenging due to the significant changes in a place's appearance driven by different daily illumination, seasonal weather variations and diverse viewpoints. Currently, no single VPR technique excels in every environmental condition, each exhibiting unique benefits and shortcomings, and therefore combining multiple techniques can achieve more reliable VPR performance. Present multi-method approaches either rely on online ground-truth information, which is often not available, or on brute-force technique combination, potentially lowering performance with high variance technique sets. Addressing these shortcomings, we propose a VPR system dubbed Multi-Sequential Information Consistency (MuSIC) which leverages sequential information to select the most cohesive technique on an online per-frame basis. For each technique in a set, MuSIC computes their respective sequential consistencies by analysing the frame-to-frame continuity of their top match candidates, which are then directly compared to select the optimal technique for the current query image. The use of sequential information to select between VPR methods results in an overall VPR performance increase across different benchmark datasets, while avoiding the need for extra ground-truth of the runtime environment.
Within (semi-)automated visual inspection, learning-based approaches for assessing visual defects, including deep neural networks, enable the processing of otherwise small defect patterns in pixel size on high-resolution imagery. The emergence of these often rarely occurring defect patterns explains the general need for labeled data corpora. To not only alleviate this issue but to furthermore advance the current state of the art in unsupervised visual inspection, this contribution proposes a DifferNet-based solution enhanced with attention modules utilizing SENet and CBAM as backbone - AttentDifferNet - to improve the detection and classification capabilities on three different visual inspection and anomaly detection datasets: MVTec AD, InsPLAD-fault, and Semiconductor Wafer. In comparison to the current state of the art, it is shown that AttentDifferNet achieves improved results, which are, in turn, highlighted throughout our quantitative as well as qualitative evaluation, indicated by a general improvement in AUC of 94.34 vs. 92.46, 96.67 vs. 94.69, and 90.20 vs. 88.74%. As our variants to AttentDifferNet show great prospects in the context of currently investigated approaches, a baseline is formulated, emphasizing the importance of attention for anomaly detection.
Earth Mover's Distance (EMD) is an important similarity measure between two distributions, used in computer vision and many other application domains. However, its exact calculation is computationally and memory intensive, which hinders its scalability and applicability for large-scale problems. Various approximate EMD algorithms have been proposed to reduce computational costs, but they suffer lower accuracy and may require additional memory usage or manual parameter tuning. In this paper, we present a novel approach, NNS-EMD, to approximate EMD using Nearest Neighbor Search (NNS), in order to achieve high accuracy, low time complexity, and high memory efficiency. The NNS operation reduces the number of data points compared in each NNS iteration and offers opportunities for parallel processing. We further accelerate NNS-EMD via vectorization on GPU, which is especially beneficial for large datasets. We compare NNS-EMD with both the exact EMD and state-of-the-art approximate EMD algorithms on image classification and retrieval tasks. We also apply NNS-EMD to calculate transport mapping and realize color transfer between images. NNS-EMD can be 44x to 135x faster than the exact EMD implementation, and achieves superior accuracy, speedup, and memory efficiency over existing approximate EMD methods.
Recently, the application of deep learning to change detection (CD) has significantly progressed in remote sensing images. In recent years, CD tasks have mostly used architectures such as CNN and Transformer to identify these changes. However, these architectures have shortcomings in representing boundary details and are prone to false alarms and missed detections under complex lighting and weather conditions. For that, we propose a new network, Siamese Meets Diffusion Network (SMDNet). This network combines the Siam-U2Net Feature Differential Encoder (SU-FDE) and the denoising diffusion implicit model to improve the accuracy of image edge change detection and enhance the model's robustness under environmental changes. First, we propose an innovative SU-FDE module that utilizes shared weight features to capture differences between time series images and identify similarities between features to enhance edge detail detection. Furthermore, we add an attention mechanism to identify key coarse features to improve the model's sensitivity and accuracy. Finally, the diffusion model of progressive sampling is used to fuse key coarse features, and the noise reduction ability of the diffusion model and the advantages of capturing the probability distribution of image data are used to enhance the adaptability of the model in different environments. Our method's combination of feature extraction and diffusion models demonstrates effectiveness in change detection in remote sensing images. The performance evaluation of SMDNet on LEVIR-CD, DSIFN-CD, and CDD datasets yields validated F1 scores of 90.99%, 88.40%, and 88.47%, respectively. This substantiates the advanced capabilities of our model in accurately identifying variations and intricate details.
Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.
In this paper, we propose leveraging causal generative learning as an interpretable tool for explaining image classifiers. Specifically, we present a generative counterfactual inference approach to study the influence of visual features (i.e., pixels) as well as causal factors through generative learning. To this end, we first uncover the most influential pixels on a classifier's decision by varying the value of a causal attribute via counterfactual inference and computing both Shapely and contrastive explanations for counterfactual images with these different attribute values. We then establish a Monte-Carlo mechanism using the generator of a causal generative model in order to adapt Shapley explainers to produce feature importances for the human-interpretable attributes of a causal dataset in the case where a classifier has been trained exclusively on the images of the dataset. Finally, we present optimization methods for creating counterfactual explanations of classifiers by means of counterfactual inference, proposing straightforward approaches for both differentiable and arbitrary classifiers. We exploit the Morpho-MNIST causal dataset as a case study for exploring our proposed methods for generating counterfacutl explantions. We employ visual explanation methods from OmnixAI open source toolkit to compare them with our proposed methods. By employing quantitative metrics to measure the interpretability of counterfactual explanations, we find that our proposed methods of counterfactual explanation offer more interpretable explanations compared to those generated from OmnixAI. This finding suggests that our methods are well-suited for generating highly interpretable counterfactual explanations on causal datasets.