We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model the video distribution in the latent space. To model the substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results comparable to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.
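To make the token-count argument concrete, the following minimal sketch counts spatio-temporal tokens and compares the number of query-key pairs under joint attention with the number under a generic spatial/temporal factorization. The clip length, latent resolution, and patch size are hypothetical values chosen for illustration, and the factorization shown is an assumption, not necessarily any one of the four Latte variants.

```python
def token_counts(frames, height, width, patch):
    """Return (tokens per frame, total tokens) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("latent height and width must be divisible by the patch size")
    per_frame = (height // patch) * (width // patch)
    return per_frame, frames * per_frame


def attention_pairs(frames, per_frame):
    """Count query-key pairs for joint vs. factorized attention."""
    # Joint attention: every token attends to every token in the clip.
    joint = (frames * per_frame) ** 2
    # Spatial attention: tokens attend only within their own frame.
    spatial = frames * per_frame ** 2
    # Temporal attention: tokens attend only across frames at the same location.
    temporal = per_frame * frames ** 2
    return joint, spatial + temporal


# Hypothetical setting: 16 latent frames of size 32 x 32, patch size 2.
per_frame, total = token_counts(frames=16, height=32, width=32, patch=2)
joint, factorized = attention_pairs(16, per_frame)
print(per_frame, total, joint, factorized)
```

Under these assumed sizes, the factorized count is roughly fifteen times smaller than the joint count, which illustrates why decomposing the spatial and temporal dimensions is attractive when the token count is large.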
Iris recognition systems, operating in the near infrared spectrum (NIR), have demonstrated vulnerability to presentation attacks, where an adversary uses artifacts such as cosmetic contact lenses, artificial eyes or printed iris images to circumvent the system. At the same time, a number of effective presentation attack detection (PAD) methods have been developed. These methods have demonstrated success in detecting artificial eyes (e.g., fake Van Dyke eyes) as presentation attacks. In this work, we seek to alter the optical characteristics of artificial eyes by affixing Vanadium Dioxide (VO2) films on their surface in various spatial configurations. VO2 films can be used to selectively transmit NIR light and can, therefore, be used to regulate the amount of NIR light from the object that is captured by the iris sensor. We study the impact of the resulting sensor images on two state-of-the-art iris PA detection methods. We observe that the addition of VO2 films on the surface of artificial eyes can cause the PA detection methods to misclassify them as bona fide eyes in some cases. This represents a vulnerability that must be systematically analyzed and effectively addressed.
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) synthesize visually realistic and temporally coherent videos and b) preserve the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attention, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
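As an illustration of the rotary positional encoding mentioned above, the following minimal sketch rotates consecutive pairs of vector components by an angle proportional to the frame index. The interleaved pairing, the base of 10000, and the helper names `rope` and `dot` are assumptions that follow the common formulation; the abstract does not specify how LaVie applies the encoding inside its temporal layers.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply rotary positional encoding to an even-length vector at position pos."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Each pair (vec[i], vec[i + 1]) is rotated at its own frequency.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors for two frames of a clip.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the frame offset (here 2), not on
# the absolute frame indices.
print(dot(rope(q, 3), rope(k, 5)))
print(dot(rope(q, 10), rope(k, 12)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes this encoding a natural fit for temporal self-attention.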
Cross-spectral face recognition (CFR) is aimed at recognizing individuals, where compared face images stem from different sensing modalities, for example infrared vs. visible. While CFR is inherently more challenging than classical face recognition due to significant variation in facial appearance associated with a modality gap, it is superior in scenarios with limited or challenging illumination, as well as in the presence of presentation attacks. Recent advances in artificial intelligence related to convolutional neural networks (CNNs) have brought to the fore a significant performance improvement in CFR. Motivated by this, the contributions of this survey are three-fold. Firstly, we provide an overview of CFR, targeted to compare face images captured in different spectra, by formalizing CFR and then presenting concrete related applications. Secondly, we explore suitable spectral bands for recognition and discuss recent CFR methods, placing emphasis on deep neural networks. In particular, we revisit techniques that have been proposed to extract and compare heterogeneous features, as well as datasets. We enumerate strengths and limitations of different spectra and associated algorithms. Finally, we discuss research challenges and future lines of research.
The majority of adversarial attack techniques perform well against deep face recognition when the full knowledge of the system is revealed (\emph{white-box}). However, such techniques are unsuccessful in the gray-box setting where the face templates are unknown to the attackers. In this work, we propose a similarity-based gray-box adversarial attack (SGADV) technique with a newly developed objective function. SGADV utilizes the dissimilarity score to produce the optimized adversarial example, i.e., similarity-based adversarial attack. This technique applies to both white-box and gray-box attacks against authentication systems that determine genuine or imposter users using the dissimilarity score. To validate the effectiveness of SGADV, we conduct extensive experiments on the LFW, CelebA, and CelebA-HQ face datasets against the FaceNet and InsightFace deep face recognition models in both white-box and gray-box settings. The results suggest that the proposed method significantly outperforms the existing adversarial attack techniques in the gray-box setting. We therefore conclude that similarity-based approaches to developing the adversarial example could satisfactorily cater to the gray-box attack scenarios for de-authentication.
Convolutional Neural Networks (CNNs) are being increasingly used to address the problem of iris presentation attack detection. In this work, we propose attention-guided iris presentation attack detection (AG-PAD) to augment CNNs with attention mechanisms. Two types of attention modules are independently appended on top of the last convolutional layer of the backbone network. Specifically, the channel attention module is used to model the inter-channel relationship between features, while the position attention module is used to model the inter-spatial relationship between features. An element-wise sum is employed to fuse these two attention modules. Further, a novel hierarchical attention mechanism is introduced. Experiments involving both a JHU-APL proprietary dataset and the benchmark LivDet-Iris-2017 dataset suggest that the proposed method achieves promising results. To the best of our knowledge, this is the first work that exploits the use of attention mechanisms in iris presentation attack detection.
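The two attention branches and their element-wise fusion can be sketched as follows. This is a simplified, parameter-free illustration: it assumes similarity-based attention computed directly on a C x N feature map (C channels, N flattened spatial positions) and omits any learned projections, scaling factors, residual connections, and the hierarchical mechanism, none of which are specified in the abstract. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def position_attention(x):
    """Re-express each spatial position as a weighted mix of all positions."""
    weights = [softmax(r) for r in matmul(transpose(x), x)]  # N x N
    return matmul(x, transpose(weights))  # C x N


def channel_attention(x):
    """Re-express each channel as a weighted mix of all channels."""
    weights = [softmax(r) for r in matmul(x, transpose(x))]  # C x C
    return matmul(weights, x)  # C x N


def fuse(x):
    """Element-wise sum of the two attention branches."""
    p, c = position_attention(x), channel_attention(x)
    return [[a + b for a, b in zip(rp, rc)] for rp, rc in zip(p, c)]


# Hypothetical feature map with 2 channels and 3 spatial positions.
features = [[0.2, 1.0, -0.5], [0.7, -0.3, 0.4]]
print(fuse(features))
```

Both branches preserve the C x N shape of the input, which is what allows them to be fused by a plain element-wise sum.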
Launched in 2013, LivDet-Iris is an international competition series open to academia and industry with the aim of assessing and reporting advances in iris Presentation Attack Detection (PAD). This paper presents results from the fourth competition of the series: LivDet-Iris 2020. This year's competition introduced several novel elements: (a) it incorporated new types of attacks (samples displayed on a screen, cadaver eyes and prosthetic eyes), (b) it initiated LivDet-Iris as an ongoing effort, with a testing protocol now available to everyone via the Biometrics Evaluation and Testing (BEAT) (https://www.idiap.ch/software/beat/) open-source platform to facilitate reproducibility and continuous benchmarking of new algorithms, and (c) it compared the performance of the submitted entries with three baseline methods (offered by the University of Notre Dame and Michigan State University) and three open-source iris PAD methods available in the public domain. The best performing entry to the competition reported a weighted average APCER of 59.10\% and a BPCER of 0.46\% over all five attack types. This paper serves as the latest evaluation of iris PAD on a large spectrum of presentation attack instruments.
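For readers unfamiliar with the two error rates, the following minimal sketch computes them from classifier decisions. APCER is the fraction of attack presentations wrongly classified as bona fide, and BPCER is the fraction of bona fide presentations wrongly classified as attacks. Weighting the per-type APCER by the number of samples of each attack type is an assumption made here for illustration, and the label strings, attack-type names, and decisions are hypothetical, not competition data.

```python
def apcer(attack_predictions):
    """Fraction of attack samples wrongly accepted as bona fide."""
    return sum(p == "bonafide" for p in attack_predictions) / len(attack_predictions)


def bpcer(bonafide_predictions):
    """Fraction of bona fide samples wrongly rejected as attacks."""
    return sum(p == "attack" for p in bonafide_predictions) / len(bonafide_predictions)


def weighted_apcer(per_type):
    """Average per-type APCER, weighted by sample count (an assumed weighting)."""
    total = sum(len(preds) for preds in per_type.values())
    return sum(apcer(preds) * len(preds) for preds in per_type.values()) / total


# Hypothetical decisions for two attack types and a set of bona fide samples.
attacks = {
    "printout": ["attack", "bonafide"],
    "contact_lens": ["attack"] * 6 + ["bonafide"] * 2,
}
genuine = ["bonafide", "bonafide", "bonafide", "attack"]

print(weighted_apcer(attacks), bpcer(genuine))
```

A low BPCER paired with a high APCER, as in the reported best entry, means that genuine eyes are rarely rejected while many attack samples still pass as bona fide.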
The need for reliably determining the identity of a person is critical in a number of different domains ranging from personal smartphones to border security; from autonomous vehicles to e-voting; from tracking child vaccinations to preventing human trafficking; from crime scene investigation to personalization of customer service. Biometrics, which entails the use of biological attributes such as face, fingerprints and voice for recognizing a person, is being increasingly used in several such applications. While biometric technology has made rapid strides over the past decade, there are several fundamental issues that are yet to be satisfactorily resolved. In this article, we will discuss some of these issues and enumerate some of the exciting challenges in this field.
Designing face recognition systems that are capable of matching face images obtained in the thermal spectrum with those obtained in the visible spectrum is a challenging problem. In this work, we propose the use of a semantic-guided generative adversarial network (SG-GAN) to automatically synthesize visible face images from their thermal counterparts. Specifically, semantic labels, extracted by a face parsing network, are used to compute a semantic loss function to regularize the adversarial network during training. These semantic cues denote high-level facial component information associated with each pixel. Further, an identity extraction network is leveraged to generate multi-scale features to compute an identity loss function. To achieve photo-realistic results, a perceptual loss function is introduced during network training to ensure that the synthesized visible face is perceptually similar to the target visible face image. We extensively evaluate the benefits of individual loss functions, and combine them effectively to learn the mapping from thermal to visible face images. Experiments involving two multispectral face datasets show that the proposed method achieves promising results in both face synthesis and cross-spectral face matching.