Retrieving images based on their content is one of the most studied topics in the field of computer vision. Nowadays, this problem can be addressed using modern techniques such as feature extraction with machine learning, but over the years different classical methods have been developed. In this paper, we implement a query-by-example retrieval system for finding paintings in a museum image collection using classic computer vision techniques. Specifically, we study the performance of color, texture, text, and feature descriptors on datasets with different image perturbations: noise, overlapping text boxes, color corruption, and rotation. We evaluate each case using the Mean Average Precision (MAP) metric and obtain results that vary between 0.5 and 1.0 depending on the problem conditions.
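For reference, the sketch below shows one common way to compute the Mean Average Precision metric used in the evaluation above, assuming each query yields a ranked list of retrieved painting IDs and a set of relevant ground-truth IDs; the data structures and the cutoff k are illustrative, not the authors' code.

```python
def average_precision(ranked_ids, relevant_ids, k=10):
    """Average precision at k for a single query (ranked_ids: list, relevant_ids: set)."""
    if not relevant_ids:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(ranked_ids[:k]):
        if item in relevant_ids:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    return score / min(len(relevant_ids), k)

def mean_average_precision(all_rankings, all_relevant, k=10):
    """MAP over a list of queries, each with its own ranking and ground truth."""
    aps = [average_precision(r, g, k) for r, g in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps)
```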
Recent contrastive representation learning methods rely on estimating mutual information (MI) between multiple views of an underlying context. For example, we can derive multiple views of a given image by applying data augmentation, or we can split a sequence into views comprising the past and future of some step in the sequence. Contrastive lower bounds on MI are easy to optimize, but have a strong underestimation bias when estimating large amounts of MI. We propose decomposing the full MI estimation problem into a sum of smaller estimation problems by splitting one of the views into progressively more informed subviews and by applying the chain rule on MI between the decomposed views. This expression contains a sum of unconditional and conditional MI terms, each measuring modest chunks of the total MI, which facilitates approximation via contrastive bounds. To maximize the sum, we formulate a contrastive lower bound on the conditional MI which can be approximated efficiently. We refer to our general approach as Decomposed Estimation of Mutual Information (DEMI). We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and that it learns better representations in a vision domain and for dialogue generation.
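As context for the bounds discussed above, the following is a minimal sketch of a standard contrastive (InfoNCE) lower bound on the MI between two batches of paired view embeddings; the DEMI decomposition into conditional terms is not reproduced here, and the temperature and normalization choices are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of paired views; returns a scalar MI lower bound."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) similarity scores
    labels = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    # log K minus the cross-entropy loss lower-bounds I(view1; view2)
    return math.log(z1.size(0)) - F.cross_entropy(logits, labels)
```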
Camera localization aims to estimate 6 DoF camera poses from RGB images. Traditional methods detect and match interest points between a query image and a pre-built 3D model. Recent learning-based approaches encode scene structures into a specific convolutional neural network (CNN) and are thus able to predict dense coordinates from RGB images. However, most of them require re-training or re-adaptation for a new scene and have difficulties in handling large-scale scenes due to limited network capacity. We present a new method for scene-agnostic camera localization using dense scene matching (DSM), where a cost volume is constructed between a query image and a scene. The cost volume and the corresponding coordinates are processed by a CNN to predict dense coordinates. Camera poses can then be solved by PnP algorithms. In addition, our method can be extended to the temporal domain, which leads to an extra performance boost at test time. Our scene-agnostic approach achieves accuracy comparable to existing scene-specific approaches, such as KFNet, on the 7Scenes and Cambridge benchmarks. It also remarkably outperforms the state-of-the-art scene-agnostic dense coordinate regression network SANet. The code is available at https://github.com/Tangshitao/Dense-Scene-Matching.
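The final pose-solving step mentioned above can be illustrated with OpenCV's RANSAC PnP solver, assuming dense per-pixel scene coordinates and confidences predicted by the network; the confidence filtering and thresholds below are assumptions rather than the released implementation.

```python
import cv2
import numpy as np

def solve_pose(pixels_2d, coords_3d, confidence, K, thresh=0.9):
    """pixels_2d: (N, 2) image points, coords_3d: (N, 3) predicted scene points,
    confidence: (N,) prediction confidences, K: (3, 3) camera intrinsics."""
    keep = confidence > thresh                        # drop low-confidence predictions
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        coords_3d[keep].astype(np.float64),
        pixels_2d[keep].astype(np.float64),
        K, None, reprojectionError=3.0, iterationsCount=1000)
    R, _ = cv2.Rodrigues(rvec)                        # rotation matrix from axis-angle
    return ok, R, tvec
```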
Dictionary learning, which represents data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The growing popularity of neurally plausible unfolded sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. This paper offers the first theoretical proof of these empirical results through PUDLE, a Provable Unfolded Dictionary LEarning method. We highlight the impact of loss, unfolding, and backpropagation on convergence. We discover an implicit acceleration: as a function of unfolding, the backpropagated gradient converges faster and is more accurate than the gradient from alternating minimization. We complement our findings with synthetic and image denoising experiments. The findings support the use of accelerated deep learning optimizers and unfolded networks for dictionary learning.
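A minimal sketch of an unfolded sparse coding network of the kind analyzed here is given below: a fixed number of ISTA iterations share one learnable dictionary, and backpropagating through the unrolled iterations updates that dictionary. Hyperparameters, initialization, and the module name are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class UnfoldedISTA(nn.Module):
    """Unrolled ISTA sparse coding with a learnable dictionary D (illustrative)."""
    def __init__(self, data_dim, num_atoms, num_unfold=20, lam=0.1):
        super().__init__()
        self.D = nn.Parameter(torch.randn(data_dim, num_atoms) * 0.1)  # dictionary
        self.num_unfold, self.lam = num_unfold, lam

    def forward(self, x):                                  # x: (batch, data_dim)
        L = torch.linalg.matrix_norm(self.D, ord=2) ** 2   # Lipschitz constant of the gradient
        z = torch.zeros(x.size(0), self.D.size(1), device=x.device)
        for _ in range(self.num_unfold):                   # unrolled ISTA iterations
            v = z - (z @ self.D.t() - x) @ self.D / L      # gradient step on 0.5||Dz - x||^2
            z = torch.sign(v) * torch.clamp(v.abs() - self.lam / L, min=0)  # soft threshold
        return z, z @ self.D.t()                           # sparse code and reconstruction
```

Training would minimize a reconstruction loss on the returned estimate so that gradients flow back into D through all unfolded iterations.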
This research presents an improved real-time face recognition system at a low resolution of 15 pixels under pose, emotion, and resolution variations. We have designed our own datasets, named LRD200 and LRD100, which have been used for training and classification. The face detection part uses the Viola-Jones algorithm, and the face recognition part receives the face image from the face detection part and processes it using the Local Binary Pattern Histogram (LBPH) algorithm, with preprocessing using contrast limited adaptive histogram equalization (CLAHE) and face alignment. The face database in this system can be updated via our custom-built standalone Android app, which automatically restarts the training and recognition process with the updated database. Using our proposed algorithm, real-time face recognition accuracies of 78.40% at 15 px and 98.05% at 45 px have been achieved using the LRD200 database containing 200 images per person. With 100 images per person in the database (LRD100), the achieved accuracies are 60.60% at 15 px and 95% at 45 px, respectively. A facial deflection of about 30 degrees on either side from the frontal face showed an average face recognition precision of 72.25%-81.85%. This face recognition system can be employed for law enforcement purposes, where the surveillance camera captures a low-resolution image because of the distance of a person from the camera. It can also be used as a surveillance system in airports, bus stations, etc., to reduce the risk of possible criminal threats.
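A minimal sketch of the described detection and recognition pipeline with OpenCV is shown below (Viola-Jones cascade, CLAHE preprocessing, LBPH recognition; requires opencv-contrib-python). Face alignment and the Android database-update mechanism are omitted, and the recognizer is assumed to have been trained beforehand on the LRD200/LRD100 images via recognizer.train(); thresholds and the 15 px resizing are illustrative.

```python
import cv2

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
recognizer = cv2.face.LBPHFaceRecognizer_create()   # trained offline on the face database

def preprocess(gray_face, size=15):
    face = cv2.resize(gray_face, (size, size))      # simulate the low-resolution input
    return clahe.apply(face)                        # contrast-limited histogram equalization

def recognize(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    results = []
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):   # Viola-Jones detection
        label, confidence = recognizer.predict(preprocess(gray[y:y+h, x:x+w]))
        results.append((label, confidence, (x, y, w, h)))
    return results
```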
With the recent surge of research on vision transformers, they have demonstrated remarkable potential for various challenging computer vision applications, such as image recognition, point cloud classification, and video understanding. In this paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset. Specifically, we explore training techniques for video vision transformers, such as augmentations, resolutions, and initialization. With our training recipe, a single ViViT model achieves 47.4% on the validation set of the EPIC-KITCHENS-100 dataset, outperforming what is reported in the original paper by 3.4%. We found that video transformers are especially good at predicting the noun in the verb-noun action prediction task. This makes the overall action prediction accuracy of video transformers notably higher than that of convolutional ones. Surprisingly, even the best video transformers underperform convolutional networks on verb prediction. Therefore, we combine the video vision transformers with some convolutional video networks and present our solution to the EPIC-KITCHENS-100 Action Recognition competition.
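A minimal sketch of the kind of transformer/CNN combination described above is a late fusion of per-head probabilities; the fusion weight and model interfaces are assumptions and not the competition entry itself.

```python
import torch

def ensemble_action_logits(vivit_logits, cnn_logits, w_transformer=0.6):
    """Each input is a dict with 'verb' (B, 97) and 'noun' (B, 300) logits,
    matching the 97 verb and 300 noun classes of EPIC-KITCHENS-100."""
    fused = {}
    for head in ("verb", "noun"):
        fused[head] = (w_transformer * vivit_logits[head].softmax(-1)
                       + (1 - w_transformer) * cnn_logits[head].softmax(-1))
    return fused   # an action score can then be formed from the verb and noun probabilities
```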
Facial actions are spatio-temporal signals by nature, and therefore their modeling is crucially dependent on the availability of temporal information. In this paper, we focus on inferring such temporal dynamics of facial actions when no explicit temporal information is available, i.e., from still images. We present a novel approach to capture multiple scales of such temporal dynamics, with an application to facial Action Unit (AU) intensity estimation and dimensional affect estimation. In particular, 1) we propose a framework that infers a dynamic representation (DR) from a still image, which captures the bi-directional flow of time within a short time-window centered at the input image; 2) we show that we can train our method without the need to explicitly generate target representations, allowing the network to represent dynamics more broadly; and 3) we propose to apply a multiple temporal scale approach that infers DRs for different window lengths (MDR) from a still image. We empirically validate the value of our approach on the task of frame ranking, and show how our proposed MDR attains state-of-the-art results on BP4D for AU intensity estimation and on SEMAINE for dimensional affect estimation, using only still images at test time.
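As an illustration only, one way a multi-scale dynamic representation module could be organized is a shared still-image encoder feeding one head per temporal window length; the backbone, dimensions, and window lengths below are assumptions, not the authors' architecture.

```python
import torch.nn as nn

class MDRHead(nn.Module):
    """Illustrative multi-scale dynamic representation heads over a shared encoder."""
    def __init__(self, encoder, feat_dim=512, dr_dim=128, window_lengths=(3, 5, 9)):
        super().__init__()
        self.encoder = encoder                         # any still-image CNN backbone
        self.heads = nn.ModuleDict({
            str(w): nn.Linear(feat_dim, dr_dim) for w in window_lengths
        })

    def forward(self, image):                          # image: (B, 3, H, W)
        feat = self.encoder(image)                     # (B, feat_dim) still-image features
        # one dynamic representation per temporal window length
        return {w: head(feat) for w, head in self.heads.items()}
```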
The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted increasing attention. Many algorithms have been proposed to craft powerful adversarial examples. However, these algorithms modify global or local regions of pixels without taking network explanations into account. Hence, the perturbations are redundant and easily detected by human eyes. In this paper, we propose a novel method to generate local region perturbations. The main idea is to find the contributing feature regions (CFRs) of images based on network explanations for perturbations. Owing to the network explanations, the perturbations added to the CFRs are more effective than those in other regions. In our method, a soft mask matrix is designed to represent the CFRs and finely characterize the contribution of each pixel. Based on this soft mask, we develop a new objective function with an inverse temperature to search for optimal perturbations in the CFRs. Extensive experiments are conducted on CIFAR-10 and ILSVRC2012, which demonstrate the effectiveness of our method in terms of attack success rate, imperceptibility, and transferability.
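A minimal sketch of optimizing a perturbation restricted to contributing feature regions via a soft mask is given below; the source of the mask (e.g., an explanation/saliency map), the optimizer, the step count, and the regularization weight are assumptions, and the paper's inverse-temperature objective is not reproduced.

```python
import torch
import torch.nn.functional as F

def cfr_attack(model, x, y_true, soft_mask, steps=100, lr=0.01, c=0.1):
    """x: (1, 3, H, W) input in [0, 1]; soft_mask: (1, 1, H, W) in [0, 1],
    e.g. derived from an explanation map; y_true: (1,) ground-truth label."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = torch.clamp(x + soft_mask * delta, 0, 1)   # perturb only the CFRs
        # maximize the classification loss while keeping the perturbation small
        loss = -F.cross_entropy(model(x_adv), y_true) + c * delta.norm()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.clamp(x + soft_mask * delta.detach(), 0, 1)
```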
Deep learning methods have achieved promising performance in many areas, but they still struggle with noisily labeled images during the training process. Considering that annotation quality indispensably relies on great expertise, the problem is even more crucial in the medical image domain. How to eliminate the disturbance from noisy labels for segmentation tasks without further annotations is still a significant challenge. In this paper, we introduce a label quality evaluation strategy for deep neural networks that automatically assesses the quality of each label, which is not explicitly provided, and trains on cleanly annotated ones. We propose a solution in which the network automatically evaluates the relative quality of the labels in the training set and uses the good ones to tune the network parameters. We also design an overfitting control module to let the network maximally learn from the precise annotations during the training process. Experiments on a public biomedical image segmentation dataset show that the method outperforms baseline methods and retains both high accuracy and good generalization at different noise levels.
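The abstract does not specify the evaluation mechanism, so the sketch below uses a common small-loss heuristic as a stand-in: rank samples by per-sample segmentation loss and update the network only on the lowest-loss (presumed clean) fraction. All names and the keep ratio are assumptions, not the paper's exact module.

```python
import torch

def select_clean_and_update(model, optimizer, images, noisy_masks, criterion, keep_ratio=0.7):
    """criterion is expected to use reduction='none', e.g. nn.CrossEntropyLoss(reduction='none')."""
    pixel_losses = criterion(model(images), noisy_masks)   # (B, H, W) per-pixel losses
    sample_losses = pixel_losses.flatten(1).mean(dim=1)    # one loss per training sample
    k = max(1, int(keep_ratio * sample_losses.numel()))
    keep = sample_losses.topk(k, largest=False).indices    # presumed clean labels
    loss = sample_losses[keep].mean()                       # update only on "good" labels
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item(), keep
```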
Object classification is a significant task in computer vision. It has become an active research area as an important aspect of image processing and a building block of image localization, detection, and scene parsing. Object classification from low-quality images is difficult due to the variance of object colors, aspect ratios, and cluttered backgrounds. The field of object classification has seen remarkable advancements with the development of deep convolutional neural networks (DCNNs). Deep neural networks have been demonstrated to be very powerful systems for facing the challenge of object classification from high-resolution images, but deploying such object classification networks on embedded devices remains challenging due to the high computational and memory requirements. Using high-quality images often causes high computational and memory complexity, whereas low-quality images can alleviate this issue. Hence, in this paper, we investigate an optimal architecture that accurately classifies low-quality images using DCNN architectures. To validate different baselines on low-quality images, we perform experiments using webcam-captured image datasets of 10 different objects. In this research work, we evaluate the proposed architecture by implementing popular CNN architectures. The experimental results validate that the MobileNet architecture performs better than most of the available CNN architectures for low-resolution webcam image datasets.
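A minimal sketch of adapting a pretrained MobileNet to the 10-class low-quality webcam setting is shown below; the MobileNetV2 variant, input resolution, and normalization constants are assumptions beyond what the text states.

```python
import torch.nn as nn
from torchvision import models, transforms

def build_low_quality_classifier(num_classes=10):
    # ImageNet-pretrained MobileNetV2 with a new classification head for the 10 objects
    model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),   # upsample low-resolution webcam frames
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    return model, preprocess
```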