With the growth in social media, there is a huge amount of images of faces available on the internet. Often, people use other people's pictures on their own profile. Perceptual hashing is often used to detect whether two images are identical. Therefore, it can be used to detect whether people are misusing others' pictures. In perceptual hashing, a hash is calculated for a given image, and a new test image is mapped to one of the existing hashes if duplicate features are present. Therefore, it can be used as an image filter to flag banned image content or adversarial attacks --which are modifications that are made on purpose to deceive the filter-- even though the content might be changed to deceive the filters. For this reason, it is critical for perceptual hashing to be robust enough to take transformations such as resizing, cropping, and slight pixel modifications into account. In this paper, we would like to propose to experiment with effect of gaussian blurring in perceptual hashing for detecting misuse of personal images specifically for face images. We hypothesize that use of gaussian blurring on the image before calculating its hash will increase the accuracy of our filter that detects adversarial attacks which consist of image cropping, adding text annotation, and image rotation.
Edge offloading for deep neural networks (DNNs) can be adaptive to the input's complexity by using early-exit DNNs. These DNNs have side branches throughout their architecture, allowing the inference to end earlier in the edge. The branches estimate the accuracy for a given input. If this estimated accuracy reaches a threshold, the inference ends on the edge. Otherwise, the edge offloads the inference to the cloud to process the remaining DNN layers. However, DNNs for image classification deals with distorted images, which negatively impact the branches' estimated accuracy. Consequently, the edge offloads more inferences to the cloud. This work introduces expert side branches trained on a particular distortion type to improve robustness against image distortion. The edge detects the distortion type and selects appropriate expert branches to perform the inference. This approach increases the estimated accuracy on the edge, improving the offloading decisions. We validate our proposal in a realistic scenario, in which the edge offloads DNN inference to Amazon EC2 instances.
A key challenge in large-scale image retrieval problems is the trade-off between scalability and accuracy. Recent research has made great strides to improve scalability with compact global image features, and accuracy with local image features. In this work, our main contribution is to unify global and local image features into a single deep model, enabling scalable retrieval with high accuracy. We refer to the new model as DELG, standing for DEep Local and Global features. We leverage lessons from recent feature learning work and propose a model that combines generalized mean pooling for global features and attentive selection for local features. The entire network can be learned end-to-end by carefully balancing the gradient flow between two heads -- requiring only image-level labels. We also introduce an autoencoder-based dimensionality reduction technique for local features, which is integrated into the model, improving training efficiency and matching performance. Experiments on the Revisited Oxford and Paris datasets demonstrate that our jointly learned ResNet-50 based features outperform all previous results using deep global features (most with heavier backbones) and those that further re-rank with local features. Code and models will be released.
Point cloud is an efficient way of representing and storing 3D geometric data. Deep learning algorithms on point clouds are time and memory efficient. Several methods such as PointNet and FoldingNet have been proposed for processing point clouds. This work proposes an autoencoder based framework to generate robust embeddings for point clouds by utilizing hierarchical information using graph convolution. We perform multiple experiments to assess the quality of embeddings generated by the proposed encoder architecture and visualize the t-SNE map to highlight its ability to distinguish between different object classes. We further demonstrate the applicability of the proposed framework in applications like: 3D point cloud completion and Single image based 3D reconstruction.
With the identity information in face data more closely related to personal credit and property security, people pay increasing attention to the protection of face data privacy. In different tasks, people have various requirements for face de-identification (De-ID), so we propose a systematical solution compatible for these De-ID operations. Firstly, an attribute disentanglement and generative network is constructed to encode two parts of the face, which are the identity (facial features like mouth, nose and eyes) and expression (including expression, pose and illumination). Through face swapping, we can remove the original ID completely. Secondly, we add an adversarial vector mapping network to perturb the latent code of the face image, different from previous traditional adversarial methods. Through this, we can construct unrestricted adversarial image to decrease ID similarity recognized by model. Our method can flexibly de-identify the face data in various ways and the processed images have high image quality.
Most image super-resolution (SR) methods are developed on synthetic low-resolution (LR) and high-resolution (HR) image pairs that are constructed by a predetermined operation, e.g., bicubic downsampling. As existing methods typically learn an inverse mapping of the specific function, they produce blurry results when applied to real-world images whose exact formulation is different and unknown. Therefore, several methods attempt to synthesize much more diverse LR samples or learn a realistic downsampling model. However, due to restrictive assumptions on the downsampling process, they are still biased and less generalizable. This study proposes a novel method to simulate an unknown downsampling process without imposing restrictive prior knowledge. We propose a generalizable low-frequency loss (LFL) in the adversarial training framework to imitate the distribution of target LR images without using any paired examples. Furthermore, we design an adaptive data loss (ADL) for the downsampler, which can be adaptively learned and updated from the data during the training loops. Extensive experiments validate that our downsampling model can facilitate existing SR methods to perform more accurate reconstructions on various synthetic and real-world examples than the conventional approaches.
We consider the two-dimensional multi-target detection problem of recovering a target image from a noisy measurement that contains multiple copies of the image, each randomly rotated and translated. Motivated by the structure reconstruction problem in single-particle cryo-electron microscopy, we focus on the high noise regime, where the noise hampers accurate detection of the image occurrences. We develop an autocorrelation analysis framework to estimate the image directly from a measurement with an arbitrary spacing distribution of image occurrences, bypassing the estimation of individual locations and rotations. We conduct extensive numerical experiments, and demonstrate image recovery in highly noisy environments. The code to reproduce all numerical experiments is publicly available at https://github.com/krshay/MTD-2D.
Lighting plays a central role in conveying the essence and depth of the subject in a portrait photograph. Professional photographers will carefully control the lighting in their studio to manipulate the appearance of their subject, while consumer photographers are usually constrained to the illumination of their environment. Though prior works have explored techniques for relighting an image, their utility is usually limited due to requirements of specialized hardware, multiple images of the subject under controlled or known illuminations, or accurate models of geometry and reflectance. To this end, we present a system for portrait relighting: a neural network that takes as input a single RGB image of a portrait taken with a standard cellphone camera in an unconstrained environment, and from that image produces a relit image of that subject as though it were illuminated according to any provided environment map. Our method is trained on a small database of 18 individuals captured under different directional light sources in a controlled light stage setup consisting of a densely sampled sphere of lights. Our proposed technique produces quantitatively superior results on our dataset's validation set compared to prior works, and produces convincing qualitative relighting results on a dataset of hundreds of real-world cellphone portraits. Because our technique can produce a 640 $\times$ 640 image in only 160 milliseconds, it may enable interactive user-facing photographic applications in the future.
Segmentation of anatomical regions of interest such as vessels or small lesions in medical images is still a difficult problem that is often tackled with manual input by an expert. One of the major challenges for this task is that the appearance of foreground (positive) regions can be similar to background (negative) regions. As a result, many automatic segmentation algorithms tend to exhibit asymmetric errors, typically producing more false positives than false negatives. In this paper, we aim to leverage this asymmetry and train a diverse ensemble of models with very high recall, while sacrificing their precision. Our core idea is straightforward: A diverse ensemble of low precision and high recall models are likely to make different false positive errors (classifying background as foreground in different parts of the image), but the true positives will tend to be consistent. Thus, in aggregate the false positive errors will cancel out, yielding high performance for the ensemble. Our strategy is general and can be applied with any segmentation model. In three different applications (carotid artery segmentation in a neck CT angiography, myocardium segmentation in a cardiovascular MRI and multiple sclerosis lesion segmentation in a brain MRI), we show how the proposed approach can significantly boost the performance of a baseline segmentation method.
Recently simulation methods have been developed for optical tactile sensors to enable the Sim2Real learning, i.e., firstly training models in simulation before deploying them on the real robot. However, some artefacts in the real objects are unpredictable, such as imperfections caused by fabrication processes, or scratches by the natural wear and tear, and thus cannot be represented in the simulation, resulting in a significant gap between the simulated and real tactile images. To address this Sim2Real gap, we propose a novel texture generation network that maps the simulated images into photorealistic tactile images that resemble a real sensor contacting a real imperfect object. Each simulated tactile image is first divided into two types of regions: areas that are in contact with the object and areas that are not. The former is applied with generated textures learned from real textures in the real tactile images, whereas the latter maintains its appearance as when the sensor is not in contact with any object. This makes sure that the artefacts are only applied to the deformed regions of the sensor. Our extensive experiments show that the proposed texture generation network can generate these realistic artefacts on the deformed regions of the sensor, while avoiding leaking the textures into areas of no contact. Quantitative experiments further reveal that when using the adapted images generated by our proposed network for a Sim2Real classification task, the drop in accuracy caused by the Sim2Real gap is reduced from 38.43% to merely 0.81%. As such, this work has the potential to accelerate the Sim2Real learning for robotic tasks requiring tactile sensing.