Deep learning inference brings together the data and the Convolutional Neural Network (CNN). This is problematic when the user wants to preserve the privacy of the data and the service provider does not want to reveal the weights of its CNN. Secure Inference allows the two parties to engage in a protocol that preserves their respective privacy concerns, while revealing only the inference result to the user. This is known as Multi-Party Computation (MPC). A major bottleneck of MPC algorithms is communication, as the parties must send data back and forth. The linear component of a CNN (i.e., convolutions) can be computed efficiently with minimal communication, but the non-linear part (i.e., ReLU) requires the bulk of the communication bandwidth. We propose two ways to accelerate Secure Inference. The first is based on the observation that the ReLU outcomes of many convolutions are highly correlated. Therefore, we replace the per-pixel ReLU operation with a per-patch ReLU operation. Each layer in the network benefits from a patch of a different size, and we devise an algorithm that chooses the optimal set of patch sizes through a novel reduction of the problem to a knapsack problem. The second way to accelerate Secure Inference is based on cutting the number of bit comparisons required for a secure ReLU operation. We demonstrate the cumulative effect of these tools in the semi-honest secure 3-party setting on four problems: classifying ImageNet with a ResNet50 backbone, classifying CIFAR100 with a ResNet18 backbone, semantic segmentation of ADE20K with a MobileNetV2 backbone, and semantic segmentation of Pascal VOC 2012 with a ResNet50 backbone. Our source code is publicly available: $\href{https://github.com/yg320/secure_inference}{\text{https://github.com/yg320/secure_inference}}$
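To make the per-patch ReLU concrete, here is a minimal plaintext sketch (with none of the MPC machinery): one sign decision is computed per patch and broadcast to all pixels in that patch, so a $k \times k$ patch needs $k^2\times$ fewer secure comparisons than per-pixel ReLU. Deriving the per-patch decision from the patch mean is our illustrative assumption, not necessarily the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def block_relu(x, patch_size):
    """Apply one ReLU sign decision per patch instead of per pixel.

    x: activation tensor of shape (N, C, H, W); H and W are assumed to be
    divisible by patch_size for simplicity. The per-patch decision is taken
    from the patch mean here (an illustrative choice). In the secure setting,
    the comparison (DReLU) is the expensive step; the multiplication by the
    mask would be a cheap secure multiplexer.
    """
    p = patch_size
    patch_mean = F.avg_pool2d(x, kernel_size=p)          # one value per patch
    mask = (patch_mean > 0).float()                      # one sign bit per patch
    mask = F.interpolate(mask, scale_factor=p, mode="nearest")  # broadcast back
    return x * mask

# Example: a 1x8x32x32 activation with 4x4 patches needs 16x fewer
# secure comparisons than a per-pixel ReLU.
x = torch.randn(1, 8, 32, 32)
y = block_relu(x, patch_size=4)
```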
We propose SampleDepth, a Convolutional Neural Network (CNN) suited for an adaptive LiDAR. Typically, a LiDAR sampling strategy is pre-defined, constant, and independent of the observed scene. Instead of letting the LiDAR sample the scene in this agnostic fashion, SampleDepth determines, adaptively, where it is best to sample the current frame. To do that, SampleDepth uses depth samples from previous time steps to predict a sampling mask for the current frame. Crucially, SampleDepth is trained to optimize the performance of a downstream depth completion task. SampleDepth is evaluated with two different depth completion networks on two LiDAR datasets: KITTI Depth Completion and the newly introduced synthetic dataset SHIFT. We show that SampleDepth is effective and suitable for different depth completion downstream tasks.
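The following is a toy sketch of the adaptive-sampling idea: a small network scores every pixel from the previous frame's sparse depth, and the top-scoring pixels under the sampling budget form the binary mask. The architecture and the hard top-k selection are our assumptions; training end-to-end against the depth completion loss would additionally require a differentiable relaxation of the selection.

```python
import torch
import torch.nn as nn

class SampleNet(nn.Module):
    """Toy stand-in for SampleDepth: predicts per-pixel sampling scores from
    the previous frame's sparse depth. Architecture and hard top-k selection
    are illustrative assumptions, not the paper's exact design."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, prev_depth, budget):
        scores = self.net(prev_depth)            # (N, 1, H, W) sampling scores
        flat = scores.flatten(1)                 # (N, H*W)
        idx = flat.topk(budget, dim=1).indices   # best `budget` pixels per frame
        mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
        return mask.view_as(scores)              # binary sampling mask

prev_depth = torch.rand(1, 1, 64, 256)           # previous LiDAR frame (toy)
mask = SampleNet()(prev_depth, budget=1024)      # where to fire the LiDAR next
```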
Generative models are becoming ever more powerful, being able to synthesize highly realistic images. We propose an algorithm for taming these models - changing the probability that the model will produce a specific image or image category. We consider generative models that are powered by normalizing flows, which allow us to reason about the exact likelihood of a given image. Our method is general purpose, and we exemplify it using models that generate human faces, a subdomain with many interesting privacy and bias considerations. Our method can be used in the context of privacy, e.g., removing a specific person from the output of a model, and also in the context of de-biasing, by forcing a model to output specific image categories according to a given target distribution. Our method uses a fast fine-tuning process without retraining the model from scratch, achieving the goal in less than 1% of the time taken to initially train the generative model. We evaluate qualitatively and quantitatively, examining the success of the taming process and the output quality.
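A minimal sketch of what such a fine-tuning objective could look like, under our assumptions: because a normalizing flow gives exact log-likelihoods, one can directly penalize the likelihood of unwanted images while anchoring the likelihood of reference images. The specific loss and regularization below are illustrative, not the paper's exact formulation.

```python
import torch

def taming_loss(flow_log_prob, x_target, x_ref, lam=1.0):
    """Illustrative 'taming' objective for a normalizing flow: push down the
    exact likelihood of images to suppress (x_target) while keeping the
    likelihood of reference images (x_ref) high, so the rest of the model's
    behavior is preserved. `flow_log_prob` is any callable returning
    per-sample log-likelihoods. To *raise* the probability of a category
    instead, flip the sign of the first term."""
    forget = flow_log_prob(x_target).mean()   # likelihood to suppress
    keep = flow_log_prob(x_ref).mean()        # likelihood to preserve
    return forget - lam * keep                # minimizing lowers p(target)

# Fine-tuning loop sketch (hypothetical `flow` object with .log_prob):
# opt = torch.optim.Adam(flow.parameters(), lr=1e-5)
# loss = taming_loss(flow.log_prob, x_target, x_ref)
# loss.backward(); opt.step()
```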
Scene flow estimation is a long-standing problem in computer vision, where the goal is to find the 3D motion of a scene from its consecutive observations. Recently, there have been efforts to compute the scene flow from 3D point clouds. A common approach is to train a regression model that consumes source and target point clouds and outputs the per-point translation vectors. An alternative is to learn point matches between the point clouds concurrently with regressing a refinement of the initial correspondence flow. In both cases, the learning task is very challenging since the flow regression is done in the free 3D space, and a typical solution is to resort to a large annotated synthetic dataset. We introduce SCOOP, a new method for scene flow estimation that can be learned on a small amount of data without employing ground-truth flow supervision. In contrast to previous work, we train a pure correspondence model focused on learning point feature representation and initialize the flow as the difference between a source point and its softly corresponding target point. Then, in the run-time phase, we directly optimize a flow refinement component with a self-supervised objective, which leads to a coherent and accurate flow field between the point clouds. Experiments on widely used datasets demonstrate the performance gains achieved by our method compared to existing leading techniques while using a fraction of the training data. Our code is publicly available at https://github.com/itailang/SCOOP.
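A minimal sketch of the soft-correspondence flow initialization described above: each source point is matched softly to the target cloud via feature similarity, and its initial flow is the offset to the softly matched point. The cosine similarity and the softmax temperature are illustrative choices, not necessarily SCOOP's exact formulation.

```python
import torch

def init_flow(src_xyz, tgt_xyz, src_feat, tgt_feat, temperature=0.01):
    """Initialize scene flow from soft correspondences: each source point
    moves toward a softly matched target point.

    src_xyz: (N, 3), tgt_xyz: (M, 3) point coordinates;
    src_feat: (N, D), tgt_feat: (M, D) learned point features.
    """
    f_src = torch.nn.functional.normalize(src_feat, dim=1)
    f_tgt = torch.nn.functional.normalize(tgt_feat, dim=1)
    sim = f_src @ f_tgt.t()                        # (N, M) feature similarity
    weights = torch.softmax(sim / temperature, dim=1)
    soft_match = weights @ tgt_xyz                 # (N, 3) soft target point
    return soft_match - src_xyz                    # per-point initial flow
```

At run time, a refinement added on top of this initialization would be optimized directly with a self-supervised objective (e.g., nearest-neighbor consistency plus smoothness), per the abstract.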
A triangular mesh is one of the most popular 3D data representations. As such, the deployment of deep neural networks for mesh processing is widespread and attracting increasing attention. However, neural networks are prone to adversarial attacks, where carefully crafted inputs impair the model's functionality. The need to explore these vulnerabilities is a fundamental factor in the future development of 3D-based applications. Recently, mesh attacks were studied on the semantic level, where classifiers are misled into producing wrong predictions. Nevertheless, mesh surfaces possess complex geometric attributes beyond their semantic meaning, and their analysis often includes the need to encode and reconstruct the geometry of the shape. We propose a novel framework for a geometric adversarial attack on a 3D mesh autoencoder. In this setting, an adversarial input mesh deceives the autoencoder by forcing it to reconstruct a different geometric shape at its output. The malicious input is produced by perturbing a clean shape in the spectral domain. Our method leverages the spectral decomposition of the mesh, along with additional mesh-related properties, to obtain visually credible results that consider the delicacy of surface distortions. Our code is publicly available at https://github.com/StolikTomer/SAGA.
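A minimal sketch of a spectral-domain perturbation, under our simplifying assumptions: the vertex coordinates are shifted along the low-frequency eigenvectors of the mesh Laplacian, which keeps the perturbed surface smooth and visually plausible. Using a plain symmetric Laplacian and only the first $k$ frequencies is an illustration; in the actual attack, the perturbation would be optimized against the autoencoder.

```python
import numpy as np

def spectral_perturb(vertices, laplacian, delta, k=40):
    """Perturb a mesh in the spectral domain: shift the first k spectral
    coefficients of the shape (its projection onto the low-frequency
    Laplacian eigenvectors) by `delta`, leaving higher frequencies intact.

    vertices: (V, 3), laplacian: (V, V) symmetric, delta: (k, 3)
    (in an attack, `delta` is the adversarially optimized variable).
    """
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    basis = eigvecs[:, :k]                   # smooth, low-frequency basis (V, k)
    # Adding basis @ delta changes exactly the first k spectral coefficients
    # of the shape, producing a smooth, low-visibility surface deformation.
    return vertices + basis @ delta
```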
Video anomaly detection is an ill-posed problem because it relies on many parameters such as appearance, pose, camera angle, background, and more. We distill the problem to anomaly detection of human pose, thus reducing the risk of nuisance parameters, such as appearance, affecting the result. Focusing on pose alone also has the side benefit of reducing bias against distinct minority groups. Our model works directly on human pose graph sequences and is exceptionally lightweight ($\sim1K$ parameters), capable of running on any machine that runs the pose estimation, with negligible additional resources. We leverage the highly compact pose representation in a normalizing flows framework, which we extend to tackle the unique characteristics of spatio-temporal pose data, and show its advantages in this use case. Our algorithm uses normalizing flows to learn a bijective mapping between the pose data distribution and a Gaussian distribution, using spatio-temporal graph convolution blocks. The algorithm is quite general and can handle training data consisting of only normal examples, as well as a supervised dataset that consists of labeled normal and abnormal examples. We report state-of-the-art results on two anomaly detection benchmarks - the unsupervised ShanghaiTech dataset and the recent supervised UBnormal dataset.
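A sketch of how a flow-based pose model could be scored at test time, under our assumptions: each person's spatio-temporal pose window receives its log-likelihood under the trained flow as a normality score, and a frame is as anomalous as its least likely person. The min-aggregation over persons is our illustrative choice.

```python
import torch

def frame_anomaly_scores(flow_log_prob, pose_windows, person_to_frame, n_frames):
    """Score frames from per-person pose windows.

    flow_log_prob: callable returning (P,) log-likelihoods for P windows;
    pose_windows: stacked per-person spatio-temporal pose windows;
    person_to_frame: (P,) frame index of each window.
    Frames with no detected person keep score -inf (treated as normal).
    """
    log_p = flow_log_prob(pose_windows)             # (P,) normality scores
    scores = torch.full((n_frames,), float("inf"))
    for lp, f in zip(log_p, person_to_frame):
        scores[f] = torch.minimum(scores[f], lp)    # frame = least likely person
    return -scores                                  # higher = more anomalous
```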
The increasing popularity of facial manipulation (Deepfakes) and synthetic face creation raises the need to develop robust forgery detection solutions. Crucially, most work in this domain assumes that the Deepfakes in the test set come from the same Deepfake algorithms that were used for training the network. This is not how things work in practice. Instead, we consider the case where the network is trained on one Deepfake algorithm and tested on Deepfakes generated by another algorithm. Typically, supervised techniques follow a pipeline of visual feature extraction from a deep backbone, followed by a binary classification head. Instead, our algorithm aggregates features extracted across all layers of one backbone network to detect a fake. We evaluate our approach on two domains of interest - Deepfake detection and synthetic image detection - and find that we achieve state-of-the-art results.
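A minimal sketch of the all-layer aggregation idea: pool the activation of every stage of a frozen backbone, concatenate the pooled vectors, and classify real vs. fake with a single head. Using the four residual stages of a ResNet50 and a linear head are our illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class LayerAggregator(nn.Module):
    """Aggregate features from all stages of a frozen pre-trained backbone
    and classify real vs. fake (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet50(weights="IMAGENET1K_V2").eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)                  # backbone stays frozen
        # Stage widths of ResNet50: 256, 512, 1024, 2048 channels.
        self.head = nn.Linear(256 + 512 + 1024 + 2048, 1)

    def forward(self, x):
        b = self.backbone
        x = b.maxpool(b.relu(b.bn1(b.conv1(x))))     # stem
        feats = []
        for stage in (b.layer1, b.layer2, b.layer3, b.layer4):
            x = stage(x)
            feats.append(x.mean(dim=(2, 3)))         # global-average pool per stage
        return self.head(torch.cat(feats, dim=1))    # real/fake logit
```

The intuition is that forgery traces surface at different depths for different generation algorithms, so pooling over all layers generalizes better across Deepfake methods than the final-layer feature alone.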
Point cloud registration (PCR) is an important task in many fields, including autonomous driving with LiDAR sensors. PCR algorithms have improved significantly in recent years by combining deep-learned features with robust estimation methods. These algorithms succeed in scenarios such as indoor scenes and object model registration. However, testing in the automotive LiDAR setting, which presents its own challenges, has been limited. The standard benchmark for this setting, KITTI-10m, has essentially been saturated by recent algorithms: many of them achieve near-perfect recall. In this work, we stress-test recent PCR techniques with LiDAR data. We propose a method for selecting balanced registration sets, which are challenging sets of frame pairs from LiDAR datasets. They contain a balanced representation of the different relative motions that appear in a dataset, i.e., small and large rotations, small and large offsets in space and time, and various combinations of these. We perform a thorough comparison of accuracy and run-time on these benchmarks. Perhaps unexpectedly, we find that the fastest and simultaneously most accurate approach is a version of advanced RANSAC. We further improve results with a novel pre-filtering method.
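A minimal sketch of balanced registration-set selection, under our assumptions: candidate frame pairs are binned by relative rotation and translation magnitude, and the same number of pairs is drawn from every 2D bin so that large motions are not drowned out by the abundant near-identity pairs. The bin edges below are illustrative, not the paper's.

```python
import numpy as np

def balanced_pairs(rot_deg, trans_m, per_bin,
                   rot_edges=(0, 10, 30, 180), trans_edges=(0, 5, 20, np.inf)):
    """Select a balanced set of frame pairs from a LiDAR dataset.

    rot_deg, trans_m: (P,) relative rotation (degrees) and translation
    (meters) of each candidate pair. Returns indices of selected pairs.
    """
    rng = np.random.default_rng(0)
    r_bin = np.digitize(rot_deg, rot_edges[1:-1])    # rotation-magnitude bin
    t_bin = np.digitize(trans_m, trans_edges[1:-1])  # translation-magnitude bin
    chosen = []
    for r in range(len(rot_edges) - 1):
        for t in range(len(trans_edges) - 1):
            idx = np.flatnonzero((r_bin == r) & (t_bin == t))
            take = min(per_bin, idx.size)            # bins may be underfull
            if take > 0:
                chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen) if chosen else np.empty(0, dtype=int)
```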
How many labeled pixels are needed to segment an image, without any prior knowledge? We conduct an experiment to answer this question. In our experiment, an Oracle uses Active Learning to train a network from scratch. The Oracle has access to the entire label map of the image, but the goal is to reveal as few pixel labels to the network as possible. We find that, on average, the Oracle needs to reveal (i.e., annotate) less than $0.1\%$ of the pixels in order to train a network. The network can then label all pixels in the image at an accuracy of more than $98\%$. Based on this single-image-annotation experiment, we design an experiment to quickly annotate an entire data set. In the data set level experiment, the Oracle trains a new network for each image from scratch. The network can then be used to create pseudo-labels, which are the network-predicted labels of the unlabeled pixels, for the entire image. Only then is a data set level network trained from scratch on all the pseudo-labeled images at once. We repeat both the image level and the data set level experiments on two very different real-world data sets, and find that it is possible to reach the performance of a fully annotated data set using a fraction of the annotation cost.
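An illustrative sketch of the single-image Oracle loop: in each round, the network's most uncertain pixels are revealed (annotated) from the full label map, and the network is retrained on the revealed set. The entropy-based acquisition, round count, and the user-supplied `train_step` callable are our assumptions about the experimental setup.

```python
import torch

def oracle_annotation_loop(model, train_step, image, full_label_map,
                           rounds=10, pixels_per_round=20):
    """Single-image Active Learning loop with an Oracle.

    image: (1, C, H, W); full_label_map: (H, W), held by the Oracle;
    train_step: callable that retrains `model` on the revealed pixels.
    Returns the boolean mask of pixels the Oracle had to annotate.
    """
    h, w = full_label_map.shape
    revealed = torch.zeros(h, w, dtype=torch.bool)
    for _ in range(rounds):
        with torch.no_grad():
            prob = model(image).softmax(dim=1)[0]            # (K, H, W)
        entropy = -(prob * prob.clamp_min(1e-8).log()).sum(0)
        entropy[revealed] = -1.0                             # skip known pixels
        idx = entropy.flatten().topk(pixels_per_round).indices
        revealed.view(-1)[idx] = True                        # Oracle annotates
        train_step(model, image, full_label_map, revealed)   # retrain on revealed
    return revealed
```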
Anomaly detection is a well-established research area that seeks to identify samples outside of a predetermined distribution. An anomaly detection pipeline consists of two main stages: (1) feature extraction and (2) normality score assignment. Recent papers have used pre-trained networks for feature extraction, achieving state-of-the-art results. However, the use of pre-trained networks does not fully utilize the normal samples that are available at train time. This paper suggests taking advantage of this information by using teacher-student training. In our setting, a pre-trained teacher network is used to train a student network on the normal training samples. Since the student network is trained only on normal samples, it is expected to deviate from the teacher network in abnormal cases. This difference can serve as a complementary representation to the pre-trained feature vector. Our method -- Transformaly -- exploits a pre-trained Vision Transformer (ViT) to extract both feature vectors: the pre-trained (agnostic) features and the teacher-student (fine-tuned) features. We report state-of-the-art AUROC results in both the common unimodal setting, where one class is considered normal and the rest are considered abnormal, and the multimodal setting, where all classes but one are considered normal and just one class is considered abnormal. The code is available at https://github.com/MatanCohen1/Transformaly.
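A minimal sketch of a two-branch score in the spirit of the abstract: an "agnostic" score from the kNN distance of the pre-trained (teacher) feature to the normal training features, plus a "fine-tuned" score from the teacher-student discrepancy. Summing z-normalized branches is our illustrative fusion; the paper's exact scoring may differ.

```python
import torch

def two_branch_anomaly_score(teacher_feat, student_feat, train_feats, k=2):
    """Combine pre-trained and teacher-student anomaly evidence.

    teacher_feat, student_feat: (N, D) test features from teacher/student;
    train_feats: (M, D) teacher features of the normal training samples.
    Returns (N,) scores; higher means more anomalous.
    """
    dists = torch.cdist(teacher_feat, train_feats)          # (N, M) distances
    knn_score = dists.topk(k, largest=False).values.mean(dim=1)   # agnostic branch
    ts_score = (teacher_feat - student_feat).pow(2).mean(dim=1)   # fine-tuned branch
    # z-normalize each branch before fusing so their scales are comparable
    z = lambda s: (s - s.mean()) / (s.std() + 1e-8)
    return z(knn_score) + z(ts_score)
```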