Audio signals are often represented as spectrograms and treated as 2D images. In this light, deep convolutional architectures are widely used for music audio tasks even though these two data types have very different structures. In this work, we attempt to "open the black box" of deep convolutional models to inform future architectures for music audio tasks, and to explain the excellent performance of deep convolutions that model spectrograms as 2D images. To this end, we extend recent explainability discussions in deep learning for natural image data to music audio data through systematic experiments using the deep features learned by various convolutional architectures. We demonstrate that deep convolutional features perform well across various target tasks, whether or not they are extracted from deep architectures originally trained on those tasks. Additionally, deep features exhibit high similarity to hand-crafted wavelet features, whether they are extracted from a trained or an untrained model.
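As a concrete illustration of how such representation comparisons can be run, the sketch below computes linear CKA between two feature matrices, e.g., deep convolutional features and wavelet/scattering features extracted from the same audio clips. The choice of linear CKA and the helper names in the comments are illustrative assumptions, not necessarily the measure used in the paper.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two feature matrices of shape
    (n_clips, feature_dim). Values close to 1 indicate highly similar
    representations. Illustrative choice of similarity index."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)

# e.g., compare features from an untrained CNN with wavelet/scattering features
# (extract_deep and extract_wavelet are hypothetical feature extractors):
# print(linear_cka(extract_deep(clips), extract_wavelet(clips)))
```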
The lower the distortion of an estimator, the more the distribution of its outputs generally deviates from the distribution of the signals it attempts to estimate. This phenomenon, known as the perception-distortion tradeoff, has captured significant attention in image restoration, where it implies that fidelity to ground-truth images comes at the expense of perceptual quality (deviation from the statistics of natural images). However, despite the increasing popularity of performing comparisons on the perception-distortion plane, an important question remains open: what is the minimal distortion that can be achieved under a given perception constraint? In this paper, we derive a closed-form expression for this distortion-perception (DP) function for the mean squared error (MSE) distortion and the Wasserstein-2 perception index. We prove that the DP function is always quadratic, regardless of the underlying distribution. This stems from the fact that estimators on the DP curve form a geodesic in Wasserstein space. In the Gaussian setting, we further provide a closed-form expression for such estimators. For general distributions, we show how these estimators can be constructed from the estimators at the two extremes of the tradeoff: the global MSE minimizer and a minimizer of the MSE under a perfect perceptual quality constraint. The latter can be obtained as a stochastic transformation of the former.
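A minimal sketch of this construction in a scalar Gaussian toy setting is given below, assuming X ~ N(0, sigma_x^2) and Y = X + N(0, sigma_n^2): the MMSE estimator and the perfect-perception estimator are interpolated linearly, and the resulting MSE is traced against the Wasserstein-2 distance to the signal distribution. The toy setup and variable names are assumptions for illustration, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x, sigma_n = 1.0, 0.5
n = 200_000

x = rng.normal(0.0, sigma_x, n)
y = x + rng.normal(0.0, sigma_n, n)

# Two extremes of the tradeoff in the scalar Gaussian case:
xhat_mmse = (sigma_x**2 / (sigma_x**2 + sigma_n**2)) * y       # global MSE minimizer
xhat_perc = (sigma_x / np.sqrt(sigma_x**2 + sigma_n**2)) * y   # perfect perceptual quality

for lam in np.linspace(0.0, 1.0, 5):
    # Linear interpolation between the two extremes traces the DP curve
    xhat = (1 - lam) * xhat_mmse + lam * xhat_perc
    mse = np.mean((xhat - x) ** 2)
    w2 = abs(np.std(xhat) - sigma_x)  # W2 between zero-mean Gaussians = |std difference|
    print(f"lambda={lam:.2f}  MSE={mse:.4f}  W2={w2:.4f}")
```

Sweeping lam from 0 to 1 moves from the minimal-MSE end to the perfect-perception end of the tradeoff, and the printed (W2, MSE) pairs follow the quadratic DP relation described above.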
Any spatio-temporal movement or reorientation of the hand, made with the intention of conveying a specific meaning, can be considered a hand gesture. Inputs to hand gesture recognition systems can take several forms, such as depth images, monocular RGB, or skeleton joint points. We observe that raw depth images possess low contrast in the hand regions of interest (ROI). They do not highlight important details to learn, such as finger-bending information (whether a finger is overlapping the palm or another finger). Recently, in deep-learning-based dynamic hand gesture recognition, researchers have been trying to fuse different input modalities (e.g., RGB or depth images and hand skeleton joint points) to improve recognition accuracy. In this paper, we focus on dynamic hand gesture (DHG) recognition using depth-quantized image features and hand skeleton joint points. In particular, we explore the effect of using depth-quantized features in Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based multi-modal fusion networks. We find that our method improves existing results on the SHREC-DHG-14 dataset. Furthermore, we show that with our method it is possible to reduce the resolution of the input images by more than a factor of four and still obtain accuracy comparable to or better than that achieved at the resolutions used in previous methods.
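To make the idea of depth-quantized features concrete, the sketch below bins the depth values inside a hand ROI into a small number of levels and stretches them over the full intensity range, which boosts contrast between overlapping fingers and the palm. The function and its parameters are illustrative assumptions; the paper's exact quantization scheme may differ.

```python
import numpy as np

def quantize_depth_roi(depth, roi, levels=8):
    """Quantize depth values inside the hand ROI (x0, y0, x1, y1) into `levels`
    bins and spread them over 0..255 to enhance contrast. Illustrative sketch."""
    x0, y0, x1, y1 = roi
    patch = depth[y0:y1, x0:x1].astype(np.float32)
    lo, hi = patch.min(), patch.max()
    if hi <= lo:                              # flat patch: nothing to quantize
        return np.zeros(patch.shape, dtype=np.uint8)
    bins = np.floor((patch - lo) / (hi - lo) * levels)
    bins = np.clip(bins, 0, levels - 1)
    return (bins * (255 // (levels - 1))).astype(np.uint8)
```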
In this paper, we present a novel end-to-end deep neural network model for autonomous driving that takes a monocular image sequence as input and directly generates the steering control angle. First, we model the end-to-end driving problem as a local path planning process. Inspired by the environmental representation used in classical planning algorithms (i.e., the beam curvature method), pixel-wise orientations are fed into the network to learn direction-aware features. Next, to handle the imbalanced distribution of steering values in training datasets, we propose an improved cost-sensitive loss function named SteeringLoss2. In addition, we present a new end-to-end driving dataset, which provides corresponding LiDAR and image sequences as well as standard driving behaviors. Our dataset includes multiple driving scenarios, such as urban, country, and off-road. Numerous experiments are conducted on both the publicly available LiVi-Set and our own dataset, and the results show that a model using our proposed methods can predict the steering angle accurately.
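The abstract does not spell out the form of SteeringLoss2, but the general idea of a cost-sensitive regression loss for this imbalance can be sketched as below: samples with rare, large steering angles are up-weighted so that training is not dominated by the near-zero angles that make up most of a driving dataset. The weighting function shown here is a plain assumption for illustration, not the paper's definition.

```python
import torch

def cost_sensitive_steering_loss(pred, target, alpha=2.0):
    """Squared error weighted by the magnitude of the ground-truth steering angle,
    so rare large-angle samples contribute more. Generic stand-in, not SteeringLoss2."""
    weights = 1.0 + alpha * target.abs()
    return (weights * (pred - target) ** 2).mean()
```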
This paper proposes a novel feature, termed spectrum congruency, for describing edges in images. Spectrum congruency generalizes phase congruency, which measures the extent to which the Fourier components of an image are congruent in phase. Instead of using the fixed bases of phase congruency, spectrum congruency measures the congruency of the energy distribution of multiscale patches in a data-driven transform domain, which is more adaptable to the input images. Multiscale image patches are used to acquire different frequency components for modeling the local energy and amplitude. Spectrum congruency coincides well with human visual perception of features and provides a more reliable way of detecting edges. Unlike most existing differential-based multiscale edge detectors, which simply combine multiscale information, our method focuses on exploiting the correlation among multiscale patches based on their local energy. We test our proposed method on synthetic and real images, and the results demonstrate that our approach is practical and highly robust to noise.
State-of-the-art object detectors are vulnerable to localized patch hiding attacks, in which an adversary introduces a small adversarial patch to make detectors miss salient objects. In this paper, we propose DetectorGuard, the first general framework for building provably robust detectors against localized patch hiding attacks. First, we propose a general approach for transferring robustness from image classifiers to object detectors, building a bridge between robust image classification and robust object detection. We apply a provably robust image classifier to a sliding window over the image and aggregate the robust window classifications at different locations into a robust object detection. Second, to mitigate the notorious trade-off between clean performance and provable robustness, we use a prediction pipeline that compares the outputs of a conventional detector and a robust detector to catch an ongoing attack. When no attack is detected, DetectorGuard outputs the precise bounding boxes predicted by the conventional detector to achieve high clean performance; otherwise, DetectorGuard triggers an attack alert for security. Notably, our prediction strategy ensures that the robust detector incorrectly missing objects will not hurt the clean performance of DetectorGuard. Moreover, our approach allows us to formally prove the robustness of DetectorGuard on certified objects against any patch hiding attacker, i.e., it either detects the object or triggers an alert. Our evaluation on the PASCAL VOC and MS COCO datasets shows that DetectorGuard has almost the same clean performance as conventional detectors and, more importantly, achieves the first provable robustness against localized patch hiding attacks.
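The prediction pipeline described above can be sketched as follows: the robust detector is only used to check whether the conventional detector has missed something; if every robustly detected object is explained by a conventional box, the precise conventional boxes are returned, otherwise an alert is raised. Function and variable names are illustrative assumptions, not the paper's implementation.

```python
def detectorguard_predict(image, conventional_detector, robust_detector, matches):
    """matches(robust_box, conv_box) -> bool decides whether a conventional box
    explains a robustly detected object (e.g., by an IoU threshold)."""
    conv_boxes = conventional_detector(image)
    robust_boxes = robust_detector(image)

    # An object found by the robust detector but missed by the conventional one
    # indicates a possible patch hiding attack.
    unexplained = [r for r in robust_boxes
                   if not any(matches(r, c) for c in conv_boxes)]
    if unexplained:
        return {"alert": True, "boxes": []}       # trigger attack alert
    return {"alert": False, "boxes": conv_boxes}  # clean case: precise boxes
```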
Transformers are increasingly popular in computer vision; they treat an image as a sequence of patches and learn robust global features from that sequence. However, a suitable vehicle re-identification method should consider both robust global features and discriminative local features. In this paper, we propose a graph interactive transformer (GiT) for vehicle re-identification. Overall, we stack multiple GiT blocks to build a competitive vehicle re-identification model, in which each GiT block employs a novel local correlation graph (LCG) module to extract discriminative local features within patches and uses a transformer layer to extract robust global features among patches. In detail, in the current GiT block, the LCG module learns local features from the local and global features produced by the LCG module and transformer layer of the previous GiT block. Similarly, the transformer layer learns global features from the global features generated by the transformer layer of the previous GiT block and the new local features output by the LCG module of the current GiT block. LCG modules and transformer layers are therefore coupled, enabling effective cooperation between local and global features. To the best of our knowledge, this is the first work to combine graphs and transformers for vehicle re-identification. Extensive experiments on three large-scale vehicle re-identification datasets demonstrate that our method is superior to state-of-the-art approaches. The code will be made available soon.
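The coupling between the LCG module and the transformer layer amounts to the following per-block data flow (a schematic sketch; module internals and names are assumptions for illustration, not the paper's architecture):

```python
import torch.nn as nn

class GiTBlock(nn.Module):
    """Schematic GiT block: the LCG module refines local features using the
    previous block's local and global features, and the transformer layer
    refines global features using the previous global features together with
    the newly produced local features."""
    def __init__(self, lcg_module, transformer_layer):
        super().__init__()
        self.lcg = lcg_module
        self.transformer = transformer_layer

    def forward(self, local_prev, global_prev):
        local_cur = self.lcg(local_prev, global_prev)           # discriminative local features
        global_cur = self.transformer(global_prev, local_cur)   # robust global features
        return local_cur, global_cur
```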
The ability to recognize the position and order of the floor-level lines that divide adjacent building floors can benefit many applications, for example, urban augmented reality (AR). This work tackles the problem of locating floor-level lines in street-view images using a supervised deep learning approach. Unfortunately, very little data is available for training such a network: current street-view datasets contain either semantic annotations that lack geometric attributes, or rectified facades without perspective priors. To address this issue, we first compile a new dataset and develop a new data augmentation scheme to synthesize training samples by harnessing (i) the rich semantics of existing rectified facades and (ii) perspective priors of buildings in diverse street views. Next, we design FloorLevel-Net, a multi-task learning network that associates explicit features of building facades with implicit floor-level lines, along with a height-attention mechanism to help enforce a vertical ordering of floor-level lines. The generated segmentations are then passed to a second-stage geometric post-processing step that exploits self-constrained geometric priors for plausible and consistent reconstruction of floor-level lines. Quantitative and qualitative evaluations on assorted facades in existing datasets and street views from Google demonstrate the effectiveness of our approach. We also present context-aware image overlay results and show the potential of our approach for enriching AR-related applications.
This paper addresses the problem of 3D human pose and shape estimation from a single image. Previous approaches consider a parametric model of the human body, SMPL, and attempt to regress the model parameters that give rise to a mesh consistent with image evidence. This parameter regression has proven very challenging, with model-based approaches underperforming nonparametric solutions in terms of pose estimation. In our work, we propose to relax this heavy reliance on the model's parameter space. We still retain the topology of the SMPL template mesh, but instead of predicting model parameters, we directly regress the 3D locations of the mesh vertices. This is a demanding task for a typical network, but our key insight is that the regression becomes significantly easier with a Graph-CNN. This architecture allows us to explicitly encode the template mesh structure within the network and leverage the spatial locality the mesh has to offer. Image-based features are attached to the mesh vertices, and the Graph-CNN is responsible for processing them on the mesh structure, while the regression target for each vertex is its 3D location. Having recovered the complete 3D geometry of the mesh, if a specific model parametrization is still required, it can be reliably regressed from the vertex locations. We demonstrate the flexibility and effectiveness of our proposed graph-based mesh regression by attaching different types of features to the mesh vertices. In all cases, we outperform comparable baselines relying on model parameter regression, while also achieving state-of-the-art results among model-based pose estimation approaches.
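A minimal sketch of regressing per-vertex 3D locations with graph convolutions on the template mesh is shown below: an image feature is attached to every vertex and processed along the mesh adjacency. Layer sizes, the adjacency normalization, and the use of a single global image feature are simplifying assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution layer over the template mesh."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)   # (V, V) row-normalized mesh adjacency
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                  # x: (B, V, in_dim)
        return torch.relu(self.lin(self.adj @ x))

class MeshRegressor(nn.Module):
    """Attach an image feature to every template vertex and regress 3D locations."""
    def __init__(self, feat_dim, adj, hidden=128):
        super().__init__()
        self.gc1 = GraphConv(feat_dim + 3, hidden, adj)
        self.gc2 = GraphConv(hidden, hidden, adj)
        self.head = nn.Linear(hidden, 3)   # per-vertex (x, y, z)

    def forward(self, img_feat, template_xyz):
        # img_feat: (B, feat_dim); template_xyz: (V, 3)
        B, V = img_feat.shape[0], template_xyz.shape[0]
        x = torch.cat([img_feat[:, None, :].expand(B, V, -1),
                       template_xyz[None, :, :].expand(B, V, 3)], dim=-1)
        return self.head(self.gc2(self.gc1(x)))
```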
This paper focuses on monocular 3D object detection, one of the essential modules in autonomous driving systems. A key challenge is that depth recovery from monocular data is ill-posed. In this work, we first conduct a thorough analysis to reveal how existing methods fail to estimate depth robustly when various geometric shifts occur. In particular, through a series of image-based and instance-based manipulations applied to current detectors, we show that existing detectors fail to capture the consistent relationships between depth and both the apparent sizes and the positions of objects. To alleviate this issue and improve the robustness of detectors, we convert the aforementioned manipulations into four corresponding 3D-aware data augmentation techniques. At the image level, we randomly manipulate the camera system, including its focal length, field of view, and location, to generate new training images with geometric shifts. At the instance level, we crop foreground objects and randomly paste them into other scenes to generate new training instances. All the proposed augmentation techniques share the virtue that the geometric relationships within objects are preserved while their geometry is manipulated. With the proposed data augmentation methods, not only is the instability of depth recovery effectively alleviated, but the final 3D detection performance is also significantly improved, yielding state-of-the-art results on the KITTI and nuScenes monocular 3D detection benchmarks.
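As one concrete instance of the image-level idea, the sketch below rescales an image to emulate a change of focal length while keeping the 3D labels untouched: scaling the image by a factor s is equivalent to scaling the camera intrinsics (focal length and principal point) by s, so geometric consistency between the image and the 3D boxes is preserved. The function name, parameters, and the decision to skip cropping back to the original resolution are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
import cv2

def focal_length_augment(image, boxes_2d, intrinsics, s_range=(0.8, 1.2), rng=None):
    """Rescale the image by a random factor s, equivalent to scaling the camera
    intrinsics by s; the 3D boxes stay untouched, and the 2D boxes are scaled
    to stay aligned with the new image. Illustrative sketch."""
    rng = rng or np.random.default_rng()
    s = rng.uniform(*s_range)
    h, w = image.shape[:2]
    aug = cv2.resize(image, (int(round(w * s)), int(round(h * s))))
    K = intrinsics.copy().astype(np.float32)
    K[0, 0] *= s; K[1, 1] *= s   # fx, fy
    K[0, 2] *= s; K[1, 2] *= s   # cx, cy
    return aug, boxes_2d * s, K
```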