Abstract:We introduce STAViS, a spatio-temporal audiovisual saliency network that combines spatio-temporal visual and auditory information in order to efficiently address the problem of saliency estimation in videos. Our approach employs a single network that combines visual saliency and auditory features and learns to appropriately localize sound sources and to fuse the two saliencies in order to obtain a final saliency map. The network has been designed, trained end-to-end, and evaluated on six different databases that contain audiovisual eye-tracking data of a large variety of videos. We compare our method against 8 different state-of-the-art visual saliency models. Evaluation results across databases indicate that our STAViS model outperforms our visual only variant as well as the other state-of-the-art models in the majority of cases. Also, the consistently good performance it achieves for all databases indicates that it is appropriate for estimating saliency "in-the-wild".
Abstract:Tropical Geometry and Mathematical Morphology share the same max-plus and min-plus semiring arithmetic and matrix algebra. In this chapter we summarize some of their main ideas and common (geometric and algebraic) structure, generalize and extend both of them using weighted lattices and a max-$\star$ algebra with an arbitrary binary operation $\star$ that distributes over max, and outline applications to geometry, machine learning, and optimization. Further, we generalize tropical geometrical objects using weighted lattices. Finally, we provide the optimal solution of max-$\star$ equations using morphological adjunctions that are projections on weighted lattices, and apply it to optimal piecewise-linear regression for fitting max-$\star$ tropical curves and surfaces to arbitrary data that constitute polygonal or polyhedral shape approximations. This also includes an efficient algorithm for solving the convex regression problem of data fitting with max-affine functions.
Abstract:In this work, we examine the process of Tropical Polynomial Division, a geometric method which seeks to emulate the division of regular polynomials, when applied to those of the max-plus semiring. This is done via the approximation of the Newton Polytope of the dividend polynomial by that of the divisor. This process is afterwards generalized and applied in the context of neural networks with ReLU activations. In particular, we make use of the intuition it provides, in order to minimize a two-layer fully connected network, trained for a binary classification problem. This method is later evaluated on a variety of experiments, demonstrating its capability to approximate a network, with minimal loss in performance.
Abstract:Instrument classification is one of the fields in Music Information Retrieval (MIR) that has attracted a lot of research interest. However, the majority of that is dealing with monophonic music, while efforts on polyphonic material mainly focus on predominant instrument recognition or multi-instrument recognition for entire tracks. We present an approach for instrument classification in polyphonic music using monophonic training data that involves mixing-augmentation methods. Specifically, we experiment with pitch and tempo-based synchronization, as well as mixes of tracks with similar music genres. Further, a custom CNN model is proposed, that uses the augmented training data efficiently and a plethora of suitable evaluation metrics are discussed as well. The tempo-sync and genre techniques stand out, achieving an 81% label ranking average precision accuracy, detecting up to 9 instruments in over 2300 testing tracks.
Abstract:We examine 3D reconstruction of architectural scenes in unordered sets of uncalibrated images. We introduce a linear method to self-calibrate and find the metric reconstruction of a camera pair. We assume unknown and different focal lengths but otherwise known internal camera parameters and a known projective reconstruction of the camera pair. We recover two possible camera configurations in space and use the Cheirality condition, that all 3D scene points are in front of both cameras, to disambiguate the solution. We show in two Theorems, first that the two solutions are in mirror positions and then the relations between their viewing directions. Our new method performs on par (median rotation error $\Delta R = 3.49^{\circ}$) with the standard approach of Kruppa equations ($\Delta R = 3.77^{\circ}$) for self-calibration and 5-Point algorithm for calibrated metric reconstruction of a camera pair. We reject erroneous image correspondences by introducing a method to examine whether point correspondences appear in the same order along $x, y$ image axes in image pairs. We evaluate this method by its precision and recall and show that it improves the robustness of point matches in architectural and general scenes. Finally, we integrate all the introduced methods to a 3D reconstruction pipeline. We utilize the numerous camera pair metric recontructions using rotation-averaging algorithms and a novel method to average focal length estimates.
Abstract:In this paper, we introduce Channel-wise recurrent convolutional neural networks (RecNets), a family of novel, compact neural network architectures for computer vision tasks inspired by recurrent neural networks (RNNs). RecNets build upon Channel-wise recurrent convolutional (CRC) layers, a novel type of convolutional layer that splits the input channels into disjoint segments and processes them in a recurrent fashion. In this way, we simulate wide, yet compact models, since the number of parameters is vastly reduced via the parameter sharing of the RNN formulation. Experimental results on the CIFAR-10 and CIFAR-100 image classification tasks demonstrate the superior size-accuracy trade-off of RecNets compared to other compact state-of-the-art architectures.
Abstract:In this work, we present a novel framework for on-line human gait stability prediction of the elderly users of an intelligent robotic rollator using Long Short Term Memory (LSTM) networks, fusing multimodal RGB-D and Laser Range Finder (LRF) data from non-wearable sensors. A Deep Learning (DL) based approach is used for the upper body pose estimation. The detected pose is used for estimating the body Center of Mass (CoM) using Unscented Kalman Filter (UKF). An Augmented Gait State Estimation framework exploits the LRF data to estimate the legs' positions and the respective gait phase. These estimates are the inputs of an encoder-decoder sequence to sequence model which predicts the gait stability state as Safe or Fall Risk walking. It is validated with data from real patients, by exploring different network architectures, hyperparameter settings and by comparing the proposed method with other baselines. The presented LSTM-based human gait stability predictor is shown to provide robust predictions of the human stability state, and thus has the potential to be integrated into a general user-adaptive control architecture as a fall-risk alarm.
Abstract:Detecting visual relationships, i.e. <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task approached in the past via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, as well as by re-implementing and fine-tuning several of them. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, while we also justify our claims both quantitatively and qualitatively.
Abstract:In this paper we address the problem of multi-cue affect recognition in challenging environments such as child-robot interaction. Towards this goal we propose a method for automatic recognition of affect that leverages body expressions alongside facial expressions, as opposed to traditional methods that usually focus only on the latter. We evaluate our methods on a challenging child-robot interaction database of emotional expressions, as well as on a database of emotional expressions by actors, and show that the proposed method achieves significantly better results when compared with the facial expression baselines, can be trained both jointly and separately, and offers us computational models for both the individual modalities, as well as for the whole body emotion.
Abstract:The great success of convolutional neural networks has caused a massive spread of the use of such models in a large variety of Computer Vision applications. However, these models are vulnerable to certain inputs, the adversarial examples, which although are not easily perceived by humans, they can lead a neural network to produce faulty results. This paper focuses on the detection of adversarial examples, which are created for convolutional neural networks that perform image classification. We propose three methods for detecting possible adversarial examples and after we analyze and compare their performance, we combine their best aspects to develop an even more robust approach. The first proposed method is based on the regularization of the feature vector that the neural network produces as output. The second method detects adversarial examples by using histograms, which are created from the outputs of the hidden layers of the neural network. These histograms create a feature vector which is used as the input of an SVM classifier, which classifies the original input either as an adversarial or as a real input. Finally, for the third method we introduce the concept of the residual image, which contains information about the parts of the input pattern that are ignored by the neural network. This method aims at the detection of possible adversarial examples, by using the residual image and reinforcing the parts of the input pattern that are ignored by the neural network. Each one of these methods has some novelties and by combining them we can further improve the detection results. For the proposed methods and their combination, we present the results of detecting adversarial examples on the MNIST dataset. The combination of the proposed methods offers some improvements over similar state of the art approaches.