Salient object detection (SOD), which simulates the human visual perception system to locate the most attractive object(s) in a scene, has been widely applied to various computer vision tasks. Now, with the advent of depth sensors, depth maps with affluent spatial information that can be beneficial in boosting the performance of SOD, can easily be captured. Although various RGB-D based SOD models with promising performance have been proposed over the past several years, an in-depth understanding of these models and challenges in this topic remains lacking. In this paper, we provide a comprehensive survey of RGB-D based SOD models from various perspectives, and review related benchmark datasets in detail. Further, considering that the light field can also provide depth maps, we review SOD models and popular benchmark datasets from this domain as well. Moreover, to investigate the SOD ability of existing models, we carry out a comprehensive evaluation, as well as attribute-based evaluation of several representative RGB-D based SOD models. Finally, we discuss several challenges and open directions of RGB-D based SOD for future research. All collected models, benchmark datasets, source code links, datasets constructed for attribute-based evaluation, and codes for evaluation will be made publicly available at https://github.com/taozh2017/RGBDSODsurvey
Micro-expressions (MEs) are brief and involuntary facial expressions that occur when people are trying to hide their true feelings or conceal their emotions. Based on psychology research, MEs play an important role in understanding genuine emotions, which leads to many potential applications. Therefore, ME analysis has been becoming an attractive topic for various research areas, such as psychology, law enforcement, and psychotherapy. In the computer vision field, the study of MEs can be divided into two main tasks: spotting and recognition, which are to identify positions of MEs in videos and determine the emotion category of detected MEs, respectively. Recently, although much research has been done, the construction of a fully automatic system for analyzing MEs is still far away from practice. This is because of two main reasons: most of the research in MEs only focuses on the recognition part while abandons the spotting task; current public datasets for ME spotting are not challenging enough to support developing a robust spotting algorithm. Our contributions in this paper are three folds: (1) We introduce an extension of the SMIC-E database, namely SMIC-E-Long database, which is a new challenging benchmark for ME spotting. (2) We suggest a new evaluation protocol that standardizes the comparison of various ME spotting techniques. (3) Extensive experiments with handcrafted and deep learning-based approaches on the SMIC-E-Long database are performed for baseline evaluation.
Unmanned Surface Vehicles technology (USVs) is an exciting topic that essentially deploys an algorithm to safely and efficiently performs a mission. Although reinforcement learning is a well-known approach to modeling such a task, instability and divergence may occur when combining off-policy and function approximation. In this work, we used deep reinforcement learning combining Q-learning with a neural representation to avoid instability. Our methodology uses deep q-learning and combines it with a rolling wave planning approach on agile methodology. Our method contains two critical parts in order to perform missions in an unknown environment. The first is a path planner that is responsible for generating a potential effective path to a destination without considering the details of the root. The latter is a decision-making module that is responsible for short-term decisions on avoiding obstacles during the near future steps of USV exploitation within the context of the value function. Simulations were performed using two algorithms: a basic vanilla vessel navigator (VVN) as a baseline and an improved one for the vessel navigator with a planner and local view (VNPLV). Experimental results show that the proposed method enhanced the performance of VVN by 55.31 on average for long-distance missions. Our model successfully demonstrated obstacle avoidance by means of deep reinforcement learning using planning adaptive paths in unknown environments.
Mouse dynamics is a potential means of authenticating users. Typically, the authentication process is based on classical machine learning techniques, but recently, deep learning techniques have been introduced for this purpose. Although prior research has demonstrated how machine learning and deep learning algorithms can be bypassed by carefully crafted adversarial samples, there has been very little research performed on the topic of behavioural biometrics in the adversarial domain. In an attempt to address this gap, we built a set of attacks, which are applications of several generative approaches, to construct adversarial mouse trajectories that bypass authentication models. These generated mouse sequences will serve as the adversarial samples in the context of our experiments. We also present an analysis of the attack approaches we explored, explaining their limitations. In contrast to previous work, we consider the attacks in a more realistic and challenging setting in which an attacker has access to recorded user data but does not have access to the authentication model or its outputs. We explore three different attack strategies: 1) statistics-based, 2) imitation-based, and 3) surrogate-based; we show that they are able to evade the functionality of the authentication models, thereby impacting their robustness adversely. We show that imitation-based attacks often perform better than surrogate-based attacks, unless, however, the attacker can guess the architecture of the authentication model. In such cases, we propose a potential detection mechanism against surrogate-based attacks.
Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Due to it's complexity, i For high-stakes domains such as medical, providing intuitive explanations that can be consumed by domain experts without ML expertise becomes crucial. To this demand, concept-based methods (e.g., TCAV) were introduced to provide explanations using user-chosen high-level concepts rather than individual input features. While these methods successfully leverage rich representations learned by the networks to reveal how human-defined concepts are related to the prediction, they require users to select concepts of their choice and collect labeled examples of those concepts. In this work, we introduce DTCAV (Discovery TCAV) a global concept-based interpretability method that can automatically discover concepts as image segments, along with each concept's estimated importance for a deep neural network's predictions. We validate that discovered concepts are as coherent to humans as hand-labeled concepts. We also show that the discovered concepts carry significant signal for prediction by analyzing a network's performance with stitched/added/deleted concepts. DTCAV results revealed a number of undesirable correlations (e.g., a basketball player's jersey was a more important concept for predicting the basketball class than the ball itself) and show the potential shallow reasoning of these networks.
Refrigeration and chiller optimization is an important and well studied topic in mechanical engineering, mostly taking advantage of physical models, designed on top of over-simplified assumptions, over the equipments. Conventional optimization techniques using physical models make decisions of online parameter tuning, based on very limited information of hardware specifications and external conditions, e.g., outdoor weather. In recent years, new generation of sensors is becoming essential part of new chiller plants, for the first time allowing the system administrators to continuously monitor the running status of all equipments in a timely and accurate way. The explosive growth of data flowing to databases, driven by the increasing analytical power by machine learning and data mining, unveils new possibilities of data-driven approaches for real-time chiller plant optimization. This paper presents our research and industrial experience on the adoption of data models and optimizations on chiller plant and discusses the lessons learnt from our practice on real world plants. Instead of employing complex machine learning models, we emphasize the incorporation of appropriate domain knowledge into data analysis tools, which turns out to be the key performance improver over state-of-the-art deep learning techniques by a significant margin. Our empirical evaluation on a real world chiller plant achieves savings by more than 7% on daily power consumption.
Highly accurate 3D volumetric reconstruction is still an open research topic where the main difficulties are usually related to merging rough estimations with high frequency details. One of the most promising methods is the fusion between multi-view stereo and photometric imaging 3D shape reconstruction techniques. However, beside the intrinsic difficulties that multi-view stereo and photometric stereo have to make them working reliably, supplementary problems raise when considered together. Most importantly, the projection of the fine details usually retrievable with photometric stereo onto the rough multi-view stereo reconstruction is difficult to handle. In this work, we present a volumetric approach to the multi-view photometric stereo problem defined by a unified differential model. The key to our method is the signed distance field parameterisation which avoids the complex step of re-projecting high frequency details as the parameterisation of the whole volume allows a photometric modeling on the volume itself efficiently dealing with occlusions, discontinuities, etc. The relation between the surface normals and the gradient of the signed distance field leads to a homogeneous linear partial differential equation. A variational optimisation is adopted in order to combine multiple images from multiple points of view in a single system avoiding the need of merging depth maps. Our approach is evaluated on synthetic and real data-sets and achieves state-of-the-art results.
Just like many other topics in computer vision, image classification has achieved significant progress recently by using deep-learning neural networks, especially the Convolutional Neural Networks (CNN). Most of the existing works are focused on classifying very clear natural images, evidenced by the widely used image databases such as Caltech-256, PASCAL VOCs and ImageNet. However, in many real applications, the acquired images may contain certain degradations that lead to various kinds of blurring, noise, and distortions. One important and interesting problem is the effect of such degradations to the performance of CNN-based image classification. More specifically, we wonder whether image-classification performance drops with each kind of degradation, whether this drop can be avoided by including degraded images into training, and whether existing computer vision algorithms that attempt to remove such degradations can help improve the image-classification performance. In this paper, we empirically study this problem for four kinds of degraded images -- hazy images, underwater images, motion-blurred images and fish-eye images. For this study, we synthesize a large number of such degraded images by applying respective physical models to the clear natural images and collect a new hazy image dataset from the Internet. We expect this work can draw more interests from the community to study the classification of degraded images.
Interfacing a kinetic action of a person to an action of a machine system is an important research topic in many application areas. One of the key factors for intimate human-machine interaction is the ability of the control algorithm to detect and classify different user commands with shortest possible latency, thus making a highly correlated link between cause and effect. In our research, we focused on the task of mapping user kinematic actions into sound samples. The presented methodology relies on the wireless sensor nodes equipped with inertial measurement units and the real-time algorithm dedicated for early detection and classification of a variety of movements/gestures performed by a user. The core algorithm is based on the approximate Bayesian inference of Vector Autoregressive Hierarchical Hidden Markov Models (VAR-HHMM), where models database is derived from the set of motion gestures. The performance of the algorithm was compared with an online version of the K-nearest neighbours (KNN) algorithm, where we used offline expert based classification as the benchmark. In almost all of the evaluation metrics (e.g. confusion matrix, recall and precision scores) the VAR-HHMM algorithm outperformed KNN. Furthermore, the VAR-HHMM algorithm, in some cases, achieved faster movement onset detection compared with the offline standard. The proposed concept, although envisioned for movement-to-sound application, could be implemented in other human-machine interfaces.
We introduce the multiresolution recurrent neural network, which extends the sequence-to-sequence framework to model natural language generation as two parallel discrete stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language tokens. There are many ways to estimate or learn the high-level coarse tokens, but we argue that a simple extraction procedure is sufficient to capture a wealth of high-level discourse semantics. Such procedure allows training the multiresolution recurrent neural network by maximizing the exact joint log-likelihood over both sequences. In contrast to the standard log- likelihood objective w.r.t. natural language tokens (word perplexity), optimizing the joint log-likelihood biases the model towards modeling high-level abstractions. We apply the proposed model to the task of dialogue response generation in two challenging domains: the Ubuntu technical support domain, and Twitter conversations. On Ubuntu, the model outperforms competing approaches by a substantial margin, achieving state-of-the-art results according to both automatic evaluation metrics and a human evaluation study. On Twitter, the model appears to generate more relevant and on-topic responses according to automatic evaluation metrics. Finally, our experiments demonstrate that the proposed model is more adept at overcoming the sparsity of natural language and is better able to capture long-term structure.