Traditional techniques for measuring similarities between time series are based on handcrafted similarity measures, whereas more recent learning-based approaches cannot exploit external supervision. We combine ideas from time-series modeling and metric learning, and study siamese recurrent networks (SRNs) that minimize a classification loss to learn a good similarity measure between time series. Specifically, our approach learns a vectorial representation for each time series in such a way that similar time series are modeled by similar representations, and dissimilar time series by dissimilar representations. Because it is a similarity prediction models, SRNs are particularly well-suited to challenging scenarios such as signature recognition, in which each person is a separate class and very few examples per class are available. We demonstrate the potential merits of SRNs in within-domain and out-of-domain classification experiments and in one-shot learning experiments on tasks such as signature, voice, and sign language recognition.
Parkinsons Disease is a neurological disorder and prevalent in elderly people. Traditional ways to diagnose the disease rely on in-person subjective clinical evaluations on the quality of a set of activity tests. The high-resolution longitudinal activity data collected by smartphone applications nowadays make it possible to conduct remote and convenient health assessment. However, out-of-lab tests often suffer from poor quality controls as well as irregularly collected observations, leading to noisy test results. To address these issues, we propose a novel time-series based approach to predicting Parkinson's Disease with raw activity test data collected by smartphones in the wild. The proposed method first synchronizes discrete activity tests into multimodal features at unified time points. Next, it distills and enriches local and global representations from noisy data across modalities and temporal observations by two attention modules. With the proposed mechanisms, our model is capable of handling noisy observations and at the same time extracting refined temporal features for improved prediction performance. Quantitative and qualitative results on a large public dataset demonstrate the effectiveness of the proposed approach.
Two factors have proven to be very important to the performance of semantic segmentation models: global context and multi-level semantics. However, generating features that capture both factors always leads to high computational complexity, which is problematic in real-time scenarios. In this paper, we propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics while keeping the efficiency high. AttaNet consists of two primary modules: Strip Attention Module (SAM) and Attention Fusion Module (AFM). Viewing that in challenging images with low segmentation accuracy, there are a significantly larger amount of vertical strip areas than horizontal ones, SAM utilizes a striping operation to reduce the complexity of encoding global context in the vertical direction drastically while keeping most of contextual information, compared to the non-local approaches. Moreover, AFM follows a cross-level aggregation strategy to limit the computation, and adopts an attention strategy to weight the importance of different levels of features at each pixel when fusing them, obtaining an efficient multi-level representation. We have conducted extensive experiments on two semantic segmentation benchmarks, and our network achieves different levels of speed/accuracy trade-offs on Cityscapes, e.g., 71 FPS/79.9% mIoU, 130 FPS/78.5% mIoU, and 180 FPS/70.1% mIoU, and leading performance on ADE20K as well.
Recently channel state information (CSI) measurements from commercial multi input multi output (MIMO) WiFi systems have been ubiquitously used for different wireless sensing applications. However, the phase of the CSI realizations is usually distorted severely by phase errors due to the hardware impairments, which significantly reduce the sensing performance. In this paper, we directly utilize the modeling of the phase distortions caused by the hardware impairments and propose an adaptive CSI estimation approach based on Kalman filter (KF) with maximum a posteriori (MAP) estimation that considers the CSI from the previous time. The performance of the proposed algorithm is compared against the Cramer Rao lower bound (CRLB). Simulation and experimental results demonstrate that our approach can track the channel variations while eliminating the phase errors accurately.
Efficient low-variance gradient estimation enabled by the reparameterization trick (RT) has been essential to the success of variational autoencoders. Doubly-reparameterized gradients (DReGs) improve on the RT for multi-sample variational bounds by applying reparameterization a second time for an additional reduction in variance. Here, we develop two generalizations of the DReGs estimator and show that they can be used to train conditional and hierarchical VAEs on image modelling tasks more effectively. We first extend the estimator to hierarchical models with several stochastic layers by showing how to treat additional score function terms due to the hierarchical variational posterior. We then generalize DReGs to score functions of arbitrary distributions instead of just those of the sampling distribution, which makes the estimator applicable to the parameters of the prior in addition to those of the posterior.
New deep-learning architectures are created every year, achieving state-of-the-art results in image recognition and leading to the belief that, in a few years, complex tasks such as sign language translation will be considerably easier, serving as a communication tool for the hearing-impaired community. On the other hand, these algorithms still need a lot of data to be trained and the dataset creation process is expensive, time-consuming, and slow. Thereby, this work aims to investigate techniques of digital image processing and machine learning that can be used to create a sign language dataset effectively. We argue about data acquisition, such as the frames per second rate to capture or subsample the videos, the background type, preprocessing, and data augmentation, using convolutional neural networks and object detection to create an image classifier and comparing the results based on statistical tests. Different datasets were created to test the hypotheses, containing 14 words used daily and recorded by different smartphones in the RGB color system. We achieved an accuracy of 96.38% on the test set and 81.36% on the validation set containing more challenging conditions, showing that 30 FPS is the best frame rate subsample to train the classifier, geometric transformations work better than intensity transformations, and artificial background creation is not effective to model generalization. These trade-offs should be considered in future work as a cost-benefit guideline between computational cost and accuracy gain when creating a dataset and training a sign recognition model.
Chronic respiratory diseases, such as chronic obstructive pulmonary disease and asthma, are a serious health crisis, affecting a large number of people globally and inflicting major costs on the economy. Current methods for assessing the progression of respiratory symptoms are either subjective and inaccurate, or complex and cumbersome, and do not incorporate environmental factors. Lacking predictive assessments and early intervention, unexpected exacerbations can lead to hospitalizations and high medical costs. This work presents a multi-modal solution for predicting the exacerbation risks of respiratory diseases, such as COPD, based on a novel spatio-temporal machine learning architecture for real-time and accurate respiratory events detection, and tracking of local environmental and meteorological data and trends. The proposed new machine learning architecture blends key attributes of both convolutional and recurrent neural networks, allowing extraction of both spatial and temporal features encoded in respiratory sounds, thereby leading to accurate classification and tracking of symptoms. Combined with the data from environmental and meteorological sensors, and a predictive model based on retrospective medical studies, this solution can assess and provide early warnings of respiratory disease exacerbations. This research will improve the quality of patients' lives through early medical intervention, thereby reducing hospitalization rates and medical costs.
Since the renaissance of deep learning (DL), facial expression recognition (FER) has received a lot of interest, with continual improvement in the performance. Hand-in-hand with performance, new challenges have come up. Modern FER systems deal with face images captured under uncontrolled conditions (also called in-the-wild scenario) including occlusions and pose variations. They successfully handle such conditions using deep networks that come with various components like transfer learning, attention mechanism and local-global context extractor. However, these deep networks are highly complex with large number of parameters, making them unfit to be deployed in real scenarios. Is it possible to build a light-weight network that can still show significantly good performance on FER under in-the-wild scenario? In this work, we methodically build such a network and call it as Imponderous Net. We leverage on the aforementioned components of deep networks for FER, and analyse, carefully choose and fit them to arrive at Imponderous Net. Our Imponderous Net is a low calorie net with only 1.45M parameters, which is almost 50x less than that of a state-of-the-art (SOTA) architecture. Further, during inference, it can process at the real time rate of 40 frames per second (fps) in an intel-i7 cpu. Though it is low calorie, it is still power packed in its performance, overpowering other light-weight architectures and even few high capacity architectures. Specifically, Imponderous Net reports 87.09\%, 88.17\% and 62.06\% accuracies on in-the-wild datasets RAFDB, FERPlus and AffectNet respectively. It also exhibits superior robustness under occlusions and pose variations in comparison to other light-weight architectures from the literature.
Research indicates that monotonous automated driving increases the incidence of fatigued driving. Although many prediction models based on advanced machine learning techniques were proposed to monitor driver fatigue, especially in manual driving, little is known about how these black-box machine learning models work. In this paper, we proposed a combination of eXtreme Gradient Boosting (XGBoost) and SHAP (SHapley Additive exPlanations) to predict driver fatigue with explanations due to their efficiency and accuracy. First, in order to obtain the ground truth of driver fatigue, PERCLOS (percentage of eyelid closure over the pupil over time) between 0 and 100 was used as the response variable. Second, we built a driver fatigue regression model using both physiological and behavioral measures with XGBoost and it outperformed other selected machine learning models with 3.847 root-mean-squared error (RMSE), 1.768 mean absolute error (MAE) and 0.996 adjusted $R^2$. Third, we employed SHAP to identify the most important predictor variables and uncovered the black-box XGBoost model by showing the main effects of most important predictor variables globally and explaining individual predictions locally. Such an explainable driver fatigue prediction model offered insights into how to intervene in automated driving when necessary, such as during the takeover transition period from automated driving to manual driving.
This paper proposes a framework for 3D obstacle avoidance in the presence of partial observability of environment obstacles. The method focuses on the utility of the Artificial Potential Field (APF) controller in a practical setting where noisy and incomplete information about the proximity is inevitable. We propose a Particle Filter (PF) approach to estimate potential obstacle locations in an input depth image stream. The probable candidates are then used to generate an action that maneuvers the robot towards the negative gradient of potential at each time instant. Rigorous experimental validation on a quadrotor UAV demonstrates the robustness and reliability of the method when robot's sensitivity to incorrect perception information can be concerning. The proposed method is highly compute efficient for real-time applications and agile robots.