Road-vehicle accidents are mostly due to human errors, and many such accidents could be avoided by continuously monitoring the driver. Driver monitoring (DM) is a topic of growing interest in the automotive industry, and it will remain relevant for all vehicles that are not fully autonomous, and thus for decades for the average vehicle owner. The present paper focuses on the first step of DM, which consists in characterizing the state of the driver. Since DM will be increasingly linked to driving automation (DA), this paper presents a clear view of the role of DM at each of the six SAE levels of DA. This paper surveys the state of the art of DM, and then synthesizes it, providing a unique, structured, polychotomous view of the many characterization techniques of DM. Informed by the survey, the paper characterizes the driver state along the five main dimensions--called here "(sub)states"--of drowsiness, mental workload, distraction, emotions, and under the influence. The polychotomous view of DM is presented through a pair of interlocked tables that relate these states to their indicators (e.g., the eye-blink rate) and the sensors that can access each of these indicators (e.g., a camera). The tables factor in not only the effects linked directly to the driver, but also those linked to the (driven) vehicle and the (driving) environment. They show, at a glance, to concerned researchers, equipment providers, and vehicle manufacturers (1) most of the options they have to implement various forms of advanced DM systems, and (2) fruitful areas for further research and innovation.
The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pre-trained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 - 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets would heavily deteriorate when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction, as well as on the OffensEval 2019 task.
The design of machines and algorithms capable of learning in a dynamically changing environment has become an increasingly topical problem with the increase of the size and heterogeneity of data available to learning systems. As a consequence, the key issue of Continual Learning has become that of addressing the stability-plasticity dilemma of connectionist systems, as they need to adapt their model without forgetting previously acquired knowledge. Within this context, rehearsal-based methods i.e., solutions in where the learner exploits memory to revisit past data, has proven to be very effective, leading to performance at the state-of-the-art. In our study, we propose an analysis of the memory quantity/quality trade-off adopting various data reduction approaches to increase the number of instances storable in memory. In particular, we investigate complex instance compression techniques such as deep encoders, but also trivial approaches such as image resizing and linear dimensionality reduction. Our findings suggest that the optimal trade-off is severely skewed toward instance quantity, where rehearsal approaches with several heavily compressed instances easily outperform state-of-the-art approaches with the same amount of memory at their disposal. Further, in high memory configurations, deep approaches extracting spatial structure combined with extreme resizing (of the order of $8\times8$ images) yield the best results, while in memory-constrained configurations where deep approaches cannot be used due to their memory requirement in training, Extreme Learning Machines (ELM) offer a clear advantage.
Automatic music generation has become an epicenter research topic for many scientists in artificial intelligence, who are also interested in the music industry. Being a balanced combination of math and art, music in collaboration with A.I. can simplify the generation process for new musical pieces, and ease the interpretation of it to a tangible level. On the other hand, the artistic nature of music and its mingling with the senses and feelings of the composer makes the artificial generation and mathematical modeling of it infeasible. In fact, there are no clear evaluation measures that can combine the objective music grammar and structure with the subjective audience satisfaction goal. Also, original music contains different elements that it is inevitable to put together. Therefore, in this paper, a method based on a genetic multi-objective evolutionary optimization algorithm for the generation of polyphonic music (melody with rhythm and harmony or appropriate chords) is introduced in which three specific goals determine the qualifications of the music generated. One of the goals is the rules and regulations of music, which, along with the other two goals, including the scores of music experts and ordinary listeners, fits the cycle of evolution to get the most optimal response. The scoring of experts and listeners separately is modeled using a Bi-LSTM neural network and has been incorporated in the fitness function of the algorithm. The results show that the proposed method is able to generate difficult and pleasant pieces with desired styles and lengths, along with harmonic sounds that follow the grammar while attracting the listener, at the same time.
Digital video pervades daily life. Mobile video, digital TV, and digital cinema are now ubiquitous, and as such, the field of Digital Video Processing (DVP) has experienced tremendous growth. Digital video systems also permeate scientific and engineering disciplines including but not limited to astronomy, communications, surveillance, entertainment, video coding, computer vision, and vision research. As a consequence, educational tools for DVP must cater to a large and diverse base of students. Towards enhancing DVP education we have created a carefully constructed gallery of educational tools that is designed to complement a comprehensive corpus of online lectures by providing examples of DVP on real-world content, along with a user-friendly interface that organizes numerous key DVP topics ranging from analog video, to human visual processing, to modern video codecs, etc. This demonstration gallery is currently being used effectively in the graduate class ``Digital Video'' at the University of Texas at Austin. Students receive enhanced access to concepts through both learning theory from highly visual lectures and watching concrete examples from the gallery, which captures the beauty of the underlying principles of modern video processing. To better understand the educational value of these tools, we conducted a pair of questionaire-based surveys to assess student background, expectations, and outcomes. The survey results support the teaching efficacy of this new didactic video toolset.
Object 6D pose estimation is an important research topic in the field of computer vision due to its wide application requirements and the challenges brought by complexity and changes in the real-world. We think fully exploring the characteristics of spatial relationship between points will help to improve the pose estimation performance, especially in the scenes of background clutter and partial occlusion. But this information was usually ignored in previous work using RGB image or RGB-D data. In this paper, we propose a framework for 6D pose estimation from RGB-D data based on spatial structure characteristics of 3D keypoints. We adopt point-wise dense feature embedding to vote for 3D keypoints, which makes full use of the structure information of the rigid body. After the direction vectors pointing to the keypoints are predicted by CNN, we use RANSAC voting to calculate the coordinate of the 3D keypoints, then the pose transformation can be easily obtained by the least square method. In addition, a spatial dimension sampling strategy for points is employed, which makes the method achieve excellent performance on small training sets. The proposed method is verified on two benchmark datasets, LINEMOD and OCCLUSION LINEMOD. The experimental results show that our method outperforms the state-of-the-art approaches, achieves ADD(-S) accuracy of 98.7\% on LINEMOD dataset and 52.6\% on OCCLUSION LINEMOD dataset in real-time.
Coronavirus disease 2019 (COVID-19) is one of the most destructive pandemic after millennium, forcing the world to tackle a health crisis. Automated classification of lung infections from chest X-ray (CXR) images strengthened traditional healthcare strategy to handle COVID-19. However, classifying COVID-19 from pneumonia cases using CXR image is challenging because of shared spatial characteristics, high feature variation in infections and contrast diversity between cases. Moreover, massive data collection is impractical for a newly emerged disease, which limited the performance of common deep learning models. To address this challenging topic, Multiscale Attention Guided deep network with Soft Distance regularization (MAG-SD) is proposed to automatically classify COVID-19 from pneumonia CXR images. In MAG-SD, MA-Net is used to produce prediction vector and attention map from multiscale feature maps. To relieve the shortage of training data, attention guided augmentations along with a soft distance regularization are posed, which requires a few labeled data to generate meaningful augmentations and reduce noise. Our multiscale attention model achieves better classification performance on our pneumonia CXR image dataset. Plentiful experiments are proposed for MAG-SD which demonstrates that it has its unique advantage in pneumonia classification over cuttingedge models. The code is available at https://github.com/ JasonLeeGHub/MAG-SD.
Continuous learning from streaming data is among the most challenging topics in the contemporary machine learning. In this domain, learning algorithms must not only be able to handle massive volumes of rapidly arriving data, but also adapt themselves to potential emerging changes. The phenomenon of the evolving nature of data streams is known as concept drift. While there is a plethora of methods designed for detecting its occurrence, all of them assume that the drift is connected with underlying changes in the source of data. However, one must consider the possibility of a malicious injection of false data that simulates a concept drift. This adversarial setting assumes a poisoning attack that may be conducted in order to damage the underlying classification system by forcing adaptation to false data. Existing drift detectors are not capable of differentiating between real and adversarial concept drift. In this paper, we propose a framework for robust concept drift detection in the presence of adversarial and poisoning attacks. We introduce the taxonomy for two types of adversarial concept drifts, as well as a robust trainable drift detector. It is based on the augmented Restricted Boltzmann Machine with improved gradient computation and energy function. We also introduce Relative Loss of Robustness - a novel measure for evaluating the performance of concept drift detectors under poisoning attacks. Extensive computational experiments, conducted on both fully and sparsely labeled data streams, prove the high robustness and efficacy of the proposed drift detection framework in adversarial scenarios.
Designing task-oriented dialogue systems is a challenging research topic, since it needs not only to generate utterances fulfilling user requests but also to guarantee the comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL), however, the bias in annotated system utterances remains as a bottleneck. Reinforcement learning (RL) deals with the problem through using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted when improving the performance on fulfilling user requests. In our work, we (1) propose modelling the hierarchical structure between dialogue policy and natural language generator (NLG) with the option framework, called HDNO; (2) train HDNO with hierarchical reinforcement learning (HRL), as well as suggest alternating updates between dialogue policy and NLG during HRL inspired by fictitious play, to preserve the comprehensibility of generated system utterances while improving fulfilling user requests; and (3) propose using a discriminator modelled with language models as an additional reward to further improve the comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, the datasets on multi-domain dialogues, in comparison with word-level E2E model trained with RL, LaRL and HDSA, showing a significant improvement on the total performance evaluated with automatic metrics.