Transformer-based models have led to a significant innovation in various classic and practical subjects, including speech processing, natural language processing, and computer vision. On top of the transformer, the attention-based end-to-end automatic speech recognition (ASR) models have become a popular fashion in recent years. Specifically, the non-autoregressive modeling, which can achieve fast inference speed and comparable performance when compared to conventional autoregressive methods, is an emergent research topic. In the context of natural language processing, the bidirectional encoder representations from transformers (BERT) model has received widespread attention, partially due to its ability to infer contextualized word representations and to obtain superior performances of downstream tasks by performing only simple fine-tuning. In order to not only inherit the advantages of non-autoregressive ASR modeling, but also receive benefits from a pre-trained language model (e.g., BERT), a non-autoregressive transformer-based end-to-end ASR model based on BERT is presented in this paper. A series of experiments conducted on the AISHELL-1 dataset demonstrates competitive or superior results of the proposed model when compared to state-of-the-art ASR systems.
Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the "Rashomon set" of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.
Gait recognition is an appealing biometric modality which aims to identify individuals based on the way they walk. Deep learning has reshaped the research landscape in this area since 2015 through the ability to automatically learn discriminative representations. Gait recognition methods based on deep learning now dominate the state-of-the-art in the field and have fostered real-world applications. In this paper, we present a comprehensive overview of breakthroughs and recent developments in gait recognition with deep learning, and cover broad topics including datasets, test protocols, state-of-the-art solutions, challenges, and future research directions. We first review the commonly used gait datasets along with the principles designed for evaluating them. We then propose a novel taxonomy made up of four separate dimensions namely body representation, temporal representation, feature representation, and neural architecture, to help characterize and organize the research landscape and literature in this area. Following our proposed taxonomy, a comprehensive survey of gait recognition methods using deep learning is presented with discussions on their performances, characteristics, advantages, and limitations. We conclude this survey with a discussion on current challenges and mention a number of promising directions for future research in gait recognition.
Unmanned Aerial Vehicle (UAV) offers lots of applications in both commerce and recreation. With this, monitoring the operation status of UAVs is crucially important. In this work, we consider the task of tracking UAVs, providing rich information such as location and trajectory. To facilitate research on this topic, we propose a dataset, Anti-UAV, with more than 300 video pairs containing over 580k manually annotated bounding boxes. The releasing of such a large-scale dataset could be a useful initial step in research of tracking UAVs. Furthermore, the advancement of addressing research challenges in Anti-UAV can help the design of anti-UAV systems, leading to better surveillance of UAVs. Besides, a novel approach named dual-flow semantic consistency (DFSC) is proposed for UAV tracking. Modulated by the semantic flow across video sequences, the tracker learns more robust class-level semantic information and obtains more discriminative instance-level features. Experimental results demonstrate that Anti-UAV is very challenging, and the proposed method can effectively improve the tracker's performance. The Anti-UAV benchmark and the code of the proposed approach will be publicly available at https://github.com/ucas-vg/Anti-UAV.
Mining Electronic Health Records (EHRs) becomes a promising topic because of the rich information they contain. By learning from EHRs, machine learning models can be built to help human experts to make medical decisions and thus improve healthcare quality. Recently, many models based on sequential or graph models are proposed to achieve this goal. EHRs contain multiple entities and relations and can be viewed as a heterogeneous graph. However, previous studies ignore the heterogeneity in EHRs. On the other hand, current heterogeneous graph neural networks cannot be simply used on an EHR graph because of the existence of hub nodes in it. To address this issue, we propose Heterogeneous Similarity Graph Neural Network (HSGNN) analyze EHRs with a novel heterogeneous GNN. Our framework consists of two parts: one is a preprocessing method and the other is an end-to-end GNN. The preprocessing method normalizes edges and splits the EHR graph into multiple homogeneous graphs while each homogeneous graph contains partial information of the original EHR graph. The GNN takes all homogeneous graphs as input and fuses all of them into one graph to make a prediction. Experimental results show that HSGNN outperforms other baselines in the diagnosis prediction task.
The amount of textual data generation has increased enormously due to the effortless access of the Internet and the evolution of various web 2.0 applications. These textual data productions resulted because of the people express their opinion, emotion or sentiment about any product or service in the form of tweets, Facebook post or status, blog write up, and reviews. Sentiment analysis deals with the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude toward a particular topic is positive, negative, or neutral. The impact of customer review is significant to perceive the customer attitude towards a restaurant. Thus, the automatic detection of sentiment from reviews is advantageous for the restaurant owners, or service providers and customers to make their decisions or services more satisfactory. This paper proposes, a deep learning-based technique (i.e., BiLSTM) to classify the reviews provided by the clients of the restaurant into positive and negative polarities. A corpus consists of 8435 reviews is constructed to evaluate the proposed technique. In addition, a comparative analysis of the proposed technique with other machine learning algorithms presented. The results of the evaluation on test dataset show that BiLSTM technique produced in the highest accuracy of 91.35%.
Attribute image manipulation has been a very active topic since the introduction of Generative Adversarial Networks (GANs). Exploring the disentangled attribute space within a transformation is a very challenging task due to the multiple and mutually-inclusive nature of the facial images, where different labels (eyeglasses, hats, hair, identity, etc.) can co-exist at the same time. Several works address this issue either by exploiting the modality of each domain/attribute using a conditional random vector noise, or extracting the modality from an exemplary image. However, existing methods cannot handle both random and reference transformations for multiple attributes, which limits the generality of the solutions. In this paper, we successfully exploit a multimodal representation that handles all attributes, be it guided by random noise or exemplar images, while only using the underlying domain information of the target domain. We present extensive qualitative and quantitative results for facial datasets and several different attributes that show the superiority of our method. Additionally, our method is capable of adding, removing or changing either fine-grained or coarse attributes by using an image as a reference or by exploring the style distribution space, and it can be easily extended to head-swapping and face-reenactment applications without being trained on videos.
Whether to give rights to artificial intelligence (AI) and robots has been a sensitive topic since the European Parliament proposed advanced robots could be granted "electronic personalities." Numerous scholars who favor or disfavor its feasibility have participated in the debate. This paper presents an experiment (N=1270) that 1) collects online users' first impressions of 11 possible rights that could be granted to autonomous electronic agents of the future and 2) examines whether debunking common misconceptions on the proposal modifies one's stance toward the issue. The results indicate that even though online users mainly disfavor AI and robot rights, they are supportive of protecting electronic agents from cruelty (i.e., favor the right against cruel treatment). Furthermore, people's perceptions became more positive when given information about rights-bearing non-human entities or myth-refuting statements. The style used to introduce AI and robot rights significantly affected how the participants perceived the proposal, similar to the way metaphors function in creating laws. For robustness, we repeated the experiment over a more representative sample of U.S. residents (N=164) and found that perceptions gathered from online users and those by the general population are similar.