Social media platforms can provide space for discourse that contains hate speech and, even worse, can act as a propagation mechanism for hate crimes. The FBI's Uniform Crime Reporting (UCR) Program collects hate crime data and releases a statistical report yearly. These statistics help determine national hate crime trends. They can also provide valuable holistic and strategic insight for law enforcement agencies or help lawmakers justify specific legislation. However, the reports are mostly released the following year and lag behind many immediate needs. Recent research mainly focuses on hate speech detection in social media text or on empirical studies of the impact of a confirmed crime. This paper proposes a framework that first uses text mining techniques to extract hate crime events from New York Times news, then uses the results to help predict national-level and state-level hate crime trends in the United States. Experimental results show that our method can significantly enhance prediction performance compared with time series or regression methods without event-related factors. Our framework broadens the methods available for predicting national-level and state-level hate crime trends.
As the volume of long-form spoken-word content such as podcasts explodes, many platforms want to present short, meaningful, and logically coherent segments extracted from the full content. Users can consume such segments to sample content before diving in, and the platform can use them to promote and recommend content. However, little published work focuses on the segmentation of spoken-word content, where the errors (noise) in transcripts generated by automatic speech recognition (ASR) services pose many challenges. Here we build a novel dataset of complete transcriptions of over 400 podcast episodes, in which we label the position of introductions in each episode. These introductions contain information about the episodes' topics, hosts, and guests, and because they are created by the authors themselves, they provide a valuable summary of the episode content. We further augment our dataset with word substitutions to increase the amount of available training data. We train three Transformer models based on the pre-trained BERT and different augmentation strategies, which achieve significantly better performance than a static embedding model, showing that it is possible to capture generalized, larger-scale structural information from noisy, loosely organized speech data. This is further demonstrated through an analysis of the models' inner architecture. Our methods and dataset can be used to facilitate future work on the structure-based segmentation of spoken-word content.
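The word-substitution augmentation mentioned in the podcast abstract above can be illustrated with a minimal sketch. The abstract does not say how substitutes are chosen, so the `substitutes` lookup table, the `rate` parameter, and the function name are hypothetical. The sketch replaces tokens one-for-one so that token-level labels (such as the position of an introduction) stay aligned with the augmented transcript.

```python
import random


def augment_with_substitutions(tokens, substitutes, rate=0.1, seed=0):
    """Return a copy of `tokens` with some words swapped for substitutes.

    `substitutes` maps a lowercase word to a list of replacement words
    (a hypothetical table; the source does not specify how it is built).
    Each token that has an entry is replaced with probability `rate`.
    The output has the same length as the input, so per-token labels
    remain valid for the augmented sequence.
    """
    rng = random.Random(seed)  # local RNG keeps the augmentation reproducible
    augmented = []
    for token in tokens:
        options = substitutes.get(token.lower())
        if options and rng.random() < rate:
            augmented.append(rng.choice(options))
        else:
            augmented.append(token)
    return augmented
```

Calling the function several times with different `seed` values would yield several noisy variants of one transcript, which is one plausible way to enlarge a small labeled set.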
In an emergency room (ER) setting, the diagnosis of stroke is a common challenge. Due to excessive execution time and cost, an MRI scan is usually not available in the ER. Clinical tests are commonly referred to in stroke screening, but neurologists may not be immediately available. We propose a novel multimodal deep learning framework, DeepStroke, to achieve computer-aided stroke presence assessment by recognizing the patterns of facial motion incoordination and speech inability for patients with suspicion of stroke in an acute setting. Our proposed DeepStroke takes video data for local facial paralysis detection and audio data for global speech disorder analysis. It further leverages a multi-modal lateral fusion to combine the low- and high-level features and provides mutual regularization for joint training. A novel adversarial training loss is also introduced to obtain identity-independent and stroke-discriminative features. Experiments on our video-audio dataset with actual ER patients show that the proposed approach outperforms state-of-the-art models and achieves better performance than ER doctors, attaining a 6.60% higher sensitivity and maintaining 4.62% higher accuracy when specificity is aligned. Meanwhile, each assessment can be completed in less than 6 minutes, demonstrating the framework's great potential for clinical implementation.
Significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems in the past two decades. However, little research has been devoted to understanding and interpreting the black-box nature of these deep-learning based scoring models. Recent work shows that automated scoring systems are vulnerable even to common-sense adversarial samples. Their lack of natural language understanding capability raises questions about models that are actively used by millions of candidates for life-changing decisions. With scoring being a highly multi-modal task, it becomes imperative for scoring models to be validated and tested on all these modalities. We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms and why they are susceptible to adversarial samples. We find that the systems tested treat essays not as pieces of prose with a natural flow of speech and grammatical structure, but as `word-soups' in which a few words are much more important than the others. Removing the context surrounding those few important words causes the prose to lose its flow of speech and grammar, but has little impact on the predicted score. We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as ``the world is flat'' actually increases the score instead of decreasing it.
ElectroCOrticoGraphy (ECoG) technology measures electrical activity in the human brain via electrodes placed directly on the cortical surface during neurosurgery. Because ECoG records activity at an extremely fine temporal resolution, ECoG experiments have allowed scientists to better understand how the human brain processes speech. By its nature, ECoG data is extremely difficult for neuroscientists to interpret directly, for two major reasons. Firstly, ECoG data tends to be extremely large, as each individual experiment yields up to several GB of data. Secondly, ECoG data has a complex, higher-order nature; after signal processing, this type of data is typically organized as a 4-way tensor consisting of trials by electrodes by frequency by time. In this paper, we develop an interpretable dimension reduction approach called Regularized Higher Order Principal Components Analysis, as well as an extension to Regularized Higher Order Partial Least Squares, that allows neuroscientists to explore and visualize ECoG data. Our approach employs a sparse and functional Candecomp-Parafac (CP) decomposition that incorporates sparsity to select relevant electrodes and frequency bands, as well as smoothness over time and frequency, yielding directly interpretable factors. We demonstrate both the performance and interpretability of our method with an ECoG case study on audio and visual processing of human speech.
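For readers unfamiliar with the CP decomposition named in the ECoG abstract above, the generic unregularized model for a 4-way tensor can be written as below. The symbols are assumptions introduced here for illustration, not the paper's notation: $\mathcal{X}$ is the trials by electrodes by frequency by time tensor with dimensions $n \times p \times q \times r$, $K$ is the number of components, $d_k$ is a scale, $\circ$ is the outer product, and $\mathbf{u}_k$, $\mathbf{v}_k$, $\mathbf{w}_k$, $\mathbf{t}_k$ are the trial, electrode, frequency, and time factors of component $k$.

```latex
\mathcal{X} \;\approx\; \sum_{k=1}^{K} d_k \,
  \mathbf{u}_k \circ \mathbf{v}_k \circ \mathbf{w}_k \circ \mathbf{t}_k,
\qquad
\mathcal{X} \in \mathbb{R}^{n \times p \times q \times r},\;
\mathbf{u}_k \in \mathbb{R}^{n},\;
\mathbf{v}_k \in \mathbb{R}^{p},\;
\mathbf{w}_k \in \mathbb{R}^{q},\;
\mathbf{t}_k \in \mathbb{R}^{r}.
```

In the regularized variant the abstract describes, sparsity would act on the electrode and frequency factors ($\mathbf{v}_k$, $\mathbf{w}_k$) and smoothness on the frequency and time factors ($\mathbf{w}_k$, $\mathbf{t}_k$); the exact penalty terms are not given in the abstract and are not reproduced here.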
For real-world deployment of automatic speech recognition (ASR), a system should be capable of fast inference while requiring few computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods that introduce an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% $\rightarrow$ 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models with no degradation of inference speed ($<$ 0.1 RTF using CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
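To make the combination of CTC and mask-predict in the abstract above more concrete, here is a minimal sketch of the first half of such a pipeline: greedy CTC decoding followed by masking of low-confidence tokens, which a separate decoder would then fill in. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `threshold` value, and the choice to score a token by its most confident frame are all hypothetical, and the mask-filling decoder and the length-prediction objective are not shown.

```python
from itertools import groupby

BLANK = 0          # assumed index of the CTC blank symbol
MASK = "<mask>"    # placeholder for tokens a mask-predict decoder would refill


def ctc_greedy_with_mask(frame_probs, vocab, threshold=0.9):
    """Collapse frame-level CTC outputs and mask uncertain tokens.

    frame_probs: list of per-frame probability lists over vocabulary indices.
    vocab: list mapping index to token string (index BLANK is the blank).
    A token is kept if its most confident frame reaches `threshold`,
    otherwise it is replaced by MASK.
    """
    # Best index and its probability for every frame.
    best = []
    for probs in frame_probs:
        idx = max(range(len(probs)), key=probs.__getitem__)
        best.append((idx, probs[idx]))

    tokens = []
    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    for idx, run in groupby(best, key=lambda pair: pair[0]):
        if idx == BLANK:
            continue
        confidence = max(prob for _, prob in run)
        tokens.append(vocab[idx] if confidence >= threshold else MASK)
    return tokens
```

Because every frame is decoded independently and the masking step is a single pass, nothing here depends on previously emitted tokens, which is the property that makes non-autoregressive decoding fast.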
Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.
We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period, because they lack the clear sub-word units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end-to-end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learns generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features and traditional audio features such as MFCC typically leads to only marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone.
We propose a system for parsing and translating natural language that learns from examples and uses some background knowledge. As our parsing model we choose a deterministic shift-reduce type parser that integrates part-of-speech tagging and syntactic and semantic processing. Applying machine learning techniques, the system uses parse action examples acquired under supervision to generate a parser in the form of a decision structure, a generalization of decision trees. To learn good parsing and translation decisions, our system relies heavily on context, as encoded in currently 205 features describing the morphological, syntactical and semantical aspects of a given parse state. Compared with recent probabilistic systems that were trained on 40,000 sentences, our system relies on more background knowledge and a deeper analysis, but radically fewer examples, currently 256 sentences. We test our parser on lexically limited sentences from the Wall Street Journal and achieve accuracy rates of 89.8% for labeled precision, 98.4% for part of speech tagging and 56.3% of test sentences without any crossing brackets. Machine translations of 32 Wall Street Journal sentences to German have been evaluated by 10 bilingual volunteers and been graded as 2.4 on a 1.0 (best) to 6.0 (worst) scale for both grammatical correctness and meaning preservation.
This paper describes an extension to the constraint satisfaction problem (CSP) called MUSE CSP (MUltiply SEgmented Constraint Satisfaction Problem). This extension is especially useful for those problems which segment into multiple sets of partially shared variables. Such problems arise naturally in signal processing applications including computer vision, speech processing, and handwriting recognition. For these applications, it is often difficult to segment the data in only one way given the low-level information utilized by the segmentation algorithms. MUSE CSP can be used to compactly represent several similar instances of the constraint satisfaction problem. If multiple instances of a CSP have some common variables which have the same domains and constraints, then they can be combined into a single instance of a MUSE CSP, reducing the work required to apply the constraints. We introduce the concepts of MUSE node consistency, MUSE arc consistency, and MUSE path consistency. We then demonstrate how MUSE CSP can be used to compactly represent lexically ambiguous sentences and the multiple sentence hypotheses that are often generated by speech recognition algorithms so that grammar constraints can be used to provide parses for all syntactically correct sentences. Algorithms for MUSE arc and path consistency are provided. Finally, we discuss how to create a MUSE CSP from a set of CSPs which are labeled to indicate when the same variable is shared by more than a single CSP.
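As background for the MUSE consistency notions introduced in the abstract above, the following sketch shows arc consistency for an ordinary, single-segmentation CSP using the classical AC-3 procedure. It is not the MUSE arc consistency algorithm from the paper, which is not specified in the abstract; the function names and the data layout (`domains`, `neighbors`, `constraint`) are assumptions made for this illustration.

```python
from collections import deque


def revise(domains, constraint, x, y):
    """Remove values of x that have no supporting value in y's domain."""
    removed = False
    for vx in list(domains[x]):
        if not any(constraint(x, vx, y, vy) for vy in domains[y]):
            domains[x].remove(vx)
            removed = True
    return removed


def ac3(domains, neighbors, constraint):
    """Enforce arc consistency in place (classical AC-3).

    domains: dict mapping each variable to a set of candidate values.
    neighbors: dict mapping each variable to the variables it shares
        a binary constraint with.
    constraint: function (x, vx, y, vy) -> bool, True if x=vx and y=vy
        are compatible.
    Returns False if some domain becomes empty, True otherwise.
    """
    queue = deque((x, y) for x in neighbors for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        if revise(domains, constraint, x, y):
            if not domains[x]:
                return False  # no value left for x: the CSP is inconsistent
            # x's domain shrank, so arcs pointing at x must be rechecked.
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))
    return True
```

For example, with two variables `A` and `B` over `{1, 2, 3}` and the constraint `A < B`, the procedure prunes `3` from `A` and `1` from `B`. The motivation stated in the abstract is that when several such CSP instances share variables, running this kind of filtering once on a combined representation avoids repeating the work per instance.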