By using a computer keyboard as a finger recording device, we construct the largest existing dataset for gesture recognition via surface electromyography (sEMG), and use deep learning to achieve over 90% character-level accuracy on reconstructing typed text entirely from measured muscle potentials. We prioritize the temporal structure of the EMG signal instead of the spatial structure of the electrode layout, using network architectures inspired by those used for real-time spoken language transcription. Our architecture recognizes the rapid movements of natural computer typing, which occur at irregular intervals and often overlap in time. The extensive size of our dataset also allows us to study gesture recognition after synthetically downgrading the spatial or temporal resolution, showing the system capabilities necessary for real-time gesture recognition.
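A minimal sketch of the kind of temporal pipeline this abstract describes, under assumptions of our own: a 1-D convolutional encoder over multichannel sEMG trained with a CTC loss, as in streaming speech transcription. Channel count, layer sizes, and alphabet are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

class EMGToChar(nn.Module):
    def __init__(self, n_channels=16, n_chars=27, hidden=128):
        super().__init__()
        # 1-D convolutions over time emphasize the temporal structure of the signal
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_chars + 1)  # +1 for the CTC blank symbol

    def forward(self, emg):                  # emg: (batch, channels, time)
        h = self.encoder(emg)                # (batch, hidden, time)
        return self.head(h.transpose(1, 2))  # (batch, time, n_chars + 1)

model = EMGToChar()
logits = model(torch.randn(8, 16, 500))      # 8 windows, 16 electrodes, 500 samples
log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC expects (time, batch, classes)
loss = nn.CTCLoss(blank=27)(
    log_probs,
    torch.randint(0, 27, (8, 20)),           # dummy character targets
    torch.full((8,), 500), torch.full((8,), 20),
)
```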
In this paper, we explore possible improvements of transformer models in a low-resource setting. In particular, we present our approaches to the first two of the three subtasks of the MEDDOPROF competition, i.e., the extraction and classification of job expressions in Spanish clinical texts. As we are neither language nor domain experts, we experiment with the multilingual XLM-R transformer model and treat these low-resource information extraction tasks as sequence-labeling problems. We explore domain- and language-adaptive pretraining, transfer learning, and strategic data splits to boost the transformer model. Our results show strong improvements of up to 5.3 F1 points over a fine-tuned XLM-R model. Our best models achieve 83.2 and 79.3 F1 on the first two tasks, respectively.
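A minimal sketch of the baseline setup this abstract improves upon: fine-tuning XLM-R for token classification (sequence labeling) with Hugging Face Transformers. The BIO label set shown is a hypothetical example, not the competition's official tag set.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PROFESION", "I-PROFESION"]  # hypothetical BIO scheme
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

enc = tokenizer("Trabaja como enfermera en el hospital.", return_tensors="pt")
out = model(**enc)
pred = out.logits.argmax(-1)  # one label id per subword token
```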
Standard accuracy metrics have shown that Math Word Problem (MWP) solvers achieve high performance on benchmark datasets. However, the extent to which existing MWP solvers truly understand language and its relation with numbers is still unclear. In this paper, we generate adversarial attacks to evaluate the robustness of state-of-the-art MWP solvers. We propose two methods, Question Reordering and Sentence Paraphrasing, to generate adversarial attacks. We conduct experiments across three neural MWP solvers over two benchmark datasets. On average, our attacks reduce the accuracy of MWP solvers by over 40 percentage points on these datasets. Our results demonstrate that existing MWP solvers are sensitive to linguistic variations in the problem text. We verify the validity and quality of the generated adversarial examples through human evaluation.
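An illustrative sketch of one attack family named above, Question Reordering: moving the question sentence of an MWP to the front so that surface order changes while the underlying arithmetic does not. The sentence-splitting heuristic here is an assumption, not the paper's exact procedure.

```python
import re

def reorder_question(problem: str) -> str:
    # Split on sentence-final punctuation; assume the question ends with "?"
    sentences = re.split(r"(?<=[.?])\s+", problem.strip())
    body = [s for s in sentences if not s.endswith("?")]
    question = [s for s in sentences if s.endswith("?")]
    return " ".join(question + body)

mwp = "Tom has 3 apples. He buys 2 more. How many apples does he have now?"
print(reorder_question(mwp))
# -> "How many apples does he have now? Tom has 3 apples. He buys 2 more."
```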
Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. Transformer-based models learn inter- and intra-modal attention through a set of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks, GAN-based image synthesis and image captioning. We also propose a new evaluation metric that measures the similarity between the learnt visual and textual embeddings. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.
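A minimal sketch of the kind of metric the abstract alludes to (the paper's exact formulation is not reproduced here): scoring how well learned visual and textual embeddings align in the joint space via mean cosine similarity of matched image-text pairs.

```python
import torch
import torch.nn.functional as F

def alignment_score(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    # img_emb, txt_emb: (n_pairs, dim); row i of each is a matched image-text pair
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return (img * txt).sum(-1).mean()  # average cosine similarity over pairs

score = alignment_score(torch.randn(32, 256), torch.randn(32, 256))
```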
In this paper, we propose a novel edge-editing approach to extract relation information from a document. In this approach, we treat the relations in a document as a relation graph among entities. The relation graph is constructed iteratively by editing the edges of an initial graph, which may be a graph extracted by another system or an empty graph. Edges are edited in a close-first manner using the document and the graph constructed so far; each edge is represented with document context information from a pretrained transformer model and graph context information from a graph convolutional neural network model. We evaluate our approach on the task of extracting material synthesis procedures from materials science texts. The experimental results show the effectiveness of our approach in editing graphs initialized by our in-house rule-based system as well as empty graphs.
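A schematic sketch of the close-first editing order, with assumptions throughout: candidate edges are visited from the closest entity pair to the farthest, and each edge label is predicted from the document plus the graph built so far. `classify_edge` stands in for the transformer-plus-GCN scorer described above.

```python
def edit_edges(entities, document, classify_edge, initial_graph=None):
    graph = dict(initial_graph or {})  # (head_id, tail_id) -> relation label
    pairs = [(a, b) for i, a in enumerate(entities) for b in entities[i + 1:]]
    # close-first: sort candidate edges by distance between entity positions
    pairs.sort(key=lambda p: abs(p[0]["pos"] - p[1]["pos"]))
    for head, tail in pairs:
        # prediction may use edges already committed to `graph`
        label = classify_edge(document, graph, head, tail)
        if label != "NO_RELATION":
            graph[(head["id"], tail["id"])] = label
    return graph
```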
Language modeling (LM) for automatic speech recognition (ASR) does not usually incorporate utterance-level contextual information. For some domains, such as voice assistants, however, additional context, such as the time at which an utterance was spoken, provides a rich input signal. We introduce an attention mechanism for training neural speech recognition language models on both text and non-linguistic contextual data. When applied to a large de-identified dataset of utterances collected by a popular voice assistant platform, our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information. When evaluated on utterances extracted from the long tail of the dataset, our method reduces perplexity by 9.0% relative over a standard LM and by over 2.8% relative compared to a state-of-the-art model for contextual LM.
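A hedged sketch of the general idea, not the paper's architecture: the LM's token representations attend over embeddings of non-linguistic context (e.g., time of day), and the attended context vector is concatenated with the hidden state before prediction. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    def __init__(self, hidden=256, ctx_dim=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.ctx_proj = nn.Linear(ctx_dim, hidden)

    def forward(self, lm_hidden, ctx_feats):
        # lm_hidden: (batch, seq, hidden); ctx_feats: (batch, n_ctx, ctx_dim)
        ctx = self.ctx_proj(ctx_feats)
        attended, _ = self.attn(lm_hidden, ctx, ctx)  # queries: tokens; keys/values: context
        return torch.cat([lm_hidden, attended], dim=-1)

fused = ContextAttention()(torch.randn(4, 10, 256), torch.randn(4, 3, 32))
```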
Discourse relations among arguments reveal the logical structure of a debate conversation. However, no prior work has explicitly studied how the sequence of discourse relations influences a claim's impact. This paper empirically shows that the discourse relations between two arguments along the context path are essential factors for identifying the persuasive power of an argument. We further propose DisCOC to inject and fuse sentence-level structural discourse information with contextualized features derived from large-scale language models. Experimental results and extensive analysis show that the attention and gate mechanisms that explicitly model contexts and texts can indeed help the argument impact classification task defined by Durmus et al. (2019), and that discourse structures along the context path of the claim being classified can further boost performance.
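A minimal sketch of a gate mechanism in the spirit described above (names and dimensions are assumptions, not DisCOC itself): a learned gate decides how much sentence-level discourse information to mix into the contextualized representation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, contextual, discourse):
        # contextual, discourse: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([contextual, discourse], dim=-1)))
        return g * contextual + (1 - g) * discourse  # element-wise interpolation

fused = GatedFusion()(torch.randn(2, 768), torch.randn(2, 768))
```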
A number of problems in the processing of sound and natural language, as well as in other areas, can be reduced to simultaneously reading an input sequence and writing an output sequence of generally different length. There are well-developed methods that produce the output sequence based on an entirely known input. However, efficient methods that enable such transformations on-line do not exist. In this paper, we introduce an architecture that learns, via reinforcement, to decide whether to read the next input token or write an output token. This architecture is able to transform potentially infinite sequences on-line. In an experimental study, we compare it with state-of-the-art methods for neural machine translation. While it produces slightly worse translations than the Transformer, it outperforms the autoencoder with attention, even though our architecture translates texts on-line, thereby solving a more difficult problem than both reference methods.
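A schematic sketch of the read/write decision loop described above. `policy` stands in for the reinforcement-trained agent; its interface is an assumption for illustration only.

```python
def transduce_online(policy, source_stream):
    state, output = policy.init_state(), []
    while True:
        if policy.act(state) == "READ":
            # consume one input token as it arrives; pad with <eos> when exhausted
            state = policy.read(state, next(source_stream, "<eos>"))
        else:  # "WRITE"
            token, state = policy.write(state)  # emit one output token
            output.append(token)
            if token == "<eos>":                # the agent decides when to stop
                break
    return output
```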
Current storytelling systems focus more on generating stories with coherent plots regardless of the narration style, which is important for controllable text generation. Therefore, we propose a new task, stylized story generation, namely generating stories with a specified style given a leading context. To tackle the problem, we propose a novel generation model that first plans the stylized keywords and then generates the whole story with the guidance of the keywords. Besides, we propose two automatic metrics to evaluate the consistency between the generated story and the specified style. Experiments demonstrate that our model can controllably generate emotion-driven or event-driven stories based on the ROCStories dataset (Mostafazadeh et al., 2016). Our study presents insights for stylized story generation in further research.
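A high-level sketch of the plan-then-write scheme described above. Both callables are hypothetical stand-ins; the paper's keyword planner and story generator are neural models and are not shown here.

```python
def generate_stylized_story(leading_context, style, plan_keywords, write_story):
    # Stage 1: plan style-bearing keywords (e.g., emotion- or event-related terms)
    keywords = plan_keywords(leading_context, style)
    # Stage 2: generate the full story conditioned on the context and keyword plan
    return write_story(leading_context, keywords)
```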