Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in order for a machine to understand and make inferences on these health conditions, the ability to recognise when laymen's terms refer to a particular medical concept (i.e.\ text normalisation) is required. To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map between a social media phrase and a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that the combination of a phrase-based MT technique and the similarity between word vector representations outperforms the baselines that apply only either of them by up to 55%.
Statistics pedagogy values using a variety of examples. Thanks to text resources on the Web, and since statistical packages have the ability to analyze string data, it is now easy to use language-based examples in a statistics class. Three such examples are discussed here. First, many types of wordplay (e.g., crosswords and hangman) involve finding words with letters that satisfy a certain pattern. Second, linguistics has shown that idiomatic pairs of words often appear together more frequently than chance. For example, in the Brown Corpus, this is true of the phrasal verb to throw up (p-value=7.92E-10.) Third, a pangram contains all the letters of the alphabet at least once. These are searched for in Charles Dickens' A Christmas Carol, and their lengths are compared to the expected value given by the unequal probability coupon collector's problem as well as simulations.
Social media has been considered as a data source for tracking disease. However, most analyses are based on models that prioritize strong correlation with population-level disease rates over determining whether or not specific individual users are actually sick. Taking a different approach, we develop a novel system for social-media based disease detection at the individual level using a sample of professionally diagnosed individuals. Specifically, we develop a system for making an accurate influenza diagnosis based on an individual's publicly available Twitter data. We find that about half (17/35 = 48.57%) of the users in our sample that were sick explicitly discuss their disease on Twitter. By developing a meta classifier that combines text analysis, anomaly detection, and social network analysis, we are able to diagnose an individual with greater than 99% accuracy even if she does not discuss her health.
MPEG-7 (Moving Picture Experts Group Phase 7) is an XML-based international standard on semantic description of multimedia content. This document discusses the Linguistic DS and related tools. The linguistic DS is a tool, based on the GDA tag set (http://i-content.org/GDA/tagset.html), for semantic annotation of linguistic data in or associated with multimedia content. The current document text reflects `Study of FPDAM - MPEG-7 MDS Extensions' issued in March 2003, and not most part of MPEG-7 MDS, for which the readers are referred to the first version of MPEG-7 MDS document available from ISO (http://www.iso.org). Without that reference, however, this document should be mostly intelligible to those who are familiar with XML and linguistic theories. Comments are welcome and will be considered in the standardization process.
Robert French has argued that a disembodied computer is incapable of passing a Turing Test that includes subcognitive questions. Subcognitive questions are designed to probe the network of cultural and perceptual associations that humans naturally develop as we live, embodied and embedded in the world. In this paper, I show how it is possible for a disembodied computer to answer subcognitive questions appropriately, contrary to French's claim. My approach to answering subcognitive questions is to use statistical information extracted from a very large collection of text. In particular, I show how it is possible to answer a sample of subcognitive questions taken from French, by issuing queries to a search engine that indexes about 350 million Web pages. This simple algorithm may shed light on the nature of human (sub-) cognition, but the scope of this paper is limited to demonstrating that French is mistaken: a disembodied computer can answer subcognitive questions.
Semantic knowledge can be a great asset to natural language processing systems, but it is usually hand-coded for each application. Although some semantic information is available in general-purpose knowledge bases such as WordNet and Cyc, many applications require domain-specific lexicons that represent words and categories for a particular topic. In this paper, we present a corpus-based method that can be used to build semantic lexicons for specific categories. The input to the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category. A user then reviews the top-ranked words and decides which ones should be entered in the semantic lexicon. In experiments with five categories, users typically found about 60 words per category in 10-15 minutes to build a core semantic lexicon.
There are two main methodologies for constructing the knowledge base of a natural language analyser: the linguistic and the data-driven. Recent state-of-the-art part-of-speech taggers are based on the data-driven approach. Because of the known feasibility of the linguistic rule-based approach at related levels of description, the success of the data-driven approach in part-of-speech analysis may appear surprising. In this paper, a case is made for the syntactic nature of part-of-speech tagging. A new tagger of English that uses only linguistic distributional rules is outlined and empirically evaluated. Tested against a benchmark corpus of 38,000 words of previously unseen text, this syntax-based system reaches an accuracy of above 99%. Compared to the 95-97% accuracy of its best competitors, this result suggests the feasibility of the linguistic approach also in part-of-speech analysis.
Despite recent advances in abstractive summarization, current summarization systems still suffer from content hallucinations where models generate text that is either irrelevant or contradictory to the source document. However, prior work has been predicated on the assumption that any generated facts not appearing explicitly in the source are undesired hallucinations. Methods have been proposed to address this scenario by ultimately improving `faithfulness' to the source document, but in reality, there is a large portion of entities in the gold reference targets that are not directly in the source. In this work, we show that these entities are not aberrations, but they instead require utilizing external world knowledge to infer reasoning paths from entities in the source. We show that by utilizing an external knowledge base, we can improve the faithfulness of summaries without simply making them more extractive, and additionally, we show that external knowledge bases linked from the source can benefit the factuality of generated summaries.
Speaker clustering is an essential step in conventional speaker diarization systems and is typically addressed as an audio-only speech processing task. The language used by the participants in a conversation, however, carries additional information that can help improve the clustering performance. This is especially true in conversational interactions, such as business meetings, interviews, and lectures, where specific roles assumed by interlocutors (manager, client, teacher, etc.) are often associated with distinguishable linguistic patterns. In this paper we propose to employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method is applied on two different domains, namely on medical interactions and on podcast episodes, and is shown to yield improved results when compared to the audio-only approach.
In recent years , there has been an upsurge in a new form of entertainment medium called memes. These memes although seemingly innocuous have transcended onto the boundary of online harassment against women and created an unwanted bias against them . To help alleviate this problem , we propose an early fusion model for prediction and identification of misogynistic memes and its type in this paper for which we participated in SemEval-2022 Task 5 . The model receives as input meme image with its text transcription with a target vector. Given that a key challenge with this task is the combination of different modalities to predict misogyny, our model relies on pretrained contextual representations from different state-of-the-art transformer-based language models and pretrained image pretrained models to get an effective image representation. Our model achieved competitive results on both SubTask-A and SubTask-B with the other competition teams and significantly outperforms the baselines.