Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40.
Programming question and answer (Q & A) websites, such as Quora, Stack Overflow, and Yahoo! Answer etc. helps us to understand the programming concepts easily and quickly in a way that has been tested and applied by many software developers. Stack Overflow is one of the most frequently used programming Q\&A website where the questions and answers posted are presently analyzed manually, which requires a huge amount of time and resource. To save the effort, we present a topic modeling based technique to analyze the words of the original texts to discover the themes that run through them. We also propose a method to automate the process of reviewing the quality of questions on Stack Overflow dataset in order to avoid ballooning the stack overflow with insignificant questions. The proposed method also recommends the appropriate tags for the new post, which averts the creation of unnecessary tags on Stack Overflow.
The main disadvantage of Magnetic Resonance Imaging (MRI) are its long scan times and, in consequence, its sensitivity to motion. Exploiting the complementary information from multiple receive coils, parallel imaging is able to recover images from under-sampled k-space data and to accelerate the measurement. Because parallel magnetic resonance imaging can be used to accelerate basically any imaging sequence it has many important applications. Parallel imaging brought a fundamental shift in image reconstruction: Image reconstruction changed from a simple direct Fourier transform to the solution of an ill-conditioned inverse problem. This work gives an overview of image reconstruction from the perspective of inverse problems. After introducing basic concepts such as regularization, discretization, and iterative reconstruction, advanced topics are discussed including algorithms for auto-calibration, the connection to approximation theory, and the combination with compressed sensing.
Universal Grammar (UG) theory has been one of the most important research topics in linguistics since introduced five decades ago. UG specifies the restricted set of languages learnable by human brain, and thus, many researchers believe in its biological roots. Numerous empirical studies of neurobiological and cognitive functions of the human brain, and of many natural languages, have been conducted to unveil some aspects of UG. This, however, resulted in different and sometimes contradicting theories that do not indicate a universally unique grammar. In this research, we tackle the UG problem from an entirely different perspective. We search for the Unique Universal Grammar (UUG) that facilitates communication and knowledge transfer, the sole purpose of a language. We formulate this UG and show that it is unique, intrinsic, and cosmic, rather than humanistic. Initial analysis on a widespread natural language already showed some positive results.
The riots in Stockholm in May 2013 were an event that reverberated in the world media for its dimension of violence that had spread through the Swedish capital. In this study we have investigated the role of social media in creating media phenomena via text mining and natural language processing. We have focused on two channels of communication for our analysis: Twitter and Poloniainfo.se (Forum of Polish community in Sweden). Our preliminary results show some hot topics driving discussion related mostly to Swedish Police and Swedish Politics by counting word usage. Typical features for media intervention are presented. We have built networks of most popular phrases, clustered by categories (geography, media institution, etc.). Sentiment analysis shows negative connotation with Police. The aim of this preliminary exploratory quantitative study was to generate questions and hypotheses, which we could carefully follow by deeper more qualitative methods.
The development of compositional distributional models of semantics reconciling the empirical aspects of distributional semantics with the compositional aspects of formal semantics is a popular topic in the contemporary literature. This paper seeks to bring this reconciliation one step further by showing how the mathematical constructs commonly used in compositional distributional models, such as tensors and matrices, can be used to simulate different aspects of predicate logic. This paper discusses how the canonical isomorphism between tensors and multilinear maps can be exploited to simulate a full-blown quantifier-free predicate calculus using tensors. It provides tensor interpretations of the set of logical connectives required to model propositional calculi. It suggests a variant of these tensor calculi capable of modelling quantifiers, using few non-linear operations. It finally discusses the relation between these variants, and how this relation should constitute the subject of future work.
This paper reports the findings of a study conducted on the application of an on-line human-computer dialog system with natural language (chatbot) on the teaching of foreign languages. A keywords-based human-computer dialog system makes it possible that the user could chat with the computer using a natural language, i.e. in English or in German to some extent. So an experiment has been made using this system online to work as a chat partner with the users learning the foreign languages. Dialogs between the users and the chatbot are collected. Findings indicate that the dialogs between the human and the computer are mostly very short because the user finds the responses from the computer are mostly repeated and irrelevant with the topics and context and the program does not understand the language at all. With analysis of the keywords or pattern-matching mechanism used in this chatbot it can be concluded that this kind of system can not work as a teaching assistant program in foreign language learning.
This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.
Semantic knowledge can be a great asset to natural language processing systems, but it is usually hand-coded for each application. Although some semantic information is available in general-purpose knowledge bases such as WordNet and Cyc, many applications require domain-specific lexicons that represent words and categories for a particular topic. In this paper, we present a corpus-based method that can be used to build semantic lexicons for specific categories. The input to the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category. A user then reviews the top-ranked words and decides which ones should be entered in the semantic lexicon. In experiments with five categories, users typically found about 60 words per category in 10-15 minutes to build a core semantic lexicon.
As the use of AI-powered applications widens across multiple domains, so do increase the computational demands. Primary driver of AI technology are the deep neural networks (DNNs). When focusing either on cloud-based systems that serve multiple AI queries from different users each with their own DNN model, or on mobile robots and smartphones employing pipelines of various models or parallel DNNs for the concurrent processing of multi-modal data, the next generation of AI systems will have multi-DNN workloads at their core. Large-scale deployment of AI services and integration across mobile and embedded systems require additional breakthroughs in the computer architecture front, with processors that can maintain high performance as the number of DNNs increases while meeting the quality-of-service requirements, giving rise to the topic of multi-DNN accelerator design.