Much of the user-generated content on social media is provided by ordinary people telling stories about their daily lives. We develop and test a novel method for learning fine-grained common-sense knowledge from these stories about contingent (causal and conditional) relationships between everyday events. This type of knowledge is useful for text and story understanding, information extraction, question answering, and text summarization. We test and compare different methods for learning contingency relation, and compare what is learned from topic-sorted story collections vs. general-domain stories. Our experiments show that using topic-specific datasets enables learning finer-grained knowledge about events and results in significant improvement over the baselines. An evaluation on Amazon Mechanical Turk shows 82% of the relations between events that we learn from topic-sorted stories are judged as contingent.
We present CommentWatcher, an open source tool aimed at analyzing discussions on web forums. Constructed as a web platform, CommentWatcher features automatic mass fetching of user posts from forum on multiple sites, extracting topics, visualizing the topics as an expression cloud and exploring their temporal evolution. The underlying social network of users is simultaneously constructed using the citation relations between users and visualized as a graph structure. Our platform addresses the issues of the diversity and dynamics of structures of webpages hosting the forums by implementing a parser architecture that is independent of the HTML structure of webpages. This allows easy on-the-fly adding of new websites. Two types of users are targeted: end users who seek to study the discussed topics and their temporal evolution, and researchers in need of establishing a forum benchmark dataset and comparing the performances of analysis tools.
Building recognition and 3D reconstruction of human made structures in urban scenarios has become an interesting and actual topic in the image processing domain. For this research topic the Computer Vision and Augmented Reality areas intersect for creating a better understanding of the urban scenario for various topics. In this paper we aim to introduce a dataset solution, the TMBuD, that is better fitted for image processing on human made structures for urban scene scenarios. The proposed dataset will allow proper evaluation of salient edges and semantic segmentation of images focusing on the street view perspective of buildings. The images that form our dataset offer various street view perspectives of buildings from urban scenarios, which allows for evaluating complex algorithms. The dataset features 160 images of buildings from Timisoara, Romania, with a resolution of 768 x 1024 pixels each.
Nowadays, data analysis has become a problem as the amount of data is constantly increasing. In order to overcome this problem in textual data, many models and methods are used in natural language processing. The topic modeling field is one of these methods. Topic modeling allows determining the semantic structure of a text document. Latent Dirichlet Allocation (LDA) is the most common method among topic modeling methods. In this article, the proposed n-stage LDA method, which can enable the LDA method to be used more effectively, is explained in detail. The positive effect of the method has been demonstrated by the applied English and Turkish studies. Since the method focuses on reducing the word count in the dictionary, it can be used language-independently. You can access the open-source code of the method and the example: https://github.com/anil1055/n-stage_LDA
Collecting together microblogs representing opinions about the same topics within the same timeframe is useful to a number of different tasks and practitioners. A major question is how to evaluate the quality of such thematic clusters. Here we create a corpus of microblog clusters from three different domains and time windows and define the task of evaluating thematic coherence. We provide annotation guidelines and human annotations of thematic coherence by journalist experts. We subsequently investigate the efficacy of different automated evaluation metrics for the task. We consider a range of metrics including surface level metrics, ones for topic model coherence and text generation metrics (TGMs). While surface level metrics perform well, outperforming topic coherence metrics, they are not as consistent as TGMs. TGMs are more reliable than all other metrics considered for capturing thematic coherence in microblog clusters due to being less sensitive to the effect of time windows.
Students' perception of classes measured through their opinions on teaching surveys allows to identify deficiencies and problems, both in the environment and in the learning methodologies. The purpose of this paper is to study, through sentiment analysis using natural language processing (NLP) and machine learning (ML) techniques, those opinions in order to identify topics that are relevant for students, as well as predicting the associated sentiment via polarity analysis. As a result, it is implemented, trained and tested two algorithms to predict the associated sentiment as well as the relevant topics of such opinions. The combination of both approaches then becomes useful to identify specific properties of the students' opinions associated with each sentiment label (positive, negative or neutral opinions) and topic. Furthermore, we explore the possibility that students' perception surveys are carried out without closed questions, relying on the information that students can provide through open questions where they express their opinions about their classes.
Emotion recognition has become a major problem in computer vision in recent years that made a lot of effort by researchers to overcome the difficulties in this task. In the field of affective computing, emotion recognition has a wide range of applications, such as healthcare, robotics, human-computer interaction. Due to its practical importance for other tasks, many techniques and approaches have been investigated for different problems and various data sources. Nevertheless, comprehensive fusion of the audio-visual and language modalities to get the benefits from them is still a problem to solve. In this paper, we present and discuss our classification methodology for MuSe-Topic Sub-challenge, as well as the data and results. For the topic classification, we ensemble two language models which are ALBERT and RoBERTa to predict 10 classes of topics. Moreover, for the classification of valence and arousal, SVM and Random forests are employed in conjunction with feature selection to enhance the performance.
Research on computational argumentation is currently being intensively investigated. The goal of this community is to find the best pro and con arguments for a user given topic either to form an opinion for oneself, or to persuade others to adopt a certain standpoint. While existing argument mining methods can find appropriate arguments for a topic, a correct classification into pro and con is not yet reliable. The same side stance classification task provides a dataset of argument pairs classified by whether or not both arguments share the same stance and does not need to distinguish between topic-specific pro and con vocabulary but only the argument similarity within a stance needs to be assessed. The results of our contribution to the task are build on a setup based on the BERT architecture. We fine-tuned a pre-trained BERT model for three epochs and used the first 512 tokens of each argument to predict if two arguments share the same stance.
The tagging of on-line content with informative keywords is a widespread phenomenon from scientific article repositories through blogs to on-line news portals. In most of the cases, the tags on a given item are free words chosen by the authors independently. Therefore, relations among keywords in a collection of news items is unknown. However, in most cases the topics and concepts described by these keywords are forming a latent hierarchy, with the more general topics and categories at the top, and more specialised ones at the bottom. Here we apply a recent, cooccurrence-based tag hierarchy extraction method to sets of keywords obtained from four different on-line news portals. The resulting hierarchies show substantial differences not just in the topics rendered as important (being at the top of the hierarchy) or of less interest (categorised low in the hierarchy), but also in the underlying network structure. This reveals discrepancies between the plausible keyword association frameworks in the studied news portals.