Climate change is a burning issue of our time, with Sustainable Development Goal (SDG) 13 of the United Nations demanding global climate action. Recognizing the urgency, world leaders signed an agreement in Paris in 2015 committing to voluntary action to reduce carbon emissions. However, the scale, magnitude, and processes of climate action vary globally, especially between developed and developing countries. As a result, debates and discussions on climate change, from parliament to social media, produce data across wide-ranging sources that are essential to policy design and implementation. The downside is that we do not currently have mechanisms to pool the dispersed worldwide knowledge emerging from these structured and unstructured data sources. This paper thematically discusses how NLP techniques could be employed in climate policy research and contribute to the good of society at large. In particular, we exemplify the symbiosis of NLP and climate policy research through four methodologies. The first identifies the major topics related to climate policy using automated content analysis. The second investigates the opinions (sentiments) expressed in major actors' narratives on climate policy. The third explores climate actors' beliefs in terms of pro- or anti-climate orientation. Finally, we discuss developing a Climate Knowledge Graph. This theme paper further argues that creating a knowledge platform would help in formulating a holistic climate policy and effective climate action. Such a knowledge platform would integrate policy actors' varied opinions from different social sectors such as government, business, civil society, and the scientific community. The research outcome will add value to effective climate action because policymakers can make informed decisions by consulting diverse public opinion on a comprehensive platform.
Information extracted from social media streams has been leveraged to forecast the outcome of a large number of real-world events, from political elections to stock market fluctuations. An increasing number of studies demonstrate how the analysis of social media conversations provides cheap access to the wisdom of the crowd. However, the extent to which, and the contexts in which, such forecasting power can be effectively leveraged have not yet been verified, at least not systematically. It is also unclear how social-media-based predictions compare to those based on alternative information sources. To address these issues, we develop a machine learning framework that leverages social media streams to automatically identify and predict the outcomes of soccer matches. We focus in particular on matches in which at least one of the possible outcomes is deemed highly unlikely by professional bookmakers. We argue that sporting events offer a systematic approach for testing the predictive power of social media, and allow us to compare such power against the rigorous baselines set by external sources. Despite such strict baselines, our framework yields a marginal profit above 8% when used to inform simple betting strategies. The system is based on real-time sentiment analysis and exploits data collected immediately before the games, allowing for informed bets. We discuss the rationale behind our approach, and describe the learning framework, its prediction performance, and the return it provides compared to a set of betting strategies. To test our framework we use both historical Twitter data from the 2014 FIFA World Cup games and real-time Twitter data collected by monitoring the conversations about all soccer matches of four major European tournaments (FA Premier League, Serie A, La Liga, and Bundesliga) and the 2014 UEFA Champions League, between Oct. 25th 2014 and Nov. 26th 2014.
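To make the notion of marginal profit from a betting strategy concrete, the following is a minimal sketch that assumes decimal odds and defines profit as net return per unit staked. The function name, the tuple layout, and this definition of profit are illustrative assumptions; the abstract does not specify how its figure was computed.

```python
def marginal_profit(bets):
    """Net profit per unit staked (assumed definition, not taken from the paper).

    `bets` is a list of (stake, decimal_odds, won) tuples, where a winning
    bet pays back stake * decimal_odds and a losing bet pays back nothing.
    """
    total_staked = sum(stake for stake, _, _ in bets)
    if total_staked == 0:
        return 0.0  # no money at risk, so no profit or loss
    returned = sum(stake * odds for stake, odds, won in bets if won)
    return (returned - total_staked) / total_staked


# Hypothetical example: four unit bets on unlikely outcomes, one of which wins.
example_bets = [(1.0, 5.0, True), (1.0, 4.0, False), (1.0, 6.0, False), (1.0, 3.0, False)]
example_profit = marginal_profit(example_bets)
```

Under this definition a strategy only needs to win occasionally on long-odds outcomes to be profitable, which is why matches with a highly unlikely outcome are an informative test case.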
Data mining, machine learning, and natural language processing are powerful techniques that can be used together to extract information from large texts. Depending on the task or problem at hand, many different approaches can be used. The available methods are continuously being optimized, but not all of them have been tested and compared on a set of problems that can be solved using supervised machine learning algorithms. What happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss simply by being able to process more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs among training data size, learning time, and quality obtained. We propose a performance trade-off framework and apply it to three important text processing problems: Named Entity Recognition, Sentiment Analysis, and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use data subsets of increasing size, ranging from 50 MB to several GB. We also consider the impact of the data set and the evaluation technique. We find that the results do not change significantly and that most of the time the best algorithm is the fastest. However, we also show that the results for small data (say, less than 100 MB) differ from the results for big data, and in those cases the best algorithm is much harder to determine.
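The kind of measurement such a trade-off study relies on can be sketched as a loop over training subsets of increasing size that records learning time and quality for each. This is only an illustrative sketch: the function names, the toy majority-label classifier, and the use of accuracy as the quality measure are assumptions, not the framework proposed in the abstract.

```python
import time
from collections import Counter


def measure_tradeoff(train, evaluate, train_data, test_data, sizes):
    """Train on the first n examples for each n in `sizes`, recording time and quality."""
    results = []
    for n in sizes:
        subset = train_data[:n]
        start = time.perf_counter()
        model = train(subset)
        elapsed = time.perf_counter() - start
        results.append({
            "size": len(subset),
            "seconds": elapsed,
            "quality": evaluate(model, test_data),
        })
    return results


# Toy stand-ins for a real learner and metric (hypothetical, for illustration only).
def train_majority(examples):
    """'Model' that always predicts the most frequent training label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def accuracy(model, examples):
    """Fraction of test examples whose label equals the model's constant prediction."""
    return sum(1 for _, label in examples if label == model) / len(examples)
```

Plotting `quality` against `seconds` for each algorithm and subset size is one simple way to see whether a faster method that processes more data can match a slower, more accurate one.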
Extreme pricing anomalies may occur unexpectedly without a trivial cause, and equity traders typically go through a meticulous process of sourcing disparate information and analyzing its reliability before integrating it into a trusted knowledge base. We introduce DeepTrust, a reliable financial knowledge retrieval framework on Twitter that explains extreme price moves at speed while ensuring data veracity using state-of-the-art NLP techniques. Our proposed framework consists of three modules, specialized for anomaly detection, information retrieval, and reliability assessment. The workflow starts with identifying anomalous asset price changes using machine learning models trained on historical pricing data, and retrieving correlated unstructured data from Twitter using enhanced queries with dynamic search conditions. DeepTrust extrapolates information reliability from tweet features, traces of generative language models, argumentation structure, and subjectivity and sentiment signals, and refines a concise collection of credible tweets for market insights. The framework is evaluated on two self-annotated financial anomalies, i.e., the Twitter and Facebook stock prices on 29 and 30 April 2021. The optimal setup outperforms the baseline classifier by 7.75% and 15.77% on F0.5-scores, and by 10.55% and 18.88% on precision, respectively, demonstrating its capability to screen out unreliable information precisely. In addition, the information retrieval and reliability assessment modules are analyzed individually for their effectiveness and the causes of their limitations, identifying subjective and objective factors that influence performance. As a collaborative project with Refinitiv, this framework paves a promising path towards building a scalable commercial solution that helps traders reach investment decisions on pricing anomalies in real time with authenticated knowledge from social media platforms.
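The F0.5-score reported above is the F-beta measure with beta = 0.5, which weights precision more heavily than recall and so suits a task where admitting an unreliable tweet is costlier than missing a reliable one. A minimal sketch of the standard formula follows; the function name and the count-based interface are illustrative choices, not part of DeepTrust.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from true positives, false positives, and false negatives.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta < 1 favours precision.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was correctly retrieved
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision 0.8, recall 0.5.
score_f05 = f_beta(tp=8, fp=2, fn=8)            # precision-weighted
score_f1 = f_beta(tp=8, fp=2, fn=8, beta=1.0)   # balanced
```

With these hypothetical counts the classifier is more precise than it is complete, so its F0.5 is higher than its F1.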
Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions, ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to those services. In a world where nearly every new service has its own API, documentation, and/or client library, data scientists who need to combine diverse features obtained from multiple sources are often forced to write and maintain ever more elaborate feature extraction pipelines. To address this challenge, we introduce a new open-source framework for comprehensive multimodal feature extraction. Pliers is an open-source Python package that supports standardized annotation of diverse data types (video, images, audio, and text), and is expressly designed with both ease of use and extensibility in mind. Users can apply a wide range of pre-existing feature extraction tools to their data in just a few lines of Python code, and can also easily add their own custom extractors by writing modular classes. A graph-based API enables rapid development of complex feature extraction pipelines that output results in a single, standardized format. We describe the package's architecture, detail its major advantages over previous feature extraction toolboxes, and use a sample application to a large functional MRI dataset to illustrate how pliers can significantly reduce the time and effort required to construct sophisticated feature extraction workflows while increasing code clarity and maintainability.
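The idea of modular extractor classes that all emit one standardized result format can be illustrated with a small, self-contained sketch. Every class and function name below is hypothetical and chosen only for illustration; this is not the pliers API, which should be taken from the package's own documentation.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical common interface: map one input to a {feature: value} dict."""

    name = "extractor"

    @abstractmethod
    def extract(self, text):
        """Return a dict of feature names to values for a single text input."""


class WordCountExtractor(Extractor):
    name = "word_count"

    def extract(self, text):
        return {"n_words": len(text.split())}


class ExclamationExtractor(Extractor):
    name = "exclamation"

    def extract(self, text):
        return {"n_exclamations": text.count("!")}


def run_extractors(extractors, texts):
    """Apply every extractor to every input and return rows in one long format."""
    rows = []
    for text in texts:
        for extractor in extractors:
            for feature, value in extractor.extract(text).items():
                rows.append({
                    "input": text,
                    "extractor": extractor.name,
                    "feature": feature,
                    "value": value,
                })
    return rows
```

Because every extractor returns the same shape, adding a new feature source in this sketch means writing one small subclass, and downstream code that consumes the rows does not change.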
Monitoring social discourse about COVID-19 vaccines is key to understanding how large populations perceive vaccination campaigns. We focus on 4765 unique popular tweets in English or Italian about COVID-19 vaccines posted between 12/2020 and 03/2021. One popular English tweet was liked up to 495,000 times, underscoring how popular tweets cognitively affected massive populations. We investigate both text and multimedia in tweets, building a knowledge graph of syntactic/semantic associations in messages, including visual features, and showing that online users framed social discourse mostly around the logistics of vaccine distribution. The English semantic frame of "vaccine" was highly polarised between trust/anticipation (towards the vaccine as a scientific asset saving lives) and anger/sadness (mentioning critical issues with dose administration). Semantic associations between "vaccine," "hoax" and conspiratorial jargon indicated the persistence of conspiracy theories about vaccines in massively read English posts (absent in Italian messages). The image analysis found that popular tweets with images of people wearing face masks used language lacking the trust and joy found in tweets showing people without masks, indicating a negative affect attributed to face covering in social discourse. A behavioural analysis revealed a tendency for users to share content eliciting joy, sadness and disgust and to like less sad messages, highlighting an interplay between emotions and content diffusion beyond sentiment. With the AstraZeneca vaccine being suspended in mid-March 2021, "Astrazeneca" was associated with trustful language driven by experts, but popular Italian tweets framed "vaccine" by crucially replacing earlier levels of trust with deep sadness. Our results stress how cognitive networks and innovative multimedia processing open new ways of reconstructing online perceptions about vaccines and trust.
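As a rough illustration of how word associations in messages can be turned into a graph, the sketch below counts how often two words appear in the same message. This is plain word co-occurrence, a deliberately simpler technique swapped in for the syntactic/semantic associations described above; the function name, tokenisation, and stopword handling are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(messages, stopwords=frozenset()):
    """Count, for each unordered word pair, the number of messages containing both."""
    edges = Counter()
    for message in messages:
        # Lowercase, strip simple punctuation, and keep each word once per message.
        words = {w.strip('.,!?"').lower() for w in message.split()}
        words = sorted(words - set(stopwords) - {""})
        # Sorting makes each pair a canonical (alphabetical) tuple.
        edges.update(combinations(words, 2))
    return edges


# Hypothetical toy messages, not data from the study.
toy_edges = cooccurrence_edges(
    ["Vaccine hoax", "vaccine saves lives", "the vaccine hoax again"],
    stopwords={"the"},
)
```

The neighbours of a node such as "vaccine" in the resulting weighted graph give a crude approximation of its semantic frame, which could then be scored against an emotion lexicon.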
In this paper, we present in detail our solutions for the MuSe-Stress and MuSe-Physio sub-challenges of the Multimodal Sentiment Challenge (MuSe) 2021. The goal of the MuSe-Stress sub-challenge is to predict the level of emotional arousal and valence in a time-continuous manner from audio-visual recordings, and the goal of the MuSe-Physio sub-challenge is to predict the level of psycho-physiological arousal from a) human annotations fused with b) galvanic skin response (also known as Electrodermal Activity (EDA)) signals from people under stress. Both sub-challenges use the Ulm-TSST dataset, a novel subset of the audio-visual textual Ulm-Trier Social Stress dataset that features German speakers in a stress situation induced by the Trier Social Stress Test (TSST). For the MuSe-Stress sub-challenge, we highlight three aspects of our solution: 1) audio-visual features and bio-signal features are used for emotional state recognition; 2) a Long Short-Term Memory (LSTM) network with a self-attention mechanism is used to capture complex temporal dependencies within the feature sequences; 3) a late fusion strategy is adopted to further boost the model's recognition performance by exploiting complementary information scattered across multimodal sequences. Our proposed model achieves a CCC of 0.6159 and 0.4609 for valence and arousal respectively on the test set, both of which rank in the top 3. For the MuSe-Physio sub-challenge, we first extract audio-visual features and bio-signal features from multiple modalities. Then, an LSTM module with a self-attention mechanism, Gated Convolutional Neural Networks (GCNN), and an LSTM network are used to model the complex temporal dependencies in the sequence. Finally, the late fusion strategy is applied. Our proposed method achieves a CCC of 0.5412 on the test set, which also ranks in the top 3.
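The CCC values above refer to the concordance correlation coefficient, which rewards predictions that match the gold signal in both trend and scale. The sketch below gives the standard formula together with a simple weighted-average late fusion; the fusion rule and all function names are assumptions for illustration, since the abstract does not describe how its late fusion combines the streams.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2),
    using population (divide-by-n) variance and covariance.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    # CCC is undefined when both sequences are the same constant; this sketch returns 0.0.
    return 2 * cov / denom if denom else 0.0


def late_fusion(streams, weights):
    """Weighted average of per-modality prediction sequences (assumed fusion rule)."""
    total = sum(weights)
    length = len(streams[0])
    return [
        sum(w * stream[t] for w, stream in zip(weights, streams)) / total
        for t in range(length)
    ]
```

Unlike Pearson correlation, CCC penalises a constant offset: predictions `[2, 3, 4]` against gold `[1, 2, 3]` are perfectly correlated yet score only 4/7.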
In this paper, we combine conventional vocal feature extraction (MFCC, STFT) with deep-learning approaches such as CNNs and with context-level analysis of the accompanying textual data to improve emotion-level classification. We explore models that have not previously been tested in order to gauge the difference in performance and accuracy. We apply hyperparameter sweeps and data augmentation to improve performance. Finally, we examine whether a real-time approach is feasible and can be readily integrated into existing systems.
How do news sources tackle controversial issues? In this work, we take a data-driven approach to understand how controversy interplays with emotional expression and biased language in the news. We begin by introducing a new dataset of controversial and non-controversial terms collected using crowdsourcing. Then, focusing on 15 major U.S. news outlets, we compare millions of articles discussing controversial and non-controversial issues over a span of 7 months. We find that in general, when it comes to controversial issues, the use of negative affect and biased language is prevalent, while the use of strong emotion is tempered. We also observe many differences across news sources. Using these findings, we show that we can indicate to what extent an issue is controversial, by comparing it with other issues in terms of how they are portrayed across different media.
Given a patent document, identifying distinct semantic annotations is an interesting research problem. Text annotation helps patent practitioners such as examiners and patent attorneys to quickly identify the key arguments of an invention, which in turn enables timely marking of a patent text. In manual patent analysis, it is common practice to mark paragraphs according to their semantic content in order to improve readability. This semantic annotation process is laborious and time-consuming. To alleviate this problem, we propose a novel dataset for training Machine Learning algorithms to automate the highlighting process. The contributions of this work are: i) we develop a novel multi-class dataset of 150k samples by traversing a decade of USPTO patents, ii) we report statistics and distributions of the data using exploratory data analysis, iii) we develop baseline Machine Learning models that use the dataset to address the patent paragraph highlighting task, iv) we open-source the dataset and code for this task through a dedicated GitHub page: https://github.com/Renuk9390/Patent_Sentiment_Analysis and v) we outline a future path for extending this work using Deep Learning and domain-specific pre-trained language models to develop a highlighting tool. This work assists patent practitioners in highlighting semantic information automatically and helps create a sustainable and efficient patent analysis process using the capabilities of Machine Learning.