Since the start of the COVID-19 pandemic, several relevant corpora from various sources, containing millions of data points, have been presented in the literature. While these corpora are valuable in supporting many analyses of this specific pandemic, researchers require additional benchmark corpora that cover other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our other COVID-19 related work, we found very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. In this paper, we present EPIC, a large-scale epidemic corpus that contains 20 million micro-blog posts, i.e., tweets crawled from Twitter, from 2006 to 2020. EPIC contains a subset of 17.8 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 3.5 million tweets on six global epidemic outbreaks: 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola. Furthermore, we explore and discuss the properties of the corpus with statistics of key terms and hashtags and a trend analysis for each subset. Finally, we demonstrate the value and impact that EPIC could create through a discussion of multiple use cases of cross-epidemic research topics that have attracted growing interest in recent years. These use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language understanding and economic modeling.
Classification of crisis events, such as natural disasters, terrorist attacks and pandemics, is a crucial task for creating early signals and informing relevant parties so that they can act promptly to reduce overall damage. Although crises such as natural disasters can be predicted by professional institutions, certain events are first signaled by civilians, such as the recent COVID-19 pandemic. Social media platforms such as Twitter often expose firsthand signals of such crises through high-volume information exchange, with over half a billion tweets posted daily. Prior works proposed various crisis embedding and classification methods using conventional Machine Learning and Neural Network models. However, none of these works performs crisis embedding and classification using state-of-the-art attention-based deep neural network models, such as Transformers and document-level contextual embeddings. This work proposes CrisisBERT, an end-to-end transformer-based model for two crisis classification tasks, namely crisis detection and crisis recognition, which shows promising results in both accuracy and F1 scores. The proposed model also demonstrates superior robustness over the benchmark, as it shows only a marginal performance compromise when extending from 6 to 36 events with only 51.4% additional data points. We also propose Crisis2Vec, an attention-based, document-level contextual embedding architecture for crisis embedding, which achieves better performance than conventional crisis embedding methods such as Word2Vec and GloVe. To the best of our knowledge, our work is the first in the literature to propose transformer-based crisis classification and document-level contextual crisis embedding.
Occupational data mining and analysis is attracting growing interest. In today's job market, it enables companies to predict employee turnover, model career trajectories, screen resumes and perform other human resource tasks. A key requirement for these tasks is an occupation-related dataset. However, most studies use proprietary datasets or do not make their datasets publicly available, thus impeding development in this area. To address this issue, we present the Industrial and Professional Occupation Dataset (IPOD), which comprises 192k job titles belonging to 56k LinkedIn users. In addition to making IPOD publicly available, we also: (i) manually annotate each job title with its associated level of seniority, domain of work and location; and (ii) provide embeddings for job titles and discuss various use cases. This dataset is publicly available at https://github.com/junhua/ipod.
Job titles are the most fundamental building blocks for occupational data mining tasks, such as Career Modelling and Job Recommendation. However, there is no publicly available dataset to support such efforts. In this work, we present the Industrial and Professional Occupations Dataset (IPOD), a comprehensive corpus that consists of over 190,000 job titles crawled from over 56,000 LinkedIn profiles. To the best of our knowledge, IPOD is the first dataset released for industrial occupations mining. We use a knowledge-based approach for sequence tagging, creating a gazetteer with domain-specific named entities tagged by 3 experts. The named-entity (NE) tags of all titles are populated from the gazetteer using the BIOES scheme. Finally, we develop 4 baseline models for the NER task on the dataset: Linear Regression, CRF, LSTM and the state-of-the-art bi-directional LSTM-CRF. Both CRF and LSTM-CRF outperform humans in both exact-match accuracy and F1 scores.
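As a rough illustration of how a gazetteer can populate BIOES tags for a job title, consider the minimal Python sketch below. It is not the IPOD pipeline: the gazetteer entries, the label names (SEN, DOM, LOC), the function name bioes_tag and the longest-match-first lookup are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical toy gazetteer: lower-cased phrase -> entity label.
# Entries and labels are illustrative assumptions, not taken from IPOD.
GAZETTEER: Dict[str, str] = {
    "senior": "SEN",
    "chief executive officer": "SEN",
    "software engineer": "DOM",
    "singapore": "LOC",
}


def bioes_tag(tokens: List[str], gazetteer: Dict[str, str], max_len: int = 3) -> List[str]:
    """Tag tokens with BIOES labels using longest-match-first gazetteer lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(t.lower() for t in tokens[i:i + n])
            label = gazetteer.get(phrase)
            if label is None:
                continue
            if n == 1:
                tags[i] = f"S-{label}"          # Single-token entity
            else:
                tags[i] = f"B-{label}"          # Begin
                for j in range(i + 1, i + n - 1):
                    tags[j] = f"I-{label}"      # Inside
                tags[i + n - 1] = f"E-{label}"  # End
            i += n
            matched = True
            break
        if not matched:
            i += 1  # Token stays tagged "O" (outside any entity)
    return tags


if __name__ == "__main__":
    title = "Senior Software Engineer Singapore".split()
    print(list(zip(title, bioes_tag(title, GAZETTEER))))
```

Tokens that match no gazetteer phrase keep the O tag, so the same function can be applied to titles that contain unknown words.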
The demand for Itinerary Planning has grown rapidly in recent years as the economy and standard of living improve globally. Nonetheless, itinerary recommendation remains a complex and difficult task, especially when it must be aware of queuing time and crowds. This difficulty is due to the large number of parameters involved, such as attraction popularity, queuing time, walking time and operating hours. Many existing works adopt a data-driven approach and propose solutions from a single-person perspective, but do not address real-world problems that result from natural crowd behavior, such as the Selfish Routing problem, which describes the inefficient network usage and sub-optimal social outcome that arise when agents are left to decide freely. In this work, we propose the Strategic and Crowd-Aware Itinerary Recommendation (SCAIR) algorithm, which takes a game-theoretic approach to address the Selfish Routing problem and optimize social welfare in real-world situations. To address the NP-hardness of the social welfare optimization problem, we further propose a Markov Decision Process (MDP) approach which enables our simulations to be carried out in polynomial time. We then use real-world data to evaluate the proposed algorithm against three benchmarks: two intuitive strategies commonly adopted in real life and a recent algorithm published in the literature. Our simulation results highlight the existence of the Selfish Routing problem and show that SCAIR outperforms the benchmarks in handling this issue with real-world data.
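The Selfish Routing problem mentioned above can be illustrated with a small Python sketch based on the classic two-route Pigou example. This is not the SCAIR algorithm: the two hypothetical attractions, the 60-minute cost and the function names are assumptions chosen for illustration. One attraction always takes a fixed time, while the other takes a time proportional to the share of visitors who choose it.

```python
def average_cost(share: float, fixed: float = 60.0) -> float:
    """Mean minutes per visitor when `share` of visitors pick the crowded attraction.

    Assumed toy model: the crowded attraction costs fixed * share minutes,
    the alternative always costs `fixed` minutes.
    """
    crowded_cost = fixed * share
    return share * crowded_cost + (1.0 - share) * fixed


def selfish_share() -> float:
    """Equilibrium share under free choice.

    The crowded attraction never costs more than the fixed one (fixed * share <= fixed),
    so no visitor gains by switching away and everyone ends up choosing it.
    """
    return 1.0


def optimal_share(steps: int = 1000) -> float:
    """Share that minimizes the average cost, found by a simple grid search."""
    best = min(range(steps + 1), key=lambda k: average_cost(k / steps))
    return best / steps


if __name__ == "__main__":
    eq, opt = selfish_share(), optimal_share()
    print("selfish:", eq, average_cost(eq))
    print("social optimum:", opt, average_cost(opt))
    print("ratio:", average_cost(eq) / average_cost(opt))
```

In this toy setting, free choice leaves every visitor with the same 60-minute cost, whereas splitting the crowd evenly lowers the average, which is the kind of gap a crowd-aware recommender aims to close.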
Recent years have witnessed dramatic progress in neural machine translation (NMT); however, methods for manually guiding the translation procedure remain underexplored. Previous works proposed to handle this problem through lexically-constrained beam search in the decoding phase. Unfortunately, these lexically-constrained beam search methods suffer from two major disadvantages: high computational complexity and hard beam search, which generates unexpected translations. In this paper, we propose to learn the ability of lexically-constrained translation with an external memory, which overcomes the above-mentioned disadvantages. For the training process, phrase pairs are automatically extracted using alignment and sentence parsing, and are then encoded into an external memory. This memory is then used to provide lexically-constrained information for training through a memory-attention mechanism. Various experiments are conducted on WMT Chinese-to-English and English-to-German tasks. The results demonstrate the effectiveness of our method.
This paper describes the USTC-NEL system for the speech translation task of the IWSLT Evaluation 2018. The system is a conventional pipeline system which contains 3 modules: speech recognition, post-processing and machine translation. We train a group of hybrid-HMM models for speech recognition, and for machine translation we train Transformer-based neural machine translation models that take text in the style of speech recognition output as input. Experiments conducted on the IWSLT 2018 task indicate that, compared with the baseline system from KIT, our system achieves an improvement of 14.9 BLEU.