Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Smart Crawling: A New Approach toward Focus Crawling from Twitter

Oct 08, 2021
Ahmad Khazaie, Nacéra Bennacer Seghouani, Francesca Bugiotti

Twitter is a social network that offers a rich and interesting source of information challenging to retrieve and analyze. Twitter data can be accessed using a REST API. The available operations allow retrieving tweets on the basis of a set of keywords but with limitations such as the number of calls per minute and the size of results. Besides, there is no control on retrieved results and finding tweets which are relevant to a specific topic is a big issue. Given these limitations, it is important that the query keywords cover unambiguously the topic of interest in order to both reach the relevant answers and decrease the number of API calls. In this paper, we introduce a new crawling algorithm called "SmartTwitter Crawling" (STiC) that retrieves a set of tweets related to a target topic. In this algorithm, we take an initial keyword query and enrich it using a set of additional keywords that come from different data sources. STiC algorithm relies on a DFS search in Twittergraph where each reached tweet is considered if it is relevant with the query keywords using a scoring, updated throughout the whole crawling process. This scoring takes into account the tweet text, hashtags and the users who have posted the tweet, replied to the tweet, been mentioned in the tweet or retweeted the tweet. Given this score, STiC is able to select relevant tweets in each iteration and continue by adding the related valuable tweets. Several experiments have been achieved for different kinds of queries, the results showedthat the precision increases compared to a simple BFS search.

  Access Paper or Ask Questions

Annotation as a New Paradigm in Research Archiving

Oct 07, 2014
Dirk Roorda, Charles van den Heuvel

We outline a paradigm to preserve results of digital scholarship, whether they are query results, feature values, or topic assignments. This paradigm is characterized by using annotations as multifunctional carriers and making them portable. The testing grounds we have chosen are two significant enterprises, one in the history of science, and one in Hebrew scholarship. The first one (CKCC) focuses on the results of a project where a Dutch consortium of universities, research institutes, and cultural heritage institutions experimented for 4 years with language techniques and topic modeling methods with the aim to analyze the emergence of scholarly debates. The data: a complex set of about 20.000 letters. The second one (DTHB) is a multi-year effort to express the linguistic features of the Hebrew bible in a text database, which is still growing in detail and sophistication. Versions of this database are packaged in commercial bible study software. We state that the results of these forms of scholarship require new knowledge management and archive practices. Only when researchers can build efficiently on each other's (intermediate) results, they can achieve the aggregations of quality data by which new questions can be answered, and hidden patterns visualized. Archives are required to find a balance between preserving authoritative versions of sources and supporting collaborative efforts in digital scholarship. Annotations are promising vehicles for preserving and reusing research results. Keywords annotation, portability, archiving, queries, features, topics, keywords, Republic of Letters, Hebrew text databases.

* Proceedings of the American Society for Information Science and Technology; Volume 49, Issue 1, pages 1-10, 2012 

  Access Paper or Ask Questions

Where Are We? Using Scopus to Map the Literature at the Intersection Between Artificial Intelligence and Crime

Dec 23, 2019
Gian Maria Campedelli

Research on Artificial Intelligence (AI) applications has spread over many scientific disciplines. Scientists have tested the power of intelligent algorithms developed to predict (or learn from) natural, physical and social phenomena. This also applies to crime-related research problems. Nonetheless, studies that map the current state of the art at the intersection between AI and crime are lacking. What are the current research trends in terms of topics in this area? What is the structure of scientific collaboration when considering works investigating criminal issues using machine learning, deep learning and AI in general? What are the most active countries in this specific scientific sphere? Using data retrieved from Scopus database, this work quantitatively analyzes published works at the intersection between AI and crime employing network science to respond to these questions. Results show that researchers are mainly focusing on cyber-related criminal topics and that relevant themes such as algorithmic discrimination, fairness, and ethics are considerably overlooked. Furthermore, data highlight the extremely disconnected structure of co-authorship networks. Such disconnectedness may represent a substantial obstacle to a more solid community of scientists interested in these topics. Additionally, the graph of scientific collaboration indicates that countries that are more prone to engage in international partnerships are generally less central in the network. This means that scholars working in highly productive countries (e.g. the United States, China) tend to collaborate with researchers based in their same countries. Finally, current issues and future developments within this scientific area are also discussed.

* 25 pages, 12 figures, pre-print 

  Access Paper or Ask Questions

Recruitment Market Trend Analysis with Sequential Latent Variable Models

Dec 08, 2017
Chen Zhu, Hengshu Zhu, Hui Xiong, Pengliang Ding, Fang Xie

Recruitment market analysis provides valuable understanding of industry-specific economic growth and plays an important role for both employers and job seekers. With the rapid development of online recruitment services, massive recruitment data have been accumulated and enable a new paradigm for recruitment market analysis. However, traditional methods for recruitment market analysis largely rely on the knowledge of domain experts and classic statistical models, which are usually too general to model large-scale dynamic recruitment data, and have difficulties to capture the fine-grained market trends. To this end, in this paper, we propose a new research paradigm for recruitment market analysis by leveraging unsupervised learning techniques for automatically discovering recruitment market trends based on large-scale recruitment data. Specifically, we develop a novel sequential latent variable model, named MTLVM, which is designed for capturing the sequential dependencies of corporate recruitment states and is able to automatically learn the latent recruitment topics within a Bayesian generative framework. In particular, to capture the variability of recruitment topics over time, we design hierarchical dirichlet processes for MTLVM. These processes allow to dynamically generate the evolving recruitment topics. Finally, we implement a prototype system to empirically evaluate our approach based on real-world recruitment data in China. Indeed, by visualizing the results from MTLVM, we can successfully reveal many interesting findings, such as the popularity of LBS related jobs reached the peak in the 2nd half of 2014, and decreased in 2015.

* 11 pages, 30 figure, SIGKDD 2016 

  Access Paper or Ask Questions

Quizz: Targeted crowdsourcing with a billion (potential) users

Jun 02, 2015
Panagiotis G. Ipeirotis, Evgeniy Gabrilovich

We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user's competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertisingsystem about each user. The feedback allows the ad targeting mechanism to further optimize ad placement. Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.

* WWW '14 Proceedings of the 23rd international conference on World Wide Web. 11 pages 

  Access Paper or Ask Questions

Concise comparative summaries (CCS) of large text corpora with a human experiment

Apr 29, 2014
Jinzhu Jia, Luke Miratrix, Bin Yu, Brian Gawalt, Laurent El Ghaoui, Luke Barnesmoore, Sophie Clavier

In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between simple word frequency based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field. For a particular topic of interest (e.g., China or energy), CSS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels with the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive are then harvested as the summary. To validate our tool, we, using news articles from the New York Times international section, designed and conducted a human survey to compare the different summarizers with human understanding. We demonstrate our approach with two case studies, a media analysis of the framing of "Egypt" in the New York Times throughout the Arab Spring and an informal comparison of the New York Times' and Wall Street Journal's coverage of "energy." Overall, we find that the Lasso with $L^2$ normalization can be effectively and usefully used to summarize large corpora, regardless of document size.

* Annals of Applied Statistics 2014, Vol. 8, No. 1, 499-529 
* Published in at the Annals of Applied Statistics ( by the Institute of Mathematical Statistics (

  Access Paper or Ask Questions

Hidden in Plain Sight For Too Long: Using Text Mining Techniques to Shine a Light on Workplace Sexism and Sexual Harassment

Jul 01, 2019
Amir Karami, Suzanne C. Swan, Cynthia Nicole White, Kayla Ford

Objective: The goal of this study is to understand how people experience sexism and sexual harassment in the workplace by discovering themes in 2,362 experiences posted on the Everyday Sexism Project's website Method: This study used both quantitative and qualitative methods. The quantitative method was a computational framework to collect and analyze a large number of workplace sexual harassment experiences. The qualitative method was the analysis of the topics generated by a text mining method. Results: Twenty-three topics were coded and then grouped into three overarching themes from the sex discrimination and sexual harassment literature. The Sex Discrimination theme included experiences in which women were treated unfavorably due to their sex, such as being passed over for promotion, denied opportunities, paid less than men, and ignored or talked over in meetings. The Sex Discrimination and Gender harassment theme included stories about sex discrimination and gender harassment, such as sexist hostility behaviors ranging from insults and jokes invoking misogynistic stereotypes to bullying behavior. The last theme, Unwanted Sexual Attention, contained stories describing sexual comments and behaviors used to degrade women. Unwanted touching was the highest weighted topic, indicating how common it was for website users to endure being touched, hugged or kissed, groped, and grabbed. Conclusions: This study illustrates how researchers can use automatic processes to go beyond the limits of traditional research methods and investigate naturally occurring large scale datasets on the internet to achieve a better understanding of everyday workplace sexism experiences.

  Access Paper or Ask Questions

Deep Dive into Anonymity: A Large Scale Analysis of Quora Questions

Nov 17, 2018
Binny Mathew, Ritam Dutt, Suman Kalyan Maity, Pawan Goyal, Animesh Mukherjee

Anonymity forms an integral and important part of our digital life. It enables us to express our true selves without the fear of judgment. In this paper, we investigate the different aspects of anonymity in the social Q&A site Quora. The choice of Quora is motivated by the fact that this is one of the rare social Q&A sites that allow users to explicitly post anonymous questions and such activity in this forum has become normative rather than a taboo. Through an analysis of 5.1 million questions, we observe that at a global scale almost no difference manifests between the linguistic structure of the anonymous and the non-anonymous questions. We find that topical mixing at the global scale to be the primary reason for the absence. However, the differences start to feature once we "deep dive" and (topically) cluster the questions and compare the clusters that have high volumes of anonymous questions with those that have low volumes of anonymous questions. In particular, we observe that the choice to post the question as anonymous is dependent on the user's perception of anonymity and they often choose to speak about depression, anxiety, social ties and personal issues under the guise of anonymity. We further perform personality trait analysis and observe that the anonymous group of users has positive correlation with extraversion, agreeableness, and negative correlation with openness. Subsequently, to gain further insights, we build an anonymity grid to identify the differences in the perception on anonymity of the user posting the question and the community of users answering it. We also look into the first response time of the questions and observe that it is lowest for topics which talk about personal and sensitive issues, which hints toward a higher degree of community support and user engagement.

* 12 pages, 6 figures, and 12 tables 

  Access Paper or Ask Questions

Open vs Closed-ended questions in attitudinal surveys -- comparing, combining, and interpreting using natural language processing

May 03, 2022
Vishnu Baburajan, João de Abreu e Silva, Francisco Camara Pereira

To improve the traveling experience, researchers have been analyzing the role of attitudes in travel behavior modeling. Although most researchers use closed-ended surveys, the appropriate method to measure attitudes is debatable. Topic Modeling could significantly reduce the time to extract information from open-ended responses and eliminate subjective bias, thereby alleviating analyst concerns. Our research uses Topic Modeling to extract information from open-ended questions and compare its performance with closed-ended responses. Furthermore, some respondents might prefer answering questions using their preferred questionnaire type. So, we propose a modeling framework that allows respondents to use their preferred questionnaire type to answer the survey and enable analysts to use the modeling frameworks of their choice to predict behavior. We demonstrate this using a dataset collected from the USA that measures the intention to use Autonomous Vehicles for commute trips. Respondents were presented with alternative questionnaire versions (open- and closed- ended). Since our objective was also to compare the performance of alternative questionnaire versions, the survey was designed to eliminate influences resulting from statements, behavioral framework, and the choice experiment. Results indicate the suitability of using Topic Modeling to extract information from open-ended responses; however, the models estimated using the closed-ended questions perform better compared to them. Besides, the proposed model performs better compared to the models used currently. Furthermore, our proposed framework will allow respondents to choose the questionnaire type to answer, which could be particularly beneficial to them when using voice-based surveys.

  Access Paper or Ask Questions

CSRN: Collaborative Sequential Recommendation Networks for News Retrieval

Apr 07, 2020
Bing Bai, Guanhua Zhang, Ye Lin, Hao Li, Kun Bai, Bo Luo

Nowadays, news apps have taken over the popularity of paper-based media, providing a great opportunity for personalization. Recurrent Neural Network (RNN)-based sequential recommendation is a popular approach that utilizes users' recent browsing history to predict future items. This approach is limited that it does not consider the societal influences of news consumption, i.e., users may follow popular topics that are constantly changing, while certain hot topics might be spreading only among specific groups of people. Such societal impact is difficult to predict given only users' own reading histories. On the other hand, the traditional User-based Collaborative Filtering (UserCF) makes recommendations based on the interests of the "neighbors", which provides the possibility to supplement the weaknesses of RNN-based methods. However, conventional UserCF only uses a single similarity metric to model the relationships between users, which is too coarse-grained and thus limits the performance. In this paper, we propose a framework of deep neural networks to integrate the RNN-based sequential recommendations and the key ideas from UserCF, to develop Collaborative Sequential Recommendation Networks (CSRNs). Firstly, we build a directed co-reading network of users, to capture the fine-grained topic-specific similarities between users in a vector space. Then, the CSRN model encodes users with RNNs, and learns to attend to neighbors and summarize what news they are reading at the moment. Finally, news articles are recommended according to both the user's own state and the summarized state of the neighbors. Experiments on two public datasets show that the proposed model outperforms the state-of-the-art approaches significantly.

  Access Paper or Ask Questions