"Topic": models, code, and papers

Niche as a determinant of word fate in online groups

Jun 02, 2011
Eduardo G. Altmann, Janet B. Pierrehumbert, Adilson E. Motter

Patterns of word use both reflect and influence a myriad of human activities and interactions. Like other entities that are reproduced and evolve, words rise or decline depending upon a complex interplay between their intrinsic properties and the environments in which they function. Using Internet discussion communities as model systems, we define the concept of a word niche as the relationship between the word and the characteristic features of the environments in which it is used. We develop a method to quantify two important aspects of the size of the word niche: the range of individuals using the word and the range of topics it is used to discuss. Controlling for word frequency, we show that these aspects of the word niche are strong determinants of changes in word frequency. Previous studies have already indicated that word frequency itself is a correlate of word success at historical time scales. Our analysis of changes in word frequencies over time reveals that the relative sizes of word niches are far more important than word frequencies in the dynamics of the entire vocabulary at shorter time scales, as the language adapts to new concepts and social groupings. We also distinguish endogenous versus exogenous factors as additional contributors to the fates of words, and demonstrate the force of this distinction in the rise of novel words. Our results indicate that short-term nonstationarity in word statistics is strongly driven by individual proclivities, including inclinations to provide novel information and to project a distinctive social identity.

* PLoS ONE 6(5), e19009 (2011) 
* Supporting Information is available here: 


MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset

Mar 08, 2022
Dan Saattrup Nielsen, Ryan McConville

Misinformation is becoming increasingly prevalent on social media and in news articles. It has become so widespread that we require algorithmic assistance utilising machine learning to detect such content. Training these machine learning models requires datasets of sufficient scale, diversity and quality. However, datasets in the field of automatic misinformation detection are predominantly monolingual, include a limited number of modalities and are not of sufficient scale and quality. Addressing this, we develop a data collection and linking system (MuMiN-trawl) to build a public misinformation graph dataset (MuMiN), containing rich social media data (tweets, replies, users, images, articles, hashtags) spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade. The dataset is made available as a heterogeneous graph via a Python package (mumin). We provide baseline results for two node classification tasks related to the veracity of a claim involving social media, and demonstrate that these are challenging tasks, with the highest macro-average F1-score being 62.55% and 61.45% for the two tasks, respectively. The MuMiN ecosystem is available at, including the data, documentation, tutorials and leaderboards.

* 9+3 pages 
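
The macro-average F1-score reported in the abstract above is the unweighted mean of per-class F1 scores, which penalises a classifier that ignores a minority class. The following is a minimal sketch of that metric in plain Python; the function name `macro_f1` and the toy labels are assumptions made for illustration and are not taken from the mumin package or the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced labels: the classifier always predicts the majority class.
y_true = ["misinformation"] * 8 + ["factual"] * 2
y_pred = ["misinformation"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

In this toy case accuracy looks high while macro F1 stays low, which is one reason a macro-averaged score is a common choice for imbalanced veracity labels.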


Building Autocorrelation-Aware Representations for Fine-Scale Spatiotemporal Prediction

Dec 10, 2021
Yijun Lin, Yao-Yi Chiang, Meredith Franklin, Sandrah P. Eckel, José Luis Ambite

Many scientific prediction problems have spatiotemporal data- and modeling-related challenges in handling complex variations in space and time using only sparse and unevenly distributed observations. This paper presents a novel deep learning architecture, Deep learning predictions for LocATion-dependent Time-sEries data (DeepLATTE), that explicitly incorporates theories of spatial statistics into neural networks to address these challenges. In addition to a feature selection module and a spatiotemporal learning module, DeepLATTE contains an autocorrelation-guided semi-supervised learning strategy to enforce both local autocorrelation patterns and global autocorrelation trends of the predictions in the learned spatiotemporal embedding space to be consistent with the observed data, overcoming the limitation of sparse and unevenly distributed observations. During the training process, both supervised and semi-supervised losses guide the updates of the entire network to: 1) prevent overfitting, 2) refine feature selection, 3) learn useful spatiotemporal representations, and 4) improve overall prediction. We conduct a demonstration of DeepLATTE using publicly available data for an important public health topic, air quality prediction, in a well-studied, complex physical environment - Los Angeles. The experiment demonstrates that the proposed approach provides accurate fine-spatial-scale air quality predictions and reveals the critical environmental factors affecting the results.

* Published in ICDM2020 


Robust Multi-view Registration of Point Sets with Laplacian Mixture Model

Oct 26, 2021
Jin Zhang, Mingyang Zhao, Xin Jiang, Dong-Ming Yan

Point set registration is an essential step in many computer vision applications, such as 3D reconstruction and SLAM. Although there exist many registration algorithms for different purposes, this topic is still challenging due to the increasing complexity of various real-world scenarios, such as heavy noise and outlier contamination. In this paper, we propose a novel probabilistic generative method to simultaneously align multiple point sets based on the heavy-tailed Laplacian distribution. The proposed method assumes each data point is generated by a Laplacian Mixture Model (LMM), where its centers are determined by the corresponding points in other point sets. Different from the previous Gaussian Mixture Model (GMM) based method, which minimizes the quadratic distance between points and centers of Gaussian probability density, LMM minimizes the sparsity-induced L1 distance and is thereby more robust against noise and outliers. We adopt the Expectation-Maximization (EM) framework to solve for the LMM parameters and rigid transformations. We approximate the L1 optimization as a linear programming problem by exponential mapping in Lie algebra, which can be effectively solved through the interior point method. To improve efficiency, we also solve the L1 optimization by the Alternating Direction Multiplier Method (ADMM). We demonstrate the advantages of our method by comparing it with representative state-of-the-art approaches on challenging benchmark data sets, in terms of robustness and accuracy.
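
The robustness argument above (L1 distance versus quadratic distance) can be seen in one dimension, where the minimiser of the summed squared distances is the mean and the minimiser of the summed absolute distances is the median. The sketch below is only that 1D illustration under made-up sample values; it is not the paper's LMM, EM, or ADMM procedure.

```python
import statistics

# Hypothetical 1D samples clustered near 1.0, plus a single gross outlier.
inliers = [1.0, 1.1, 0.9, 1.05, 0.95]
contaminated = inliers + [50.0]

# The mean minimises the sum of squared (L2) distances to the samples.
l2_center = statistics.mean(contaminated)

# The median minimises the sum of absolute (L1) distances to the samples.
l1_center = statistics.median(contaminated)

print(f"L2 (mean) center:   {l2_center:.3f}")
print(f"L1 (median) center: {l1_center:.3f}")
```

The single outlier drags the quadratic-loss center far from the cluster, while the L1 center stays close to the inliers.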


Towards Multi-Functional 6G Wireless Networks: Integrating Sensing, Communication and Security

Jul 16, 2021
Zhongxiang Wei, Fan Liu, Christos Masouros, Nanchi Su, Athina P. Petropulu

Integrated sensing and communication (ISAC) has recently emerged as a candidate 6G technology, aiming to unify the two key operations of the future network in a spectrum-, energy- and cost-efficient way. ISAC involves communicating information to receivers and simultaneously sensing targets, while both operations use the same waveforms, the same transmitter and ultimately the same network infrastructure. Nevertheless, the inclusion of information signalling into the probing waveform for target sensing raises unique and difficult challenges from the perspective of information security. At the same time, the sensing capability incorporated in the ISAC transmission offers unique opportunities to design secure ISAC techniques. This overview paper discusses these unique challenges and opportunities for the next generation of ISAC networks. We first briefly discuss the fundamentals of waveform design for sensing and communication. Then, we detail the challenges and contradictory objectives involved in securing ISAC transmission, along with state-of-the-art approaches to address them. We then identify the new opportunity of using the sensing capability to obtain knowledge of the targets, as an enabling approach against known weaknesses of PHY security. Finally, we illustrate a low-cost secure ISAC architecture, followed by a series of open research topics. This family of sensing-aided secure ISAC techniques brings new insight into providing information security, with an eye on robust and hardware-constrained designs tailored for low-cost ISAC devices.


Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

Apr 19, 2021
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% character error rate for the HKUST dataset and 12.0%/6.3% word error rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and further enables streaming ASR with limited accuracy degradation.

* Submitted to INTERSPEECH 2021 
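
The character and word error rates quoted in the abstract above are conventionally computed as the edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard calculation; the helper names and the example sentences are assumptions for illustration and do not reproduce the authors' scoring setup.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Hypothetical reference transcript and recogniser output.
reference = "the model recognizes long lectures"
hypothesis = "the model recognize long lecture notes"

wer = error_rate(reference.split(), hypothesis.split())  # word level
cer = error_rate(list(reference), list(hypothesis))      # character level
print(f"WER = {wer:.1%}, CER = {cer:.1%}")
```

Passing word lists gives a word error rate and passing character lists gives a character error rate, so one function covers both metrics mentioned in the abstract.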


SIGAN: A Novel Image Generation Method for Solar Cell Defect Segmentation and Augmentation

Apr 11, 2021
Binyi Su, Zhong Zhou, Haiyong Chen, Xiaochun Cao

Solar cell electroluminescence (EL) defect segmentation is an interesting and challenging topic. Many methods have been proposed for EL defect detection, but these methods are still unsatisfactory due to the diversity of the defects and backgrounds. In this paper, we provide a new idea of using a generative adversarial network (GAN) for defect segmentation. Firstly, the GAN-based method removes the defect region in the input defective image to get a defect-free image, while keeping the background almost unchanged. Then, the subtracted image is obtained by taking the difference between the defective input image and the generated defect-free image. Finally, the defect region can be segmented by thresholding the subtracted image. To keep the background unchanged before and after image generation, we propose a novel strong identity GAN (SIGAN), which adopts a novel strong identity loss to constrain the background consistency. The SIGAN can be used not only for defect segmentation, but also for augmenting small-sample defective datasets. Moreover, we release a new solar cell EL image dataset named EL-2019, which includes three types of images: crack, finger interruption and defect-free. Experiments on the EL-2019 dataset show that the proposed method achieves a 90.34% F-score, which outperforms many state-of-the-art methods in terms of solar cell defect segmentation results.

* 11 pages, 11 figures 
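
The last two steps described in the abstract above (subtract the generated defect-free image from the input, then threshold the difference) can be sketched as follows. The pixel values, the threshold, and the function name `segment_defects` are hypothetical, and the `defect_free` grid merely stands in for the output of a trained generator, which is not modelled here.

```python
def segment_defects(defective, defect_free, threshold):
    """Return a binary mask: 1 where the two images differ by more than threshold."""
    return [
        [1 if abs(d - g) > threshold else 0 for d, g in zip(row_d, row_g)]
        for row_d, row_g in zip(defective, defect_free)
    ]


# Hypothetical 3x3 grayscale patches (0-255). Dark pixels mark a crack.
defective = [
    [200, 198, 60],
    [201, 55, 199],
    [197, 200, 202],
]
# Stand-in for the generator output: same background, defect removed.
defect_free = [
    [200, 199, 200],
    [200, 198, 200],
    [198, 200, 201],
]

mask = segment_defects(defective, defect_free, threshold=50)
for row in mask:
    print(row)
```

Small background differences fall below the threshold and are ignored, which is why the approach depends on the generator leaving the background nearly unchanged.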


Using Transformer based Ensemble Learning to classify Scientific Articles

Feb 19, 2021
Sohom Ghosh, Ankush Chopra

Many times, reviewers fail to appreciate the novel ideas of a researcher and provide generic feedback. Thus, proper assignment of reviewers based on their area of expertise is necessary. Moreover, reading each and every paper from end to end for assigning it to a reviewer is a tedious task. In this paper, we describe a system which our team FideLIPI submitted in the shared task of SDPRA-2021 [14]. It comprises four independent sub-systems capable of classifying abstracts of scientific literature into one of the given seven classes. The first one is a RoBERTa [10] based model built over these abstracts. Adding topic model / Latent Dirichlet Allocation (LDA) [2] based features to the first model results in the second sub-system. The third one is a sentence-level RoBERTa [10] model. The fourth one is a Logistic Regression model built using Term Frequency Inverse Document Frequency (TF-IDF) features. We ensemble the predictions of these four sub-systems using majority voting to develop the final system, which gives an F1 score of 0.93 on the test and validation sets. This outperforms the existing State Of The Art (SOTA) model, SciBERT [1], in terms of F1 score on the validation set. Our codebase is available at

* 8 pages, 3 tables, 1 figure, Accepted at SDPRA-2021 (Collocated with PAKDD 2021) 
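
The majority-voting step that combines the four sub-systems in the abstract above can be sketched as below. The labels "A", "B", "C" and the per-model predictions are made up, and the tie-breaking rule (a tie goes to the label that the earliest-listed model voted for) is an assumption, since the abstract does not say how ties among four voters are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (all the same length) by majority vote.

    Ties are resolved in favour of the label encountered first, i.e. the one
    voted for by the earliest-listed model. This rule is an assumption.
    """
    final = []
    for votes in zip(*predictions):
        final.append(Counter(votes).most_common(1)[0][0])
    return final


# Hypothetical predictions from four sub-systems for three abstracts.
roberta          = ["A", "B", "C"]
roberta_lda      = ["A", "B", "A"]
sentence_roberta = ["B", "B", "C"]
tfidf_logreg     = ["A", "C", "A"]

ensemble = majority_vote([roberta, roberta_lda, sentence_roberta, tfidf_logreg])
print(ensemble)
```

With an even number of voters, listing the strongest individual model first is one simple way to make a tie fall back to its prediction.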


Learning Student Interest Trajectory for MOOC Thread Recommendation

Jan 16, 2021
Shalini Pandey, Andrew Lan, George Karypis, Jaideep Srivastava

In recent years, Massive Open Online Courses (MOOCs) have witnessed immense growth in popularity. Now, due to the recent COVID-19 pandemic, it is important to push the limits of online education. Discussion forums are a primary means of interaction among learners and instructors. However, with growing class sizes, students face the challenge of finding useful and informative discussion forums. This problem can be solved by matching the interests of students with thread contents. The fundamental challenge is that student interests drift as they progress through the course, and forum contents evolve as students or instructors update them. In our paper, we propose to predict the future interest trajectories of students. Our model consists of two key operations: 1) an update operation and 2) a projection operation. The update operation models the inter-dependency between the evolution of student and thread using coupled Recurrent Neural Networks when the student posts on the thread. The projection operation learns to estimate the future embeddings of students and threads. For students, the projection operation learns the drift in their interests caused by the change in the course topic they study. The projection operation for threads exploits how different posts induce varying interest levels in a student according to the thread structure. Extensive experimentation on three real-world MOOC datasets shows that our model significantly outperforms other baselines for thread recommendation.

* Accepted at IEEE ICDM Workshop on Continual Learning and Adaptation for Time Evolving Data, 2020 
