Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shanika Karunasekera

UrbanVerse: Learning Urban Region Representation Across Cities and Tasks

Feb 17, 2026

Fengze Sun, Egemen Tanin, Shanika Karunasekera, Zuqing Li, Flora D. Salim, Jianzhong Qi

Abstract:Recent advances in urban region representation learning have enabled a wide range of applications in urban analytics, yet existing methods remain limited in their capabilities to generalize across cities and analytic tasks. We aim to generalize urban representation learning beyond city- and task-specific settings, towards a foundation-style model for urban analytics. To this end, we propose UrbanVerse, a model for cross-city urban representation learning and cross-task urban analytics. For cross-city generalization, UrbanVerse focuses on features local to the target regions and structural features of the nearby regions rather than the entire city. We model regions as nodes on a graph, which enables a random walk-based procedure to form "sequences of regions" that reflect both local and neighborhood structural features for urban region representation learning. For cross-task generalization, we propose a cross-task learning module named HCondDiffCT. This module integrates region-conditioned prior knowledge and task-conditioned semantics into the diffusion process to jointly model multiple downstream urban prediction tasks. HCondDiffCT is generic. It can also be integrated with existing urban representation learning models to enhance their downstream task effectiveness. Experiments on real-world datasets show that UrbanVerse consistently outperforms state-of-the-art methods across six tasks under cross-city settings, achieving up to 35.89% improvements in prediction accuracy.

Via

Access Paper or Ask Questions

Langformers: Unified NLP Pipelines for Language Models

Apr 12, 2025

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Figure 1 for Langformers: Unified NLP Pipelines for Language Models

Figure 2 for Langformers: Unified NLP Pipelines for Language Models

Abstract:Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: https://langformers.com

Via

Access Paper or Ask Questions

Urban Region Representation Learning: A Flexible Approach

Mar 12, 2025

Fengze Sun, Yanchuan Chang, Egemen Tanin, Shanika Karunasekera, Jianzhong Qi

Figure 1 for Urban Region Representation Learning: A Flexible Approach

Figure 2 for Urban Region Representation Learning: A Flexible Approach

Figure 3 for Urban Region Representation Learning: A Flexible Approach

Figure 4 for Urban Region Representation Learning: A Flexible Approach

Abstract:The increasing availability of urban data offers new opportunities for learning region representations, which can be used as input to machine learning models for downstream tasks such as check-in or crime prediction. While existing solutions have produced promising results, an issue is their fixed formation of regions and fixed input region features, which may not suit the needs of different downstream tasks. To address this limitation, we propose a model named FlexiReg for urban region representation learning that is flexible with both the formation of urban regions and the input region features. FlexiReg is based on a spatial grid partitioning over the spatial area of interest. It learns representations for the grid cells, leveraging publicly accessible data, including POI, land use, satellite imagery, and street view imagery. We propose adaptive aggregation to fuse the cell representations and prompt learning techniques to tailor the representations towards different tasks, addressing the needs of varying formations of urban regions and downstream tasks. Extensive experiments on five real-world datasets demonstrate that FlexiReg outperforms state-of-the-art models by up to 202% in term of the accuracy of four diverse downstream tasks using the produced urban region representations.

Via

Access Paper or Ask Questions

Large Language Model Driven Agents for Simulating Echo Chamber Formation

Feb 25, 2025

Chenhao Gu, Ling Luo, Zainab Razia Zaidi, Shanika Karunasekera

Figure 1 for Large Language Model Driven Agents for Simulating Echo Chamber Formation

Figure 2 for Large Language Model Driven Agents for Simulating Echo Chamber Formation

Figure 3 for Large Language Model Driven Agents for Simulating Echo Chamber Formation

Figure 4 for Large Language Model Driven Agents for Simulating Echo Chamber Formation

Abstract:The rise of echo chambers on social media platforms has heightened concerns about polarization and the reinforcement of existing beliefs. Traditional approaches for simulating echo chamber formation have often relied on predefined rules and numerical simulations, which, while insightful, may lack the nuance needed to capture complex, real-world interactions. In this paper, we present a novel framework that leverages large language models (LLMs) as generative agents to simulate echo chamber dynamics within social networks. The novelty of our approach is that it incorporates both opinion updates and network rewiring behaviors driven by LLMs, allowing for a context-aware and semantically rich simulation of social interactions. Additionally, we utilize real-world Twitter (now X) data to benchmark the LLM-based simulation against actual social media behaviors, providing insights into the accuracy and realism of the generated opinion trends. Our results demonstrate the efficacy of LLMs in modeling echo chamber formation, capturing both structural and semantic dimensions of opinion clustering. %This work contributes to a deeper understanding of social influence dynamics and offers a new tool for studying polarization in online communities.

Via

Access Paper or Ask Questions

"Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts

Feb 24, 2025

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera, Muhammad Imran

Abstract:During crises, social media serves as a crucial coordination tool, but the vast influx of posts--from "actionable" requests and offers to generic content like emotional support, behavioural guidance, or outdated information--complicates effective classification. Although generative LLMs (Large Language Models) can address this issue with few-shot classification, their high computational demands limit real-time crisis response. While fine-tuning encoder-only models (e.g., BERT) is a popular choice, these models still exhibit higher inference times in resource-constrained environments. Moreover, although distilled variants (e.g., DistilBERT) exist, they are not tailored for the crisis domain. To address these challenges, we make two key contributions. First, we present CrisisHelpOffer, a novel dataset of 101k tweets collaboratively labelled by generative LLMs and validated by humans, specifically designed to distinguish actionable content from noise. Second, we introduce the first crisis-specific mini models optimized for deployment in resource-constrained settings. Across 13 crisis classification tasks, our mini models surpass BERT (also outperform or match the performance of RoBERTa, MPNet, and BERTweet), offering higher accuracy with significantly smaller sizes and faster speeds. The Medium model is 47% smaller with 3.8% higher accuracy at 3.5x speed, the Small model is 68% smaller with a 1.8% accuracy gain at 7.7x speed, and the Tiny model, 83% smaller, matches BERT's accuracy at 18.6x speed. All models outperform existing distilled variants, setting new benchmarks. Finally, as a case study, we analyze social media posts from a global crisis to explore help-seeking and assistance-offering behaviours in selected developing and developed countries.

Via

Access Paper or Ask Questions

CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media

May 20, 2024

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera, Muhammad Imran

Figure 1 for CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media

Figure 2 for CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media

Figure 3 for CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media

Figure 4 for CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media

Abstract:During times of crisis, social media platforms play a vital role in facilitating communication and coordinating resources. Amidst chaos and uncertainty, communities often rely on these platforms to share urgent pleas for help, extend support, and organize relief efforts. However, the sheer volume of conversations during such periods, which can escalate to unprecedented levels, necessitates the automated identification and matching of requests and offers to streamline relief operations. This study addresses the challenge of efficiently identifying and matching assistance requests and offers on social media platforms during emergencies. We propose CReMa (Crisis Response Matcher), a systematic approach that integrates textual, temporal, and spatial features for multi-lingual request-offer matching. By leveraging CrisisTransformers, a set of pre-trained models specific to crises, and a cross-lingual embedding space, our methodology enhances the identification and matching tasks while outperforming strong baselines such as RoBERTa, MPNet, and BERTweet, in classification tasks, and Universal Sentence Encoder, Sentence Transformers in crisis embeddings generation tasks. We introduce a novel multi-lingual dataset that simulates scenarios of help-seeking and offering assistance on social media across the 16 most commonly used languages in Australia. We conduct comprehensive cross-lingual experiments across these 16 languages, also while examining trade-offs between multiple vector search strategies and accuracy. Additionally, we analyze a million-scale geotagged global dataset to comprehend patterns in relation to seeking help and offering assistance on social media. Overall, these contributions advance the field of crisis informatics and provide benchmarks for future research in the area.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

Mar 25, 2024

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Figure 1 for Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

Figure 2 for Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

Figure 3 for Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

Figure 4 for Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

Abstract:Tasks such as semantic search and clustering on crisis-related social media texts enhance our comprehension of crisis discourse, aiding decision-making and targeted interventions. Pre-trained language models have advanced performance in crisis informatics, but their contextual embeddings lack semantic meaningfulness. Although the CrisisTransformers family includes a sentence encoder to address the semanticity issue, it remains monolingual, processing only English texts. Furthermore, employing separate models for different languages leads to embeddings in distinct vector spaces, introducing challenges when comparing semantic similarities between multi-lingual texts. Therefore, we propose multi-lingual sentence encoders (CT-XLMR-SE and CT-mBERT-SE) that embed crisis-related social media texts for over 50 languages, such that texts with similar meanings are in close proximity within the same vector space, irrespective of language diversity. Results in sentence encoding and sentence matching tasks are promising, suggesting these models could serve as robust baselines when embedding multi-lingual crisis-related social media texts. The models are publicly available at: https://huggingface.co/crisistransformers.

* Accepted to ISCRAM 2024

Via

Access Paper or Ask Questions

Urban Region Representation Learning with Attentive Fusion

Dec 07, 2023

Fengze Sun, Jianzhong Qi, Yanchuan Chang, Xiaoliang Fan, Shanika Karunasekera, Egemen Tanin

Figure 1 for Urban Region Representation Learning with Attentive Fusion

Figure 2 for Urban Region Representation Learning with Attentive Fusion

Figure 3 for Urban Region Representation Learning with Attentive Fusion

Figure 4 for Urban Region Representation Learning with Attentive Fusion

Abstract:An increasing number of related urban data sources have brought forth novel opportunities for learning urban region representations, i.e., embeddings. The embeddings describe latent features of urban regions and enable discovering similar regions for urban planning applications. Existing methods learn an embedding for a region using every different type of region feature data, and subsequently fuse all learned embeddings of a region to generate a unified region embedding. However, these studies often overlook the significance of the fusion process. The typical fusion methods rely on simple aggregation, such as summation and concatenation, thereby disregarding correlations within the fused region embeddings. To address this limitation, we propose a novel model named HAFusion. Our model is powered by a dual-feature attentive fusion module named DAFusion, which fuses embeddings from different region features to learn higher-order correlations between the regions as well as between the different types of region features. DAFusion is generic - it can be integrated into existing models to enhance their fusion process. Further, motivated by the effective fusion capability of an attentive module, we propose a hybrid attentive feature learning module named HALearning to enhance the embedding learning from each individual type of region features. Extensive experiments on three real-world datasets demonstrate that our model HAFusion outperforms state-of-the-art methods across three different prediction tasks. Using our learned region embedding leads to consistent and up to 31% improvements in the prediction accuracy.

Via

Access Paper or Ask Questions

CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Oct 01, 2023

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Figure 1 for CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Figure 2 for CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Figure 3 for CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Figure 4 for CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Abstract:Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released (https://huggingface.co/crisistransformers), with the anticipation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.

Via

Access Paper or Ask Questions

Unsupervised Domain-agnostic Fake News Detection using Multi-modal Weak Signals

May 18, 2023

Amila Silva, Ling Luo, Shanika Karunasekera, Christopher Leckie

Figure 1 for Unsupervised Domain-agnostic Fake News Detection using Multi-modal Weak Signals

Figure 2 for Unsupervised Domain-agnostic Fake News Detection using Multi-modal Weak Signals

Figure 3 for Unsupervised Domain-agnostic Fake News Detection using Multi-modal Weak Signals

Figure 4 for Unsupervised Domain-agnostic Fake News Detection using Multi-modal Weak Signals

Abstract:The emergence of social media as one of the main platforms for people to access news has enabled the wide dissemination of fake news. This has motivated numerous studies on automating fake news detection. Although there have been limited attempts at unsupervised fake news detection, their performance suffers due to not exploiting the knowledge from various modalities related to news records and due to the presence of various latent biases in the existing news datasets. To address these limitations, this work proposes an effective framework for unsupervised fake news detection, which first embeds the knowledge available in four modalities in news records and then proposes a novel noise-robust self-supervised learning technique to identify the veracity of news records from the multi-modal embeddings. Also, we propose a novel technique to construct news datasets minimizing the latent biases in existing news datasets. Following the proposed approach for dataset construction, we produce a Large-scale Unlabelled News Dataset consisting 419,351 news articles related to COVID-19, acronymed as LUND-COVID. We trained the proposed unsupervised framework using LUND-COVID to exploit the potential of large datasets, and evaluate it using a set of existing labelled datasets. Our results show that the proposed unsupervised framework largely outperforms existing unsupervised baselines for different tasks such as multi-modal fake news detection, fake news early detection and few-shot fake news detection, while yielding notable improvements for unseen domains during training.

* 15 pages

Via

Access Paper or Ask Questions