Finding potential research collaborators is a challenging task, especially in today's fast-growing and interdisciplinary research landscape. While traditional methods often rely on observable relationships such as co-authorships and citations to construct research networks, in this work we focus solely on publication content, building a topic-based research network with BERTopic and a fine-tuned SciBERT model that connects and recommends researchers across disciplines based on shared topical interests. A major challenge we address is publication imbalance, where some researchers publish much more than others, often across several topics. Without careful handling, their less frequent interests are hidden under dominant topics, limiting the network's ability to detect their full research scope. To tackle this, we introduce a cloning strategy that clusters a researcher's publications and treats each cluster as a separate node. This allows researchers to be part of multiple communities, improving the detection of interdisciplinary links. Evaluation of the proposed method shows that the cloned network structure leads to more meaningful communities and uncovers a broader set of collaboration opportunities.
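As a rough illustration of the cloning strategy described above, the sketch below clusters each researcher's publication embeddings, creates one "clone" node per cluster, and links clones that share BERTopic topics. The encoder name, the KMeans clustering step, and the shared-topic edge rule are assumptions for illustration only, not the authors' exact pipeline.

```python
# Sketch: clone researchers into per-cluster nodes, then link clones that share
# BERTopic topics. Model names and clustering choices are illustrative assumptions.
from collections import defaultdict

import networkx as nx
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def build_cloned_topic_network(pubs_by_author, n_clusters=3):
    """pubs_by_author: dict mapping author id -> list of abstract strings."""
    # Stand-in encoder; the paper uses a fine-tuned SciBERT model.
    encoder = SentenceTransformer("allenai/scibert_scivocab_uncased")
    all_docs, doc_owners = [], []

    # 1) Clone step: cluster each author's publications; each cluster becomes a node.
    for author, docs in pubs_by_author.items():
        emb = encoder.encode(docs)
        k = min(n_clusters, len(docs))
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(emb)
        for doc, lab in zip(docs, labels):
            all_docs.append(doc)
            doc_owners.append(f"{author}#clone{lab}")

    # 2) Topic step: fit BERTopic over all documents and collect topics per clone node.
    topic_model = BERTopic(embedding_model=encoder)
    topics, _ = topic_model.fit_transform(all_docs)
    topics_per_clone = defaultdict(set)
    for owner, topic in zip(doc_owners, topics):
        if topic != -1:  # skip BERTopic's outlier topic
            topics_per_clone[owner].add(topic)

    # 3) Graph step: connect clone nodes that share at least one topic.
    graph = nx.Graph()
    graph.add_nodes_from(topics_per_clone)
    clones = list(topics_per_clone)
    for i, a in enumerate(clones):
        for b in clones[i + 1:]:
            shared = topics_per_clone[a] & topics_per_clone[b]
            if shared:
                graph.add_edge(a, b, weight=len(shared))
    return graph
```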
Automated voice calls with health information are a proven method for disseminating maternal and child health information among beneficiaries and are deployed in several programs around the world. However, these programs often suffer from beneficiary dropoffs and poor engagement. In previous work, through real-world trials, we showed that an AI model, specifically a restless bandit model, could identify beneficiaries who would benefit most from live service call interventions, preventing dropoffs and boosting engagement. However, one key question has remained open so far: does such improved listenership via AI-targeted interventions translate into beneficiaries' improved knowledge and health behaviors? We present a first study that shows not only listenership improvements due to AI interventions, but also simultaneously links these improvements to health behavior changes. Specifically, we demonstrate that AI-scheduled interventions, which enhance listenership, lead to statistically significant improvements in beneficiaries' health behaviors such as taking iron or calcium supplements in the postnatal period, as well as understanding of critical health topics during pregnancy and infancy. This underscores the potential of AI to drive meaningful improvements in maternal and child health.
Natural Language Understanding (NLU) is a fundamental task in Natural Language Processing (NLP). The evaluation of NLU capabilities has attracted considerable research attention in recent years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostic datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and multilingual NLU benchmarks, with a particular emphasis on their diagnostic datasets and the linguistic phenomena they cover. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. In highlighting the gaps in the state of the art, we note that there is no naming convention for macro and micro categories, nor a standard set of linguistic phenomena that should be covered. Consequently, we formulate a research question regarding the evaluation of diagnostic benchmarks: "Why do we not have an evaluation standard for NLU diagnostic benchmarks, similar to the ISO standards in industry?" We conduct an in-depth analysis and comparison of the covered linguistic phenomena to support experts in building a global hierarchy of linguistic phenomena in the future. We believe that evaluation metrics for diagnostic evaluation would yield deeper insights when comparing the results of the studied models across different diagnostic benchmarks.
Visual Question Answering (VQA) has been a widely studied topic, with extensive research focusing on how VLMs respond to answerable questions based on real-world images. However, there has been limited exploration of how these models handle unanswerable questions, particularly in cases where they should abstain from providing a response. This research investigates VQA performance when models are shown unrealistic generated images or asked unanswerable questions, assessing whether they recognize the limitations of their knowledge or attempt to generate incorrect answers. We introduce a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types: (1) hybrid entities that fuse objects and animals, (2) objects depicted in unconventional or impossible scenarios, and (3) fictional or non-existent figures. The questions posed are logically structured yet inherently unanswerable, testing whether models can correctly recognize their limitations. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models answer even when they should abstain.
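As a simple illustration of how abstention on such questions might be scored, the snippet below flags refusal-style answers with a keyword check; the marker list and the scoring rule are assumptions for illustration, not VisionTrap's official evaluation protocol.

```python
# Illustrative abstention check for unanswerable VQA items; the refusal phrases
# and the aggregate metric are assumptions, not the benchmark's official metric.
REFUSAL_MARKERS = (
    "cannot answer", "can't answer", "unanswerable", "not possible to tell",
    "i don't know", "does not exist", "not enough information",
)

def abstained(answer: str) -> bool:
    """Return True if the model's answer signals abstention."""
    a = answer.lower()
    return any(marker in a for marker in REFUSAL_MARKERS)

def abstention_rate(predictions):
    """predictions: list of model answers to unanswerable questions."""
    return sum(abstained(p) for p in predictions) / max(len(predictions), 1)

# Example: two answers, one refusal -> rate 0.5
print(abstention_rate([
    "It is a golden retriever.",
    "I don't know; that creature does not exist.",
]))
```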
Fluid antenna systems (FASs) have become a popular topic in the wireless community as an effective yet simple means of exploiting spatial diversity. Due to the limitations of physically moving radiating elements, electronically reconfigurable antennas are emerging as practical implementations of FASs, since changing the radiation pattern is functionally equivalent to physically moving the device. However, electronically reconfigurable antennas pose a challenge in terms of analytical modeling, often requiring full-wave simulations or measurements for their characterization; this severely limits the extraction of theoretical insights useful for system design. Motivated by these difficulties and the growing interest in FASs, we propose in this paper a complete analytical model for metasurface-based embodiments of FASs. Specifically, we advocate for the implementation of the FAS concept through dynamic metasurface antennas (DMAs), hitherto proposed as array replacements in multiple-input multiple-output (MIMO) systems. We leverage circuit theory to rewrite the conventional signal model of FASs in terms of admittance matrices accounting for the electromagnetic effects inherent to metasurfaces. The model is validated with full-wave simulations, showing good agreement. We further illustrate how to apply the model for standard performance analysis, and provide closed-form expressions for key metrics, including the resulting signal covariance matrix. Results confirm that practical DMA-based FASs can achieve similar performance to that of idealized implementations of position-flexible antennas.
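The toy sketch below illustrates the admittance-matrix viewpoint in isolation: terminating an N-port structure with tunable load admittances changes the port voltages, and sweeping the loads plays the role of "moving" the antenna. All numerical values are synthetic and the model is a deliberately simplified stand-in for the paper's DMA formulation.

```python
# Toy sketch of the admittance-matrix viewpoint: tuning per-element load admittances
# reshapes the effective channel, mimicking antenna repositioning. Coupling values
# and channel draws are synthetic, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
N = 8  # number of metasurface elements (ports)

# Symmetric mutual-admittance matrix of the N-port structure (synthetic values).
Y = 0.05 * rng.standard_normal((N, N))
Y = (Y + Y.T) / 2 + np.eye(N)

def port_voltages(y_load, i_src):
    """Solve i_src = (Y + diag(y_load)) v for the port voltages v."""
    return np.linalg.solve(Y + np.diag(y_load), i_src)

i_src = np.zeros(N)
i_src[0] = 1.0  # excite the feed port only
h = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # per-element propagation weights

# Sweep a tunable load configuration (the "fluid position" analogue) and pick the best.
best = max(
    (abs(h @ port_voltages(np.full(N, g), i_src)) ** 2, g)
    for g in np.linspace(0.1, 2.0, 20)
)
print(f"best |effective channel|^2 = {best[0]:.3f} at load admittance {best[1]:.2f}")
```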
Hierarchical text classification (HTC) is a natural language processing task whose objective is to categorise text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind combining these feature extractors is that the PLM captures finer-grained contextual and semantic information while the topic model obtains high-level representations that consider the corpus of documents as a whole. In this paper, we use an HTC approach that employs a PLM and a topic model to extract features from text documents, which are then used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism, which obtains label-specific document representations by weighing the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed to be beneficial.
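A minimal PyTorch sketch of the described fusion is given below: token-level PLM features and a document-level topic vector pass through separate convolutional layers, are concatenated, and a label-wise attention layer produces one representation per class. All dimensions and layer choices are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of PLM + topic-model feature fusion with label-wise attention.
import torch
import torch.nn as nn

class LabelWiseAttentionHTC(nn.Module):
    def __init__(self, plm_dim=768, topic_dim=50, conv_dim=256, num_labels=100):
        super().__init__()
        self.plm_conv = nn.Conv1d(plm_dim, conv_dim, kernel_size=3, padding=1)
        self.topic_conv = nn.Conv1d(topic_dim, conv_dim, kernel_size=1)
        # One attention query vector per label.
        self.label_queries = nn.Parameter(torch.randn(num_labels, conv_dim))
        self.classifier = nn.Linear(conv_dim, 1)

    def forward(self, plm_tokens, topic_vec):
        # plm_tokens: (batch, seq_len, plm_dim); topic_vec: (batch, topic_dim)
        h_plm = self.plm_conv(plm_tokens.transpose(1, 2))            # (batch, conv_dim, seq_len)
        h_topic = self.topic_conv(topic_vec.unsqueeze(-1))           # (batch, conv_dim, 1)
        feats = torch.cat([h_plm, h_topic], dim=-1).transpose(1, 2)  # (batch, seq_len+1, conv_dim)

        # Label-wise attention: each label weighs the combined features independently.
        scores = torch.einsum("lc,bnc->bln", self.label_queries, feats)
        attn = scores.softmax(dim=-1)
        label_repr = torch.einsum("bln,bnc->blc", attn, feats)       # (batch, num_labels, conv_dim)
        return self.classifier(label_repr).squeeze(-1)               # one logit per label

# Example forward pass with random features.
model = LabelWiseAttentionHTC()
logits = model(torch.randn(2, 128, 768), torch.randn(2, 50))
print(logits.shape)  # torch.Size([2, 100])
```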
The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and provide a qualitative evaluation of its topic modeling and word embedding performance.
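A compact sketch of a row-stochastic DEDICOM factorisation S ≈ A R Aᵀ is shown below, with the row-stochastic constraint enforced through a softmax parametrisation and the factors fitted by gradient descent; the optimiser and loss are illustrative and may differ from the paper's training scheme.

```python
# Sketch: row-stochastic DEDICOM, S ≈ A R Aᵀ, trained by gradient descent.
# Rows of A are kept on the simplex via a softmax over unconstrained logits.
import torch

def row_stochastic_dedicom(S, k=10, steps=2000, lr=0.01):
    """S: (n, n) PMI matrix as a torch tensor. Returns factors (A, R)."""
    n = S.shape[0]
    A_logits = torch.randn(n, k, requires_grad=True)  # softmax -> rows sum to 1
    R = torch.randn(k, k, requires_grad=True)
    opt = torch.optim.Adam([A_logits, R], lr=lr)

    for _ in range(steps):
        A = torch.softmax(A_logits, dim=1)
        loss = torch.norm(S - A @ R @ A.T) ** 2 / n**2  # mean squared reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.softmax(A_logits, dim=1).detach(), R.detach()

# Toy usage: factorise a random symmetric "PMI-like" matrix.
S = torch.randn(50, 50)
S = (S + S.T) / 2
A, R = row_stochastic_dedicom(S)
print(A.shape, R.shape, float(A[0].sum()))  # rows of A sum to 1
```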
Bridge maintenance and safety are essential for transportation authorities, and Non-Destructive Evaluation (NDE) techniques are critical to assessing structural integrity. However, interpreting NDE data can be time-consuming and requires expertise, potentially delaying decision-making. Recent advancements in Large Language Models (LLMs) offer new ways to automate and improve this analysis. This pilot study introduces a holistic assessment of LLM capabilities for interpreting NDE contour maps and demonstrates the effectiveness of LLMs in providing detailed bridge condition analyses. It establishes a framework for integrating LLMs into bridge inspection workflows, indicating that LLM-assisted analysis can enhance efficiency without compromising accuracy. In this study, nine LLMs are explored with prompts specifically designed to enhance the quality of image descriptions, which are applied to interpret five different NDE contour maps obtained through technologies for assessing bridge conditions. Each LLM is evaluated based on its ability to produce detailed descriptions, identify defects, provide actionable recommendations, and demonstrate overall accuracy. The research indicates that four of the nine models provide better image descriptions, effectively covering a wide range of topics related to the bridge's condition. The outputs from these four models are summarized using five different LLMs to form a comprehensive overview of the bridge. Notably, ChatGPT-4 and Claude 3.5 Sonnet generate more effective summaries. The findings suggest that LLMs have the potential to significantly improve efficiency and accuracy. This pilot study presents an innovative approach that leverages LLMs in parallel for image captioning and summarization, enabling faster decision-making in bridge maintenance and enhancing infrastructure management and safety assessments.
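The describe-then-summarise workflow could look roughly like the sketch below, here using the OpenAI Python client as one possible backend; the prompt wording, model names, and helper functions are assumptions rather than the study's exact configuration.

```python
# Illustrative describe-then-summarise pipeline for NDE contour maps.
# Prompts, model names, and helper functions are assumptions, not the study's setup.
import base64
from openai import OpenAI

client = OpenAI()

DESCRIBE_PROMPT = (
    "You are a bridge inspection assistant. Describe this NDE contour map in detail: "
    "note defect locations, severity, and recommended follow-up actions."
)

def describe_contour_map(image_path: str, model: str = "gpt-4o") -> str:
    """Ask one vision-capable LLM for a detailed description of a contour map."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": DESCRIBE_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def summarise_descriptions(descriptions: list[str], model: str = "gpt-4o") -> str:
    """Condense per-map descriptions into one overall bridge condition report."""
    joined = "\n\n".join(descriptions)
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Summarise these bridge NDE descriptions into one overall condition report:\n" + joined,
        }],
    )
    return resp.choices[0].message.content
```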
As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model. We present a novel framework, Preference Learning Using Summarization (PLUS), that learns text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. We train the user-summarization model with reinforcement learning and update the reward model simultaneously, creating an online co-adaptation loop. We show that, in contrast to prior personalized RLHF techniques and in-context learning of user information, summaries produced by PLUS capture meaningful aspects of a user's preferences. Across different pluralistic user datasets, we show that our method is robust to new users and diverse conversation topics. Additionally, we demonstrate that the textual summaries generated about users can be transferred for zero-shot personalization of stronger, proprietary models like GPT-4. The resulting user summaries are not only concise and portable but also easy for users to interpret and modify, allowing for more transparency and user control in LLM alignment.
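A minimal sketch of the summary-conditioning idea is given below: a text summary of the user is prepended to the conversation before the reward model scores a candidate response, so the predicted reward depends on who the user is. The backbone model, prompt format, and function names are illustrative assumptions, and the snippet omits the RL training of the summariser and the reward model.

```python
# Sketch of a summary-conditioned reward model: the user summary is prepended to the
# conversation before scoring. Backbone and prompt format are illustrative assumptions;
# the classification head below is untrained, so scores are arbitrary until fine-tuned.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilroberta-base"  # stand-in backbone for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

def personalized_reward(user_summary: str, prompt: str, response: str) -> float:
    """Score a candidate response conditioned on a text summary of the user."""
    text = (
        f"[USER SUMMARY] {user_summary}\n"
        f"[PROMPT] {prompt}\n"
        f"[RESPONSE] {response}"
    )
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

# The same response can receive different rewards under different user summaries.
summary_a = "Prefers concise, technical answers with citations."
summary_b = "Prefers friendly, step-by-step explanations with analogies."
r = "Gradient clipping bounds the norm of the gradient before the optimizer step."
print(personalized_reward(summary_a, "Explain gradient clipping.", r))
print(personalized_reward(summary_b, "Explain gradient clipping.", r))
```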
Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond "common sense", rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most work in computational humour on overly simple joke forms.