Hanlin Xiao

Cross-Granularity Representations for Biological Sequences: Insights from ESM and BiGCARP

Mar 21, 2026

Hanlin Xiao, Rainer Breitling, Eriko Takano, Mauricio A. Álvarez

Abstract: Recent advances in general-purpose foundation models have stimulated the development of large biological sequence models. While natural language shows symbolic granularity (characters, words, sentences), biological sequences exhibit hierarchical granularity whose levels (nucleotides, amino acids, protein domains, genes) further encode biologically functional information. In this paper, we investigate the integration of cross-granularity knowledge from models through a case study of BiGCARP, a Pfam domain-level model for biosynthetic gene clusters, and ESM, an amino acid-level protein language model. Using representation analysis tools and a set of probe tasks, we first explain why a straightforward cross-model embedding initialization fails to improve downstream performance in BiGCARP, and show that deeper-layer embeddings capture a more contextual and faithful representation of the model's learned knowledge. Furthermore, we demonstrate that representations at different granularities encode complementary biological knowledge, and that combining them yields measurable performance gains in intermediate-level prediction tasks. Our findings highlight cross-granularity integration as a promising strategy for improving both the performance and interpretability of biological foundation models.

* Proc. IEEE BIBM (2025) 6936-6943
* 9 pages, 4 figures, published in 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Disentangling Similarity and Relatedness in Topic Models

Mar 11, 2026

Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

Abstract: The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.

* 22 pages, 6 figures, 14 tables

Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support

Feb 26, 2025

Guoxin Wang, Minyu Gao, Shuai Yang, Ya Zhang, Lizhi He, Liang Huang, Hanlin Xiao, Yexuan Zhang, Wanyue Li, Lu Chen(+2 more)

Abstract: Large language models (LLMs), particularly those with reasoning capabilities, have rapidly advanced in recent years, demonstrating significant potential across a wide range of applications. However, their deployment in healthcare, especially in disease reasoning tasks, is hindered by the challenge of acquiring expert-level cognitive data. In this paper, we introduce Citrus, a medical language model that bridges the gap between clinical expertise and AI reasoning by emulating the cognitive processes of medical experts. The model is trained on a large corpus of simulated expert disease reasoning data, synthesized using a novel approach that accurately captures the decision-making pathways of clinicians. This approach enables Citrus to better simulate the complex reasoning processes involved in diagnosing and treating medical conditions. To further address the lack of publicly available datasets for medical reasoning tasks, we release the last-stage training data, including a custom-built medical diagnostic dialogue dataset. This open-source contribution aims to support further research and development in the field. Evaluations using authoritative benchmarks such as MedQA, covering tasks in medical reasoning and language understanding, show that Citrus achieves superior performance compared to other models of similar size. These results highlight Citrus's potential to significantly enhance medical decision support systems, providing a more accurate and efficient tool for clinical decision-making.

What Really is `Molecule' in Molecular Communications? The Quest for Physics of Particle-based Information Carriers

Dec 03, 2023

Hanlin Xiao, Kamela Dokaj, Ozgur B. Akan

Abstract: Molecular communication, as implied by its name, uses molecules as information carriers for communication between objects. It has an advantage over traditional electromagnetic-wave-based communication in that molecule-based systems could be biocompatible, operable in challenging environments, and energetically undemanding. Consequently, they are envisioned to have a broad range of applications, such as in the Internet of Bio-nano Things, targeted drug delivery, and agricultural monitoring. Despite the rapid development of the field, with an increasing number of theoretical models and experimental testbeds established by researchers, a fundamental aspect of the field has often been sidelined, namely, the nature of the molecule in molecular communication. The potential information molecules could exhibit a wide range of properties, making them require drastically different treatments when being modeled and experimented upon. Therefore, in this paper, we delve into the intricacies of commonly used information molecules, examining their fundamental physical characteristics, associated communication systems, and potential applications in a more realistic manner, focusing on the influence of their own properties. Through this comprehensive survey, we aim to offer a novel yet essential perspective on molecular communication, thereby bridging the current gap between theoretical research and real-world applications.
