Abstract:Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.
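As a rough illustration of the CLIP-style alignment step described above, the sketch below contrasts projected transcriptomics and microscopy embeddings with a symmetric InfoNCE loss. It assumes precomputed embeddings from frozen pretrained encoders; the dimensions, projection heads, and temperature are illustrative assumptions, not the paper's configuration.

    # Minimal sketch of CLIP-style cross-modal alignment between transcriptomics
    # and microscopy embeddings. Dimensions, projection heads, and the temperature
    # are illustrative assumptions, not the paper's actual setup.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAligner(nn.Module):
        def __init__(self, tx_dim=512, img_dim=768, shared_dim=256, temperature=0.07):
            super().__init__()
            self.tx_proj = nn.Linear(tx_dim, shared_dim)    # transcriptomics head
            self.img_proj = nn.Linear(img_dim, shared_dim)  # microscopy head
            self.temperature = temperature

        def forward(self, tx_emb, img_emb):
            # Project both modalities into a shared space and L2-normalize.
            z_tx = F.normalize(self.tx_proj(tx_emb), dim=-1)
            z_img = F.normalize(self.img_proj(img_emb), dim=-1)
            logits = z_tx @ z_img.t() / self.temperature
            targets = torch.arange(logits.size(0), device=logits.device)
            # Symmetric InfoNCE: each weakly paired sample is its own positive.
            return 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.t(), targets))

    # Usage: batches of frozen foundation-model embeddings for weakly paired samples.
    loss = CrossModalAligner()(torch.randn(32, 512), torch.randn(32, 768))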
Abstract:Drug discovery is fundamentally a process of inferring the effects of treatments on patients, and would therefore benefit immensely from computational models that can reliably simulate patient responses, enabling researchers to generate and test large numbers of therapeutic hypotheses safely and economically before initiating costly clinical trials. Even a more specific model that predicts the functional response of cells to a wide range of perturbations would be tremendously valuable for discovering safe and effective treatments that successfully translate to the clinic. Creating such virtual cells has long been a goal of the computational research community that unfortunately remains unachieved given the daunting complexity and scale of cellular biology. Nevertheless, recent advances in AI, computing power, lab automation, and high-throughput cellular profiling provide new opportunities for reaching this goal. In this perspective, we present a vision for developing and evaluating virtual cells that builds on our experience at Recursion. We argue that in order to be a useful tool to discover novel biology, virtual cells must accurately predict the functional response of a cell to perturbations and explain how the predicted response is a consequence of modifications to key biomolecular interactions. We then introduce key principles for designing therapeutically-relevant virtual cells, describe a lab-in-the-loop approach for generating novel insights with them, and advocate for biologically-grounded benchmarks to guide virtual cell development. Finally, we make the case that our approach to virtual cells provides a useful framework for building other models at higher levels of organization, including virtual patients. We hope that these directions prove useful to the research community in developing virtual models optimized for positive impact on drug discovery outcomes.
Abstract:Accurately predicting cellular responses to genetic perturbations is essential for understanding disease mechanisms and designing effective therapies. Yet exhaustively exploring the space of possible perturbations (e.g., multi-gene perturbations or across tissues and cell types) is prohibitively expensive, motivating methods that can generalize to unseen conditions. In this work, we explore how knowledge graphs of gene-gene relationships can improve out-of-distribution (OOD) prediction across three challenging settings: unseen single perturbations; unseen double perturbations; and unseen cell lines. In particular, we present: (i) TxPert, a new state-of-the-art method that leverages multiple biological knowledge networks to predict transcriptional responses under OOD scenarios; (ii) an in-depth analysis demonstrating the impact of graphs, model architecture, and data on performance; and (iii) an expanded benchmarking framework that strengthens evaluation standards for perturbation modeling.
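To make the knowledge-graph idea concrete, the toy sketch below propagates a perturbation indicator over a gene-gene graph and decodes the result into a predicted expression shift. It is a minimal illustration under assumed shapes and a single message-passing round, not the TxPert architecture.

    # Toy sketch of knowledge-graph-informed perturbation prediction: mix gene
    # embeddings over a gene-gene graph, then decode the embedding of the
    # perturbed gene set into a predicted expression shift. Illustrative only.
    import torch
    import torch.nn as nn

    class GraphPerturbationModel(nn.Module):
        def __init__(self, n_genes, adj, hidden=64):
            super().__init__()
            # Row-normalized adjacency of the gene-gene knowledge graph.
            self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True).clamp(min=1))
            self.embed = nn.Embedding(n_genes, hidden)
            self.mix = nn.Linear(hidden, hidden)
            self.decode = nn.Linear(hidden, n_genes)  # predicted shift per gene

        def forward(self, perturbed_idx):
            h = self.embed.weight                     # (n_genes, hidden)
            h = torch.relu(self.mix(self.adj @ h))    # one round of neighborhood mixing
            pert = h[perturbed_idx].mean(dim=0)       # embedding of the perturbed gene set
            return self.decode(pert)                  # expression shift over all genes

    adj = (torch.rand(100, 100) < 0.05).float()
    model = GraphPerturbationModel(n_genes=100, adj=adj)
    shift = model(torch.tensor([3, 17]))  # e.g., an unseen double perturbation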
Abstract:Predicting the labels of graph-structured data is crucial in scientific applications and is often achieved using graph neural networks (GNNs). However, when data is scarce, GNNs suffer from overfitting, leading to poor performance. Recently, Gaussian processes (GPs) with graph-level inputs have been proposed as an alternative. In this work, we extend the Gaussian process framework to simplicial complexes (SCs), enabling the handling of edge-level attributes and attributes supported on higher-order simplices. We further augment the resulting SC representations by considering their Hodge decompositions, allowing us to account for homological information, such as the number of holes, in the SC. We demonstrate that our framework enhances predictions across various applications, paving the way for GPs to be more widely used for graph and SC-level predictions.
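For readers unfamiliar with the Hodge decomposition mentioned above, the small numpy example below splits an edge signal on a toy simplicial complex (a filled triangle plus a hollow one) into gradient, curl, and harmonic parts; the hand-written boundary matrices and signal are illustrative only.

    # Hodge decomposition of an edge signal on a toy simplicial complex:
    # a filled triangle on nodes 0-2 and a hollow triangle on nodes 2-4.
    # The nonzero harmonic part reflects the hole left by the hollow cycle.
    import numpy as np

    # B1: node-to-edge boundary map, edges ordered (0,1),(1,2),(0,2),(2,3),(3,4),(2,4).
    B1 = np.array([[-1,  0, -1,  0,  0,  0],
                   [ 1, -1,  0,  0,  0,  0],
                   [ 0,  1,  1, -1,  0, -1],
                   [ 0,  0,  0,  1, -1,  0],
                   [ 0,  0,  0,  0,  1,  1]], dtype=float)
    # B2: edge-to-triangle boundary map for the single filled triangle (0,1,2).
    B2 = np.array([[1], [1], [-1], [0], [0], [0]], dtype=float)
    assert np.allclose(B1 @ B2, 0)  # boundary of a boundary vanishes

    f = np.array([1.0, -2.0, 0.5, 3.0, 1.0, -1.0])  # an edge-level signal (e.g., a flow)

    # Gradient part: projection onto im(B1^T); curl part: projection onto im(B2).
    phi, *_ = np.linalg.lstsq(B1.T, f, rcond=None)
    grad = B1.T @ phi
    psi, *_ = np.linalg.lstsq(B2, f - grad, rcond=None)
    curl = B2 @ psi
    harmonic = f - grad - curl  # the piece carrying homological (hole) information

    print(grad, curl, harmonic)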
Abstract:SMILES-based molecular generative models have been pivotal in drug design but face challenges in fragment-constrained tasks. To address this, the Sequential Attachment-based Fragment Embedding (SAFE) representation was recently introduced as an alternative that streamlines those tasks. In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms. We found that larger, more diverse datasets improve performance, with the LLaMA architecture using Rotary Positional Embedding proving most robust. SAFE-based models also consistently outperform SMILES-based approaches in scaffold decoration and linker design, particularly with BRICS decomposition yielding the best results. These insights highlight key factors that significantly impact the efficacy of SAFE-based generative models.
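Since BRICS decomposition is central to the best-performing setup, the snippet below shows what that preprocessing looks like in RDKit on an arbitrary drug-like molecule (celecoxib); it only illustrates the fragmentation step, not the full training pipeline.

    # Minimal RDKit example of BRICS decomposition, the bond-disconnection scheme
    # highlighted above. The example molecule is arbitrary and for illustration only.
    from rdkit import Chem
    from rdkit.Chem import BRICS

    mol = Chem.MolFromSmiles("Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1")  # celecoxib
    for frag in sorted(BRICS.BRICSDecompose(mol)):
        print(frag)  # fragments carry [n*] dummy atoms marking BRICS attachment points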
Abstract:Our understanding of the relationships among genes, compounds, and their interactions in living organisms remains limited due to technological constraints and the complexity of biological data. Deep learning has shown promise in exploring these relationships using various data types. However, transcriptomics, which provides detailed insights into cellular states, is still underused due to its high noise levels and limited data availability. Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights, especially with the rise of many new foundation models for transcriptomics, yet no benchmark has been established to robustly evaluate the effectiveness of these emerging models for perturbation analysis. This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks for comparing the performance of pretrained foundation models to each other and to more classical techniques of learning from transcriptomics data. We compile diverse public datasets from different sequencing techniques and cell lines to assess model performance. Our approach identifies scVI and PCA as far better suited for understanding biological perturbations than existing foundation models, especially in real-world applications.
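The kind of simple baseline the benchmark favors can be sketched in a few lines; the example below embeds log-normalized expression profiles with PCA and checks whether replicates of the same perturbation are nearest neighbors. The random data and the retrieval metric are placeholders, not the paper's datasets or tasks.

    # Sketch of a PCA baseline of the kind compared against foundation models:
    # embed log-normalized expression profiles and check whether replicates of the
    # same perturbation retrieve each other. Data and metric are placeholders.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    counts = rng.poisson(5, size=(200, 2000))          # profiles x genes
    labels = np.repeat(np.arange(40), 5)               # 40 perturbations, 5 replicates each

    X = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)  # log-normalize
    Z = PCA(n_components=50).fit_transform(X)          # PCA embedding

    # Fraction of profiles whose nearest neighbor shares the same perturbation label.
    _, idx = NearestNeighbors(n_neighbors=2).fit(Z).kneighbors(Z)
    print((labels[idx[:, 1]] == labels).mean())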
Abstract:In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical for improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed-budget one, where the dataset size remains constant, and a fixed-molecular-set one, which holds structural diversity fixed while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitations of MLIP models in generalizing beyond their training distribution, emphasizing the importance of defining an applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts.
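The fixed-budget design can be pictured with a trivial sketch: the total number of conformations is held constant while the split between structural diversity (number of molecules) and conformational diversity (conformers per molecule) varies. The numbers below are illustrative, not the paper's actual budgets.

    # Fixed-budget sketch: a constant total number of conformations, traded off
    # between structural diversity (molecules) and conformational diversity
    # (conformers per molecule). Budget and grid values are illustrative.
    budget = 10_000
    for n_molecules in (100, 500, 1000, 2500):
        n_conformers = budget // n_molecules
        print(f"{n_molecules} molecules x {n_conformers} conformers "
              f"= {n_molecules * n_conformers} total conformations")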
Abstract:Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining full compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through extensive experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design.
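To convey the fragment-block idea concretely, the sketch below cuts a molecule at BRICS bonds with RDKit and prints the dot-separated fragments. SAFE itself encodes the attachment points as ring-closure digits so the string remains a valid SMILES of the whole molecule, a step handled by the official tooling; here the open valences are shown as dummy atoms for readability, and the example molecule is arbitrary.

    # Conceptual illustration of the fragment-block idea behind SAFE: cut a
    # molecule at BRICS bonds and print the dot-separated fragment blocks.
    # This is not the SAFE encoder itself, which keeps the full string a valid
    # SMILES by writing attachment points as ring-closure digits.
    from rdkit import Chem
    from rdkit.Chem.BRICS import FindBRICSBonds

    mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # acetaminophen
    bond_ids = [mol.GetBondBetweenAtoms(a1, a2).GetIdx()
                for (a1, a2), _ in FindBRICSBonds(mol)]
    if bond_ids:
        fragmented = Chem.FragmentOnBonds(mol, bond_ids, addDummies=True)
        print(Chem.MolToSmiles(fragmented))  # dot-separated fragment blocks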
Abstract:The fundamental goal of generative drug design is to propose optimized molecules that meet predefined activity, selectivity, and pharmacokinetic criteria. Despite recent progress, we argue that existing generative methods are limited in their ability to favourably shift the distributions of molecular properties during optimization. We instead propose a novel Reinforcement Learning framework for molecular design in which an agent learns to directly optimize through a space of synthetically-accessible drug-like molecules. This becomes possible by defining transitions in our Markov Decision Process as chemical reactions, and allows us to leverage synthetic routes as an inductive bias. We validate our method by demonstrating that it outperforms existing state-of-the-art approaches in the optimization of pharmacologically-relevant objectives, while results on multi-objective optimization tasks suggest increased scalability to realistic pharmaceutical design problems.
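A minimal sketch of the reaction-as-transition idea is shown below: the state is a molecule, and an action applies a reaction template to it with a chosen reactant. The amide-coupling SMARTS, the reactants, and the bare step function are illustrative stand-ins; the actual framework couples such transitions with a learned policy and pharmacological objectives.

    # Sketch of one MDP transition defined as a chemical reaction: apply a generic
    # amide-coupling template (written in SMARTS) to the current molecule with a
    # chosen reactant. Template and reactants are illustrative only.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    amide_coupling = AllChem.ReactionFromSmarts(
        "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
    )

    def step(state_smiles, reactant_smiles):
        """Apply one reaction 'action' and return the new state (product SMILES)."""
        state = Chem.MolFromSmiles(state_smiles)
        reactant = Chem.MolFromSmiles(reactant_smiles)
        products = amide_coupling.RunReactants((state, reactant))
        if not products:
            return state_smiles  # action not applicable; state unchanged
        product = products[0][0]
        Chem.SanitizeMol(product)
        return Chem.MolToSmiles(product)

    print(step("OC(=O)c1ccccc1", "NCCO"))  # benzoic acid + ethanolamine -> amide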
Abstract:Recent work in graph neural networks (GNNs) has led to improvements in molecular activity and property prediction tasks. However, GNNs lack interpretability as they fail to capture the relative importance of various molecular substructures due to the absence of efficient intermediate pooling steps for sparse graphs. To address this issue, we propose LaPool (Laplacian Pooling), a novel, data-driven, and interpretable graph pooling method that takes into account the node features and graph structure to improve molecular understanding. Inspired by theories in graph signal processing, LaPool performs a feature-driven hierarchical segmentation of molecules by selecting a set of centroid nodes from a graph as cluster representatives. It then learns a sparse assignment of remaining nodes into these clusters using an attention mechanism. We benchmark our model by showing that it outperforms recent graph pooling layers on molecular graph understanding and prediction tasks. We then demonstrate improved interpretability by identifying important molecular substructures and generating novel and valid molecules, with important applications in drug discovery and pharmacology.
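A didactic reduction of the two ingredients described above (Laplacian-based centroid selection and attention-based soft assignment) is sketched below; it is not the published LaPool layer, and the graph, feature sizes, and temperature are illustrative.

    # Simplified sketch of the two LaPool ingredients: pick centroid nodes whose
    # features vary most relative to their neighborhood (a Laplacian signal), then
    # softly assign all nodes to those centroids via attention. Didactic only.
    import torch
    import torch.nn.functional as F

    def lapool_sketch(x, adj, k=3):
        """x: (n, d) node features; adj: (n, n) adjacency; returns (k, d) cluster features."""
        deg = torch.diag(adj.sum(dim=1))
        laplacian = deg - adj
        # Nodes with the largest Laplacian response act as cluster centroids.
        saliency = (laplacian @ x).norm(dim=1)
        centroids = saliency.topk(k).indices
        # Soft assignment of every node to a centroid by feature similarity.
        scores = F.normalize(x, dim=1) @ F.normalize(x[centroids], dim=1).t()
        assign = torch.softmax(scores / 0.1, dim=1)   # (n, k)
        return assign.t() @ x                         # pooled cluster features

    x = torch.randn(10, 16)
    adj = (torch.rand(10, 10) < 0.3).float()
    adj = ((adj + adj.t()) > 0).float()
    adj.fill_diagonal_(0)
    pooled = lapool_sketch(x, adj)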