Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mohammed Bennamoun

Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Apr 26, 2025

Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Asim Khan, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood

Figure 1 for Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Figure 2 for Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Figure 3 for Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Figure 4 for Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Abstract:In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Our model aims to capture a broader range of information, supported by novel loss functions, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP

Via

Access Paper or Ask Questions

Polarisation-Inclusive Spiking Neural Networks for Real-Time RFI Detection in Modern Radio Telescopes

Apr 16, 2025

Nicholas J. Pritchard, Andreas Wicenec, Richard Dodson, Mohammed Bennamoun

Abstract:Radio Frequency Interference (RFI) is a known growing challenge for radio astronomy, intensified by increasing observatory sensitivity and prevalence of orbital RFI sources. Spiking Neural Networks (SNNs) offer a promising solution for real-time RFI detection by exploiting the time-varying nature of radio observation and neuron dynamics together. This work explores the inclusion of polarisation information in SNN-based RFI detection, using simulated data from the Hydrogen Epoch of Reionisation Array (HERA) instrument and provides power usage estimates for deploying SNN-based RFI detection on existing neuromorphic hardware. Preliminary results demonstrate state-of-the-art detection accuracy and highlight possible extensive energy-efficiency gains.

* 4 pages, 4 tables, accepted at URSI AP-RASC 2025

Via

Access Paper or Ask Questions

Advancing RFI-Detection in Radio Astronomy with Liquid State Machines

Apr 14, 2025

Nicholas J Pritchard, Andreas Wicenec, Mohammed Bennamoun, Richard Dodson

Abstract:Radio Frequency Interference (RFI) from anthropogenic radio sources poses significant challenges to current and future radio telescopes. Contemporary approaches to detecting RFI treat the task as a semantic segmentation problem on radio telescope spectrograms. Typically, complex heuristic algorithms handle this task of `flagging' in combination with manual labeling (in the most difficult cases). While recent machine-learning approaches have demonstrated high accuracy, they often fail to meet the stringent operational requirements of modern radio observatories. Owing to their inherently time-varying nature, spiking neural networks (SNNs) are a promising alternative method to RFI-detection by utilizing the time-varying nature of the spectrographic source data. In this work, we apply Liquid State Machines (LSMs), a class of spiking neural networks, to RFI-detection. We employ second-order Leaky Integrate-and-Fire (LiF) neurons, marking the first use of this architecture and neuron type for RFI-detection. We test three encoding methods and three increasingly complex readout layers, including a transformer decoder head, providing a hybrid of SNN and ANN techniques. Our methods extend LSMs beyond conventional classification tasks to fine-grained spatio-temporal segmentation. We train LSMs on simulated data derived from the Hyrogen Epoch of Reionization Array (HERA), a known benchmark for RFI-detection. Our model achieves a per-pixel accuracy of 98% and an F1-score of 0.743, demonstrating competitive performance on this highly challenging task. This work expands the sophistication of SNN techniques and architectures applied to RFI-detection, and highlights the effectiveness of LSMs in handling fine-grained, complex, spatio-temporal signal-processing tasks.

* 7 pages, 2 figures, 5 tables, accepted for publication at IJCNN 2025

Via

Access Paper or Ask Questions

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Apr 03, 2025

Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari, Neha Gour, Abderaouf Behouch, Taimur Hassan, Syed Talal Wasim, Nabil Maalej, Muzammal Naseer, Juergen Gall(+3 more)

Abstract:Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, that lead to the multi-modal instruction following data in X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

* Accepted at CVPR 2025

Via

Access Paper or Ask Questions

Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis

Mar 05, 2025

Awais Nizamani, Hamid Laga, Guanjin Wang, Farid Boussaid, Mohammed Bennamoun, Anuj Srivastava

Abstract:We propose a novel framework for the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. This problem is particularly challenging due to the arbitrary parameterizations of these surfaces and their varying deformation speeds, necessitating effective spatiotemporal registration. Traditionally, 4D surfaces are discretized, in space and time, before computing their spatiotemporal registrations, geodesics, and statistics. However, this approach may result in suboptimal solutions and, as we demonstrate in this paper, is not necessary. In contrast, we treat 4D surfaces as continuous functions in both space and time. We introduce Dynamic Spherical Neural Surfaces (D-SNS), an efficient smooth and continuous spatiotemporal representation for genus-0 4D surfaces. We then demonstrate how to perform core 4D shape analysis tasks such as spatiotemporal registration, geodesics computation, and mean 4D shape estimation, directly on these continuous representations without upfront discretization and meshing. By integrating neural representations with classical Riemannian geometry and statistical shape analysis techniques, we provide the building blocks for enabling full functional shape analysis. We demonstrate the efficiency of the framework on 4D human and face datasets. The source code and additional results are available at https://4d-dsns.github.io/DSNS/.

* CVPR 2025
* 22 pages, 23 figures, conference paper

Via

Access Paper or Ask Questions

AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Feb 03, 2025

Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood

Figure 1 for AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Figure 2 for AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Figure 3 for AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Figure 4 for AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Abstract:The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a 2 million underwater image-text paired dataset using heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pretraining loss to align visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.

Via

Access Paper or Ask Questions

GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Jan 23, 2025

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan

Figure 1 for GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Figure 2 for GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Figure 3 for GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Figure 4 for GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Abstract:Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.

Via

Access Paper or Ask Questions

Admitting Ignorance Helps the Video Question Answering Models to Answer

Jan 15, 2025

Haopeng Li, Tom Drummond, Mingming Gong, Mohammed Bennamoun, Qiuhong Ke

Figure 1 for Admitting Ignorance Helps the Video Question Answering Models to Answer

Figure 2 for Admitting Ignorance Helps the Video Question Answering Models to Answer

Figure 3 for Admitting Ignorance Helps the Video Question Answering Models to Answer

Figure 4 for Admitting Ignorance Helps the Video Question Answering Models to Answer

Abstract:Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses solely based on superficial question-answer correlations. We introduce methodologies for intervening in questions, utilizing techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice VideoQA and open-ended settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results clearly demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.

Via

Access Paper or Ask Questions

BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention

Dec 10, 2024

Weihua Liu, Jianhua Qiu, Said Boumaraf, Chaochao lin, Pan liyuan, Lin Li, Mohammed Bennamoun, Naoufel Werghi

Figure 1 for BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention

Figure 2 for BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention

Figure 3 for BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention

Figure 4 for BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention

Abstract:In response to the growing threat of deepfake technology, we introduce BENet, a Cross-Domain Robust Bias Expansion Network. BENet enhances the detection of fake faces by addressing limitations in current detectors related to variations across different types of fake face generation techniques, where ``cross-domain" refers to the diverse range of these deepfakes, each considered a separate domain. BENet's core feature is a bias expansion module based on autoencoders. This module maintains genuine facial features while enhancing differences in fake reconstructions, creating a reliable bias for detecting fake faces across various deepfake domains. We also introduce a Latent-Space Attention (LSA) module to capture inconsistencies related to fake faces at different scales, ensuring robust defense against advanced deepfake techniques. The enriched LSA feature maps are multiplied with the expanded bias to create a versatile feature space optimized for subtle forgeries detection. To improve its ability to detect fake faces from unknown sources, BENet integrates a cross-domain detector module that enhances recognition accuracy by verifying the facial domain during inference. We train our network end-to-end with a novel bias expansion loss, adopted for the first time, in face forgery detection. Extensive experiments covering both intra and cross-dataset demonstrate BENet's superiority over current state-of-the-art solutions.

Via

Access Paper or Ask Questions

Spiking Neural Networks for Radio Frequency Interference Detection in Radio Astronomy

Dec 09, 2024

Nicholas J. Pritchard, Andreas Wicenec, Mohammed Bennamoun, Richard Dodson

Abstract:Spiking Neural Networks (SNNs) promise efficient spatio-temporal data processing owing to their dynamic nature. This paper addresses a significant challenge in radio astronomy, Radio Frequency Interference (RFI) detection, by reformulating it as a time-series segmentation task inherently suited for SNN execution. Automated RFI detection systems capable of real-time operation with minimal energy consumption are increasingly important in modern radio telescopes. We explore several spectrogram-to-spike encoding methods and network parameters, applying first-order leaky integrate-and-fire SNNs to tackle RFI detection. To enhance the contrast between RFI and background information, we introduce a divisive normalisation-inspired pre-processing step, which improves detection performance across multiple encoding strategies. Our approach achieves competitive performance on a synthetic dataset and compelling results on real data from the Low-Frequency Array (LOFAR) instrument. To our knowledge, this work is the first to train SNNs on real radio astronomy data successfully. These findings highlight the potential of SNNs for performing complex time-series tasks, paving the way for efficient, real-time processing in radio astronomy and other data-intensive fields.

* 27 pages, 5 figures, 5 tables. In-Review

Via

Access Paper or Ask Questions