
Pablo Arbeláez


Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

Nov 02, 2023
Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres

Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife are crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery; however, training such models requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential in developing large-scale wildlife tracking solutions with markedly less human labor. In this work, we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction-tune vision-language models to generate detailed visual descriptions of camera trap images using terminology similar to that of experts. Then, we match the generated captions to an external knowledge base of descriptions to determine the species in a zero-shot manner. We investigate techniques for building instruction-tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia.

* 18 pages, 9 figures 
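
The final matching step described above can be pictured as ranking knowledge-base descriptions by their similarity to the generated caption. The snippet below is only a toy sketch of that idea, using TF-IDF cosine similarity over an invented three-species knowledge base; WildMatch's actual text representations and matching procedure may differ.

```python
# Toy illustration of zero-shot matching: rank knowledge-base descriptions by
# TF-IDF cosine similarity to a generated caption. Species names and
# descriptions are invented for this sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = {
    "ocelot": "Medium-sized spotted cat with rosette markings and a long ringed tail.",
    "collared peccary": "Pig-like mammal with coarse grey-black fur and a pale collar over the shoulders.",
    "agouti": "Small rodent with coarse brown fur, slender legs, and a hunched posture.",
}

def match_species(caption: str, kb: dict) -> str:
    """Return the species whose description is most similar to the caption."""
    names = list(kb)
    tfidf = TfidfVectorizer().fit_transform([kb[n] for n in names] + [caption])
    sims = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    return names[sims.argmax()]

caption = "A small rodent with coarse brown fur standing on slender legs."
print(match_species(caption, knowledge_base))  # expected to print "agouti"
```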

SEPAL: Spatial Gene Expression Prediction from Local Graphs

Sep 16, 2023
Gabriel Mejia, Paula Cárdenas, Daniela Ruiz, Angela Castillo, Pablo Arbeláez

Spatial transcriptomics is an emerging technology that aligns histopathology images with spatially resolved gene expression profiling. It holds the potential for understanding many diseases but faces significant bottlenecks, such as the need for specialized equipment and domain expertise. In this work, we present SEPAL, a new model for predicting genetic profiles from visual tissue appearance. Our method exploits the biological biases of the problem by directly supervising relative differences with respect to the mean expression, and leverages the local visual context at every coordinate to make predictions using a graph neural network. This approach closes the gap between complete locality and complete globality in current methods. In addition, we propose a novel benchmark that aims to better define the task by following current best practices in transcriptomics and restricting the prediction variables to only those with clear spatial patterns. Our extensive evaluation on two different human breast cancer datasets indicates that SEPAL outperforms previous state-of-the-art methods and other mechanisms of including spatial context.
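
As a rough illustration of the idea of combining per-spot visual features with local spatial context while supervising deviations from the mean expression, the sketch below implements a tiny mean-aggregating graph layer in plain PyTorch. The architecture, dimensions, and variable names are placeholders, not the SEPAL model.

```python
# Illustrative sketch (not the official SEPAL code): a tiny graph network that
# predicts gene-expression deviations from the per-gene mean, given per-spot
# visual features and a local spatial neighborhood graph.
import torch
import torch.nn as nn

class LocalGraphRegressor(nn.Module):
    def __init__(self, feat_dim: int, hidden: int, n_genes: int):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)
        self.mix = nn.Linear(2 * hidden, hidden)   # combines self + neighbor context
        self.head = nn.Linear(hidden, n_genes)     # predicts deltas w.r.t. mean expression

    def forward(self, x, adj):
        # x: (n_spots, feat_dim) visual features; adj: (n_spots, n_spots) 0/1 adjacency
        h = torch.relu(self.encode(x))
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ h / deg                      # mean-aggregate neighbor features
        h = torch.relu(self.mix(torch.cat([h, neigh], dim=1)))
        return self.head(h)                        # predicted deviation from mean expression

# Toy usage: 5 spots, 16-dim features, 3 genes; prediction = mean expression + delta.
x = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
mean_expression = torch.zeros(3)
model = LocalGraphRegressor(16, 32, 3)
prediction = mean_expression + model(x, adj)
print(prediction.shape)  # torch.Size([5, 3])
```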


STRIDE: Street View-based Environmental Feature Detection and Pedestrian Collision Prediction

Aug 25, 2023
Cristina González, Nicolás Ayobi, Felipe Escallón, Laura Baldovino-Chiquillo, Maria Wilches-Mogollón, Donny Pasos, Nicole Ramírez, Jose Pinzón, Olga Sarmiento, D Alex Quistberg, Pablo Arbeláez

This paper introduces a novel benchmark to study the impact and relationship of built environment elements on pedestrian collision prediction, with the aim of enhancing environmental awareness in autonomous driving systems to actively prevent pedestrian injuries. We introduce a built environment detection task in large-scale panoramic images and a detection-based pedestrian collision frequency prediction task. We propose a baseline method that incorporates a collision prediction module into a state-of-the-art detection model to tackle both tasks simultaneously. Our experiments demonstrate a significant correlation between object detection of built environment elements and pedestrian collision frequency prediction. Our results are a stepping stone towards understanding the interdependencies between built environment conditions and pedestrian safety.
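
One simple way to picture a detection-based collision frequency predictor is to pool the detected built-environment elements of a panorama into a class histogram and regress a positive collision rate from it. The sketch below shows only that generic idea; the class list, regression head, and Poisson-style link are assumptions for illustration, not the STRIDE baseline.

```python
# Hedged sketch of the general idea (not the official STRIDE model): pool
# detections of built-environment elements into a per-image histogram and
# regress a pedestrian collision rate with a Poisson-style head.
import torch
import torch.nn as nn

ELEMENT_CLASSES = ["crosswalk", "traffic_light", "sidewalk", "bus_stop"]  # hypothetical labels

class CollisionHead(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_classes, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, class_histogram):
        # Exponential link keeps the predicted collision rate positive.
        return torch.exp(self.mlp(class_histogram))

def histogram_from_detections(class_ids, n_classes):
    """Count detected instances per built-environment class for one panorama."""
    hist = torch.zeros(n_classes)
    for c in class_ids:
        hist[c] += 1
    return hist

detections = [0, 0, 2, 3]          # class indices produced by an upstream detector
hist = histogram_from_detections(detections, len(ELEMENT_CLASSES))
rate = CollisionHead(len(ELEMENT_CLASSES))(hist.unsqueeze(0))
print(rate.shape)  # torch.Size([1, 1]) predicted collision frequency
```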


Guarding the Guardians: Automated Analysis of Online Child Sexual Abuse

Aug 10, 2023
Juanita Puentes, Angela Castillo, Wilmar Osejo, Yuly Calderón, Viviana Quintero, Lina Saldarriaga, Diana Agudelo, Pablo Arbeláez

Online violence against children has recently increased globally, demanding urgent attention. Competent authorities manually analyze abuse complaints to comprehend crime dynamics and identify patterns. However, the manual analysis of these complaints presents a challenge because it exposes analysts to harmful content during the review process. Given these challenges, we present a novel solution: an automated tool designed to comprehensively analyze child sexual abuse reports. By automating the analysis process, our tool significantly reduces the risk of exposure to harmful content, categorizing the reports along three dimensions: Subject, Degree of Criminality, and Damage. Furthermore, leveraging our multidisciplinary team's expertise, we introduce a novel approach to annotating the collected data, enabling a more in-depth analysis of the reports. This approach improves the comprehension of fundamental patterns and trends, enabling law enforcement agencies and policymakers to create focused strategies in the fight against violence towards children.

* Artificial Intelligence (AI) and Humanitarian Assistance and Disaster Recovery (HADR) workshop, ICCV 2023 in Paris, France 

EgoCOL: Egocentric Camera pose estimation for Open-world 3D object Localization @Ego4D challenge 2023

Jun 29, 2023
Cristhian Forigua, Maria Escobar, Jordi Pont-Tuset, Kevis-Kokitsi Maninis, Pablo Arbeláez

We present EgoCOL, an egocentric camera pose estimation method for open-world 3D object localization. Our method leverages sparse camera pose reconstructions in a two-fold manner, computed independently from video and from scan, to estimate the camera pose of egocentric frames in 3D renders with high recall and precision. We extensively evaluate our method on the Visual Query (VQ) 3D object localization Ego4D benchmark. EgoCOL can estimate 62% and 59% more camera poses than the Ego4D baseline on the val and test sets, respectively, of the Ego4D Visual Queries 3D Localization challenge at CVPR 2023. Our code is publicly available at https://github.com/BCV-Uniandes/EgoCOL
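
A common ingredient when transferring camera poses between two independent sparse reconstructions is a similarity (Umeyama) alignment between corresponding 3D points, after which camera centers can be mapped into the scan coordinate frame. The sketch below illustrates only that standard alignment step on synthetic points; it is not the EgoCOL pipeline, and the point correspondences are assumed to be given.

```python
# Minimal Umeyama similarity alignment on synthetic data: align points from a
# video-SfM frame to a scan frame, then map a camera center across frames.
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    """Estimate scale s, rotation R, translation t such that dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)                   # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])                         # guard against reflections
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

rng = np.random.default_rng(0)
pts_video = rng.normal(size=(50, 3))                   # sparse points in the video-SfM frame
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
pts_scan = 2.0 * pts_video @ R_true.T + np.array([1.0, -2.0, 0.5])

s, R, t = umeyama(pts_video, pts_scan)
cam_center_video = np.array([0.1, 0.2, 0.3])
cam_center_scan = s * R @ cam_center_video + t         # camera center mapped to the scan frame
print(round(float(s), 3), np.round(cam_center_scan, 3))
```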


BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis

Apr 21, 2023
Angela Castillo, Maria Escobar, Guillaume Jeanneret, Albert Pumarola, Pablo Arbeláez, Ali Thabet, Artsiom Sanakoyeu

Mixed reality applications require tracking the user's full-body motion to enable an immersive experience. However, typical head-mounted devices can only track head and hand movements, leading to a limited reconstruction of full-body motion due to variability in lower body configurations. We propose BoDiffusion, a generative diffusion model for motion synthesis, to tackle this under-constrained reconstruction problem. We present a time and space conditioning scheme that allows BoDiffusion to leverage sparse tracking inputs while generating smooth and realistic full-body motion sequences. To the best of our knowledge, this is the first approach that uses the reverse diffusion process to model full-body tracking as a conditional sequence generation task. We conduct experiments on the large-scale motion-capture dataset AMASS and show that our approach outperforms state-of-the-art approaches by a significant margin in terms of full-body motion realism and joint reconstruction error.
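
To make the conditional reverse-diffusion idea concrete, the following toy sketch runs DDPM-style sampling in which the denoiser is conditioned on a sparse tracking vector. The network, the crude timestep embedding, and the pose (66) and conditioning (18) dimensionalities are invented for illustration and do not reflect BoDiffusion's transformer backbone or its time and space conditioning scheme.

```python
# Hypothetical conditional DDPM sampler: denoise a full-body pose vector while
# conditioning on sparse head/hand tracking signals. Dimensions are placeholders.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    def __init__(self, pose_dim=66, cond_dim=18, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x_t, t, cond):
        t_embed = t.float().unsqueeze(-1) / T                      # crude timestep embedding
        return self.net(torch.cat([x_t, cond, t_embed], dim=-1))   # predicted noise

@torch.no_grad()
def sample(model, cond, pose_dim=66):
    x = torch.randn(cond.shape[0], pose_dim)                       # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((cond.shape[0],), t)
        eps = model(x, t_batch, cond)
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

model = Denoiser()
sparse_tracking = torch.randn(4, 18)   # e.g. head + two hand 6-DoF signals (hypothetical)
full_body = sample(model, sparse_tracking)
print(full_body.shape)  # torch.Size([4, 66])
```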


JoB-VS: Joint Brain-Vessel Segmentation in TOF-MRA Images

Apr 16, 2023
Natalia Valderrama, Ioannis Pitsiorlas, Luisa Vargas, Pablo Arbeláez, Maria A. Zuluaga

We propose the first joint-task learning framework for brain and vessel segmentation (JoB-VS) from Time-of-Flight Magnetic Resonance Angiography (TOF-MRA) images. Unlike state-of-the-art vessel segmentation methods, our approach avoids the pre-processing step of implementing a model to extract the brain from the volumetric input data. Skipping this additional step makes our method an end-to-end vessel segmentation framework. JoB-VS uses a lattice architecture that favors the segmentation of structures of different scales (e.g., the brain and vessels). Its segmentation head allows the simultaneous prediction of the brain and vessel masks. Moreover, we augment the training data with adversarial examples, which our results show enhances performance. JoB-VS achieves 70.03% mean AP and a 69.09% F1-score on the OASIS-3 dataset and is capable of generalizing its segmentation to the IXI dataset. These results show the adequacy of JoB-VS for the challenging task of vessel segmentation in complete TOF-MRA images.
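
The two ideas highlighted above, a single head that predicts brain and vessel masks jointly and adversarial examples used as data augmentation, can be sketched in a few lines. The toy 3D network and FGSM-style perturbation below are illustrative assumptions, not the JoB-VS lattice architecture or its adversarial augmentation scheme.

```python
# Hedged sketch: a tiny 3D network whose joint head outputs brain and vessel
# masks at once, plus an FGSM-style adversarial copy of the input volume.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyJointSeg(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv3d(8, 8, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv3d(8, 2, 1)   # channel 0: brain mask, channel 1: vessel mask

    def forward(self, x):
        return self.head(self.body(x))   # per-voxel logits for both structures

def fgsm_augment(model, volume, target, eps=0.01):
    """Create an adversarially perturbed copy of the input volume."""
    volume = volume.clone().requires_grad_(True)
    loss = F.binary_cross_entropy_with_logits(model(volume), target)
    loss.backward()
    return (volume + eps * volume.grad.sign()).detach()

model = TinyJointSeg()
vol = torch.randn(1, 1, 16, 16, 16)                       # toy TOF-MRA patch
target = torch.randint(0, 2, (1, 2, 16, 16, 16)).float()  # brain + vessel masks
adv_vol = fgsm_augment(model, vol, target)
loss = F.binary_cross_entropy_with_logits(model(adv_vol), target)
print(loss.item())
```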


MATIS: Masked-Attention Transformers for Surgical Instrument Segmentation

Mar 19, 2023
Nicolás Ayobi, Alejandra Pérez-Rondón, Santiago Rodríguez, Pablo Arbeláez

We propose Masked-Attention Transformers for Surgical Instrument Segmentation (MATIS), a two-stage, fully transformer-based method that leverages modern pixel-wise attention mechanisms for instrument segmentation. MATIS exploits the instance-level nature of the task by employing a masked attention module that generates and classifies a set of fine instrument region proposals. Our method incorporates long-term video-level information through video transformers to improve temporal consistency and enhance mask classification. We validate our approach on the two standard public benchmarks, Endovis 2017 and Endovis 2018. Our experiments demonstrate that MATIS' per-frame baseline outperforms previous state-of-the-art methods and that including our temporal consistency module further boosts our model's performance.
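
The masked attention referenced above restricts each query's attention to the foreground of its current mask prediction, in the spirit of Mask2Former-style decoders. The snippet below is a minimal stand-alone sketch of that operation on random tensors; it is not taken from the MATIS implementation, and the shapes are arbitrary.

```python
# Toy masked attention: query-to-pixel attention is confined to each query's
# predicted foreground region before the softmax.
import torch
import torch.nn.functional as F

def masked_attention(queries, pixel_feats, masks):
    """
    queries:     (Q, d)  instrument region queries
    pixel_feats: (N, d)  flattened per-pixel features
    masks:       (Q, N)  boolean foreground prediction per query
    """
    d = queries.shape[-1]
    logits = queries @ pixel_feats.T / d ** 0.5          # (Q, N) attention scores
    logits = logits.masked_fill(~masks, float("-inf"))   # attend only inside the mask
    attn = F.softmax(logits, dim=-1)
    return attn @ pixel_feats                            # (Q, d) updated query features

Q, N, d = 4, 64, 32
queries = torch.randn(Q, d)
pixels = torch.randn(N, d)
masks = torch.rand(Q, N) > 0.5
masks[:, 0] = True                 # ensure every query attends to at least one pixel
updated = masked_attention(queries, pixels, masks)
print(updated.shape)  # torch.Size([4, 32])
```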


Towards Holistic Surgical Scene Understanding

Dec 13, 2022
Natalia Valderrama, Paola Ruiz Puentes, Isabela Hernández, Nicolás Ayobi, Mathilde Verlyk, Jessica Santander, Juan Caicedo, Nicolás Fernández, Pablo Arbeláez

Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and Steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations, as it benefits from the representation learned on the instrument detection task to improve its classification capacity. Our experimental results on both PSI-AVA and other publicly available databases demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.

* MICCAI 2022 Oral 
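
A minimal way to picture how multi-level annotations can be consumed by a single model is a shared representation feeding one classification head per task (phase, step, instrument, atomic action). The sketch below uses a linear backbone and placeholder class counts purely for illustration; TAPIR itself builds on video transformers and an instrument detector.

```python
# Hypothetical multi-task sketch: one shared representation, four task heads.
# Class counts and feature dimensions are placeholders, not PSI-AVA statistics.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, feat_dim=256, n_phases=10, n_steps=20, n_instruments=7, n_actions=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.heads = nn.ModuleDict({
            "phase": nn.Linear(256, n_phases),
            "step": nn.Linear(256, n_steps),
            "instrument": nn.Linear(256, n_instruments),
            "action": nn.Linear(256, n_actions),
        })

    def forward(self, video_feats):
        h = self.backbone(video_feats)
        return {task: head(h) for task, head in self.heads.items()}

feats = torch.randn(2, 256)          # e.g. pooled video-transformer features
outputs = MultiTaskHeads()(feats)
print({k: v.shape for k, v in outputs.items()})
```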