Hakan Bilen

PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Sep 04, 2025

Jianning Deng, Kartic Subr, Hakan Bilen

Abstract: We present a novel self-supervised framework for learning articulated object representations from sparse-view, unposed images. Unlike prior methods that require dense multi-view observations and ground-truth camera poses, our approach operates with as few as four views per articulation and no camera supervision. To address the inherent challenges, we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we jointly optimize geometry, appearance, and kinematics with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.

Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence

Jun 09, 2025

Octave Mariotti, Zhipeng Du, Yash Bhalgat, Oisin Mac Aodha, Hakan Bilen

Abstract: Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments demonstrate not only that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but also that unsupervised baselines outperform supervised counterparts when generalized across different datasets.
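
To make the "lifting 2D keypoints with monocular depth" step more concrete, here is a minimal sketch of back-projecting pixel keypoints into 3D with a pinhole camera model. The function name `lift_keypoints`, the single focal length, and the principal point are assumptions for illustration and are not taken from the paper, which additionally maps points into a canonical space without camera annotations; that step is not shown.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def lift_keypoints(
    keypoints_px: Sequence[Tuple[float, float]],
    depths: Sequence[float],
    focal: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project pixel keypoints (u, v) to camera-frame 3D points.

    Assumes a pinhole camera with one focal length (in pixels) and a
    principal point (cx, cy). These intrinsics are hypothetical inputs.
    """
    if len(keypoints_px) != len(depths):
        raise ValueError("need exactly one depth value per keypoint")
    points = []
    for (u, v), z in zip(keypoints_px, depths):
        # Standard pinhole back-projection: scale the centred pixel
        # offset by depth over focal length.
        x = (u - cx) * z / focal
        y = (v - cy) * z / focal
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    pts = lift_keypoints([(320.0, 240.0), (420.0, 240.0)], [2.0, 2.0],
                         focal=500.0, cx=320.0, cy=240.0)
    print(pts)
```

A keypoint at the principal point lands on the optical axis, and a keypoint offset horizontally moves along x in proportion to its estimated depth.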

DD-Ranking: Rethinking the Evaluation of Dataset Distillation

May 19, 2025

Zekai Li, Xinhao Zhong, Samir Khaki, Zhiyuan Liang, Yuhao Zhou, Mingjia Shi, Ziqiao Wang, Xuanlei Zhao, Wangbo Zhao, Ziheng Qin(+42 more)

Abstract: In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of the images themselves, with even randomly sampled images achieving superior results. Such misaligned evaluation settings severely hinder the development of dataset distillation (DD). Therefore, we propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.

* 20 pages, 4 figures

Visually Interpretable Subtask Reasoning for Visual Question Answering

May 12, 2025

Yu Cheng, Arushi Goel, Hakan Bilen

Abstract: Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.

HumMorph: Generalized Dynamic Human Neural Fields from Few Views

Apr 27, 2025

Jakub Zadrożny, Hakan Bilen

Abstract: We introduce HumMorph, a novel generalized approach to free-viewpoint rendering of dynamic human bodies with explicit pose control. HumMorph renders a human actor in any specified pose given a few observed views (starting from just one) in arbitrary poses. Our method enables fast inference as it relies only on feed-forward passes through the model. We first construct a coarse representation of the actor in the canonical T-pose, which combines visual features from individual partial observations and fills missing information using learned prior knowledge. The coarse representation is complemented by fine-grained pixel-aligned features extracted directly from the observed views, which provide high-resolution appearance information. We show that HumMorph is competitive with the state of the art when only a single input view is available; however, we achieve results with significantly better visual quality given just 2 monocular observations. Moreover, previous generalized methods assume access to accurate body shape and pose parameters obtained using synchronized multi-camera setups. In contrast, we consider a more practical scenario where these body parameters are noisily estimated directly from the observed views. Our experimental results demonstrate that our architecture is more robust to errors in the noisy parameters and clearly outperforms the state of the art in this setting.

* Project page: https://jakubzadrozny.github.io/hummorph

Multiple Instance Learning with Coarse-to-Fine Self-Distillation

Feb 04, 2025

Shuyang Wu, Yifu Qiu, Ines P. Nearchou, Sandrine Prost, Jonathan A. Fallowfield, Hakan Bilen, Timothy J. Kendall

Abstract: Multiple Instance Learning (MIL) for whole slide image (WSI) analysis in computational pathology often neglects instance-level learning, as supervision is typically provided only at the bag level. In this work, we present PathMIL, a framework designed to improve MIL from two perspectives: (1) employing instance-level supervision and (2) learning inter-instance contextual information at the bag level. First, we propose a novel Coarse-to-Fine Self-Distillation (CFSD) paradigm to probe and distil a classifier trained with bag-level information to obtain instance-level labels, which can in turn supervise the same classifier at a finer granularity. Second, to capture inter-instance contextual information in WSI, we propose Two-Dimensional Positional Encoding (2DPE), which encodes the spatial appearance of instances within a bag. We also theoretically and empirically prove the instance-level learnability of CFSD. PathMIL is evaluated on multiple benchmarking tasks, including subtype classification (TCGA-NSCLC), tumour classification (CAMELYON16), and an internal benchmark for breast cancer receptor status classification. Our method achieves state-of-the-art performance, with AUC scores of 0.9152 and 0.8524 for estrogen and progesterone receptor status classification, respectively, an AUC of 0.9618 for subtype classification, and 0.8634 for tumour classification, surpassing existing methods.
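
The abstract does not spell out how 2DPE is computed. One common way to encode an instance's grid position is a 2D sinusoidal encoding, sketched below purely as an assumption: half of the channels encode the row and half encode the column. The names `sincos_1d` and `positional_encoding_2d`, the frequency base of 10000, and the channel layout are illustrative choices and may differ from the paper's formulation.

```python
import math
from typing import List


def sincos_1d(pos: float, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one coordinate into `dim` channels (dim even)."""
    out = []
    for i in range(dim // 2):
        # Geometrically spaced frequencies, as in transformer encodings.
        freq = 1.0 / (base ** (2 * i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def positional_encoding_2d(row: int, col: int, dim: int) -> List[float]:
    """Concatenate row and column encodings into a `dim`-channel vector."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return sincos_1d(row, half) + sincos_1d(col, half)


if __name__ == "__main__":
    # Encode the patch at grid position (row=3, col=5) with 8 channels.
    print(positional_encoding_2d(3, 5, 8))
```

In a MIL pipeline, a vector like this would typically be added to or concatenated with each patch feature before aggregation, so that the aggregator can tell neighbouring patches from distant ones.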

Spatially-Adaptive Hash Encodings For Neural Surface Reconstruction

Dec 06, 2024

Thomas Walker, Octave Mariotti, Amir Vaxman, Hakan Bilen

Abstract: Positional encodings are a common component of neural scene reconstruction methods, and provide a way to bias the learning of neural fields towards coarser or finer representations. Current neural surface reconstruction methods use a "one-size-fits-all" approach to encoding, choosing a fixed set of encoding functions, and therefore bias, across all scenes. Current state-of-the-art surface reconstruction approaches leverage grid-based multi-resolution hash encoding in order to recover high-detail geometry. We propose a learned approach which allows the network to choose its encoding basis as a function of space, by masking the contribution of features stored at separate grid resolutions. The resulting spatially adaptive approach allows the network to fit a wider range of frequencies without introducing noise. We test our approach on standard benchmark surface reconstruction datasets and achieve state-of-the-art performance on two benchmark datasets.
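
As a rough illustration of masking the contribution of features stored at separate grid resolutions, the sketch below scales each level's feature vector by a weight in [0, 1] before concatenating the levels. The function name `masked_encoding` and the hand-supplied mask are assumptions for illustration. In the paper the mask is learned as a function of position, and the hash-grid lookup that produces the per-level features is omitted here.

```python
from typing import List, Sequence


def masked_encoding(
    level_features: Sequence[Sequence[float]],
    level_mask: Sequence[float],
) -> List[float]:
    """Weight each resolution level's features by its mask, then concatenate.

    level_features[l] holds the features looked up at grid level l for one
    query point; level_mask[l] in [0, 1] says how much that level contributes.
    """
    if len(level_features) != len(level_mask):
        raise ValueError("need exactly one mask weight per resolution level")
    encoded: List[float] = []
    for feats, weight in zip(level_features, level_mask):
        if not 0.0 <= weight <= 1.0:
            raise ValueError("mask weights must lie in [0, 1]")
        # A weight of 0 removes this level's frequencies at this location.
        encoded.extend(weight * f for f in feats)
    return encoded


if __name__ == "__main__":
    coarse, fine = [1.0, 2.0], [3.0, 4.0]
    # Smooth region: keep the coarse level and suppress the fine one.
    print(masked_encoding([coarse, fine], [1.0, 0.0]))
```

Because the output keeps a fixed length regardless of the mask, a downstream network can consume it unchanged while the mask varies from point to point.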

DepthCues: Evaluating Monocular Depth Perception in Large Vision Models

Nov 26, 2024

Duolikun Danier, Mehmet Aygün, Changjian Li, Hakan Bilen, Oisin Mac Aodha

Abstract: Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent, larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.

* Website: https://danier97.github.io/depthcues/

InstanSeg: an embedding-based instance segmentation algorithm optimized for accurate, efficient and portable cell segmentation

Aug 28, 2024

Thibaut Goldsborough, Ben Philps, Alan O'Callaghan, Fiona Inglis, Leo Leplat, Andrew Filby, Hakan Bilen, Peter Bankhead

Abstract: Cell and nucleus segmentation are fundamental tasks for quantitative bioimage analysis. Despite progress in recent years, biologists and other domain experts still require novel algorithms to handle increasingly large and complex real-world datasets. These algorithms must not only achieve state-of-the-art accuracy, but also be optimized for efficiency, portability and user-friendliness. Here, we introduce InstanSeg: a novel embedding-based instance segmentation pipeline designed to identify cells and nuclei in microscopy images. Using six public cell segmentation datasets, we demonstrate that InstanSeg can significantly improve accuracy when compared to the most widely used alternative methods, while reducing the processing time by at least 60%. Furthermore, InstanSeg is designed to be fully serializable as TorchScript and supports GPU acceleration on a range of hardware. We provide an open-source implementation of InstanSeg in Python, in addition to a user-friendly, interactive QuPath extension for inference written in Java. Our code and pre-trained models are available at https://github.com/instanseg/instanseg.

* 12 pages, 6 figures

Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Jun 28, 2024

Ankan Bhunia, Changjian Li, Hakan Bilen

Abstract: This paper introduces a novel anomaly detection (AD) problem that focuses on identifying "odd-looking" objects relative to the other instances within a scene. Unlike in traditional AD benchmarks, anomalies in our setting are scene-specific, defined by the regular instances that make up the majority. Since object instances are often only partly visible from a single viewpoint, our setting provides multiple views of each scene as input. To provide a testbed for future research in this task, we introduce two benchmarks, ToysAD-8K and PartsAD-15K. We propose a novel method that generates 3D object-centric representations for each instance and detects the anomalous ones through a cross-examination between the instances. We rigorously analyze our method quantitatively and qualitatively on the presented benchmarks.

* Codes & Dataset at https://github.com/VICO-UoE/OddOneOutAD
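
To illustrate the idea of scene-specific anomalies defined relative to the majority, the following sketch scores each instance by its mean Euclidean distance to every other instance in the same scene and flags the highest-scoring one. This is a plain distance baseline written for illustration, not the paper's method, which builds learned 3D object-centric representations and cross-examines instances. The function name `odd_one_out_scores` and the toy 2D embeddings are assumptions.

```python
import math
from typing import List, Sequence


def odd_one_out_scores(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Score each instance by its mean distance to all other instances.

    A high score means the instance looks unlike the rest of the scene.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two instances to compare")
    scores = []
    for i, a in enumerate(embeddings):
        total = sum(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        scores.append(total / (n - 1))
    return scores


if __name__ == "__main__":
    # Three similar instances and one outlier (hypothetical embeddings).
    scene = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    scores = odd_one_out_scores(scene)
    print("most anomalous instance:", scores.index(max(scores)))
```

Because scores are computed within a single scene, the same object could be normal in one scene and anomalous in another, which is the property the benchmark setting emphasizes.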
