Abstract:Jaguar re-identification (re-ID) from citizen-science imagery can look strong on standard retrieval metrics while still relying on the wrong evidence, such as background context or silhouette shape, instead of the coat pattern that defines identity. We introduce a diagnostic framework for wildlife re-ID with two axes: a leakage-controlled context ratio, background/foreground, computed from inpainted background-only versus foreground-only images, and a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. To make these diagnostics measurable, we curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced evaluation protocol. We then use representative mitigation families, ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings, as case studies under the same evaluation lens. The goal is not only to ask which model ranks best, but also what visual evidence it uses to do so.




Abstract:Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species