Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image-difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a training-free, visual-evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at https://github.com/Johanson-colab/SPD-Faith-Bench.
Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating \textbf{34 models} -- 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs -- across \textbf{8 standard datasets} (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1{,}100 test images per model. Our key finding is striking: \emph{zero-shot VLMs significantly outperform most specialized models}, achieving an average MAE of 5.65 years compared to 9.88 for non-LLM models. The best VLM (Gemini~3 Flash Preview, MAE~4.32) outperforms the best non-LLM model (MiVOLO, MAE~5.10) by 15\%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60--100\% false adult rates on minors while VLMs achieve 13--25\%, and demonstrate that coarse age binning (8--9 classes) consistently degrades MAE to beyond 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages ($<$5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect its effort toward distilling VLM capabilities into efficient specialized models.
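To make the headline metrics concrete, here is a minimal sketch (not the authors' evaluation code) of the two quantities compared above: mean absolute error in years and the false adult rate at the 18-year verification threshold. The function and variable names are illustrative.

```python
import numpy as np

def mae(pred_ages, true_ages):
    """Mean absolute error (years) between predicted and ground-truth ages."""
    return float(np.mean(np.abs(np.asarray(pred_ages, float) - np.asarray(true_ages, float))))

def false_adult_rate(pred_ages, true_ages, threshold=18):
    """Fraction of true minors (age < threshold) that a model labels as adults (>= threshold)."""
    pred = np.asarray(pred_ages, float)
    true = np.asarray(true_ages, float)
    minors = true < threshold
    if not minors.any():
        return float("nan")                      # no minors in this split
    return float(np.mean(pred[minors] >= threshold))

# Illustrative usage with made-up predictions (not benchmark data)
pred = [21.0, 17.0, 30.0, 19.0]
true = [16.0, 17.0, 28.0, 22.0]
print(mae(pred, true), false_adult_rate(pred, true))   # 2.5, 0.5
```

Under this definition, the 60--100\% versus 13--25\% gap reported above is the fraction of true minors that each model family pushes over the 18-year line.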
Vision-Language Models (VLMs) are now a core part of modern AI. Recent work proposed several visual jailbreak attacks using single, holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after the fragments are combined. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to architectural and prior mismatches across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in current VLM safety alignment.
Spatial intelligence is crucial for vision--language models (VLMs) operating in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.
Visual Language Models (VLMs) are often used for zero-shot detection of visual attributes in images. We present a zero-shot evaluation of open-source VLMs for privacy-related attribute recognition. We identify the attributes for which VLMs exhibit strong inter-annotator agreement, and discuss cases where human and VLM annotations disagree. Our results show that, when evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more often than human annotators do. In addition, we find that when VLMs show high agreement with one another, they can complement human annotation by identifying attributes overlooked by human annotators. This highlights the potential of VLMs to support privacy annotations in large-scale image datasets.
Camera pose estimation from sparse correspondences is a fundamental problem in geometric computer vision and remains particularly challenging in near-field scenarios, where strong perspective effects and heterogeneous measurement noise can significantly degrade the stability of analytic PnP solutions. In this paper, we present a geometric error propagation framework for camera pose estimation based on a parallel perspective approximation. By explicitly modeling how image measurement errors propagate through perspective geometry, we derive an error transfer model that characterizes the relationship between feature point distribution, camera depth, and pose estimation uncertainty. Building on this analysis, we develop a pose estimation method that leverages parallel perspective initialization and error-aware weighting within a Gauss-Newton optimization scheme, leading to improved robustness in proximity operations. Extensive experiments on both synthetic data and real-world images, covering diverse conditions such as strong illumination, surgical lighting, and underwater low-light environments, demonstrate that the proposed approach achieves accuracy and robustness comparable to state-of-the-art analytic and iterative PnP methods, while maintaining high computational efficiency. These results highlight the importance of explicit geometric error modeling for reliable camera pose estimation in challenging near-field settings.
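As an illustration of the error-aware weighted Gauss-Newton refinement described above, the following is a minimal, self-contained sketch. The 6-DoF parameterization, numerical Jacobian, and per-point `weights` (assumed to be supplied, e.g., as inverse measurement variances) are generic placeholders; the paper's parallel perspective initialization and error transfer model are not reproduced here, only the refinement loop they would feed into.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Pinhole projection of Nx3 world points with rotation R, translation t, intrinsics K."""
    cam = points_3d @ R.T + t           # points in the camera frame
    uv = cam[:, :2] / cam[:, 2:3]       # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]  # apply focal lengths and principal point

def exp_so3(w):
    """Rodrigues formula: axis-angle vector w -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K_hat = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K_hat + (1 - np.cos(theta)) * K_hat @ K_hat

def refine_pose(points_3d, points_2d, K, R0, t0, weights, iters=10):
    """Weighted Gauss-Newton refinement of an initial pose (R0, t0).

    `weights` (one value per correspondence) down-weights observations with larger
    expected measurement error; (R0, t0) stands in for a coarse initial solution.
    """
    R, t = R0.copy(), t0.copy()
    sqrt_w = np.sqrt(np.repeat(np.asarray(weights, float), 2))

    def residual(R_cur, t_cur, delta):
        R_new = exp_so3(delta[:3]) @ R_cur
        t_new = t_cur + delta[3:]
        return sqrt_w * (project(points_3d, R_new, t_new, K) - points_2d).ravel()

    for _ in range(iters):
        r0 = residual(R, t, np.zeros(6))
        # Numerical Jacobian of the weighted residual w.r.t. the 6-DoF update
        J = np.zeros((r0.size, 6))
        eps = 1e-6
        for j in range(6):
            d = np.zeros(6)
            d[j] = eps
            J[:, j] = (residual(R, t, d) - r0) / eps
        delta = np.linalg.lstsq(J, -r0, rcond=None)[0]   # Gauss-Newton step
        R, t = exp_so3(delta[:3]) @ R, t + delta[3:]
        if np.linalg.norm(delta) < 1e-10:
            break
    return R, t
```

In this sketch, the weights are the natural place for an error-transfer model to enter: correspondences whose projected uncertainty is predicted to be large contribute less to the normal equations.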
Image-guided robotic navigation systems often rely on reference-based geometric perception pipelines, where accurate spatial mapping is established through multi-stage estimation processes. In biplanar X-ray guided navigation, such pipelines are widely used due to their real-time capability and geometric interpretability. However, navigation reliability can be constrained by an overlooked system-level failure mechanism in which installation-induced structural perturbations introduced at the perception stage are progressively amplified along the perception-reconstruction-execution chain and dominate execution-level error and tail-risk behavior. This paper investigates this mechanism from a system-level perspective and presents a unified error-propagation modeling framework that characterizes how installation-induced structural perturbations propagate and couple with pixel-level observation noise through biplanar imaging, projection-matrix estimation, triangulation, and coordinate mapping. Using first-order analytic uncertainty propagation and Monte Carlo simulations, we analyze dominant sensitivity channels and quantify worst-case error behavior beyond mean accuracy metrics. The results show that rotational installation error is a primary driver of system-level error amplification, while translational misalignment of comparable magnitude plays a secondary role under typical biplanar geometries. Real biplanar X-ray bench-top experiments further confirm that the predicted amplification trends persist under realistic imaging conditions. These findings reveal a broader structural limitation of reference-based multi-stage geometric perception pipelines and provide a framework for system-level reliability analysis and risk-aware design in safety-critical robotic navigation systems.
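As a toy illustration of the Monte Carlo side of this analysis, the sketch below perturbs the second view of a nominal biplanar geometry by a small installation rotation and translation, adds pixel noise, triangulates a single target point with the unperturbed (nominal) projection matrices, and reports the mean and a 99th-percentile tail error. The geometry, noise magnitudes, and single-point setup are hypothetical and only illustrate how installation perturbations can be propagated to execution-relevant 3D error.

```python
import numpy as np

def small_rotation(rng, sigma_deg):
    """Random small rotation from axis-angle noise with per-axis std sigma_deg (degrees)."""
    w = rng.normal(0.0, np.deg2rad(sigma_deg), 3)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K_hat = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K_hat + (1 - np.cos(theta)) * K_hat @ K_hat

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two 3x4 projection matrices."""
    A = np.vstack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def monte_carlo_error(P1, P2, X_true, n=5000, rot_deg=0.5, trans_mm=0.5, pix_sigma=0.5, seed=0):
    """Propagate an installation perturbation of view 2 plus pixel noise to 3D error (mm)."""
    rng = np.random.default_rng(seed)
    x1 = P1 @ np.append(X_true, 1.0)
    x1 = x1[:2] / x1[2]
    errors = np.empty(n)
    for i in range(n):
        # The actual geometry of view 2 deviates from its nominal calibration by a small
        # rotation/translation, but the reconstruction still uses the nominal P2.
        T = np.eye(4)
        T[:3, :3] = small_rotation(rng, rot_deg)
        T[:3, 3] = rng.normal(0.0, trans_mm, 3)
        x2 = (P2 @ T) @ np.append(X_true, 1.0)
        x2 = x2[:2] / x2[2]
        X_hat = triangulate(P1, P2,
                            x1 + rng.normal(0.0, pix_sigma, 2),
                            x2 + rng.normal(0.0, pix_sigma, 2))
        errors[i] = np.linalg.norm(X_hat - X_true)
    return errors.mean(), np.quantile(errors, 0.99)   # mean and tail error

# Illustrative usage with a toy orthogonal biplanar geometry (units: mm, pixels)
K = np.array([[2000.0, 0, 512], [0, 2000.0, 512], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), [[0.0], [0.0], [1000.0]]])
R90 = np.array([[0.0, 0, 1], [0, 1, 0], [-1, 0, 0]])
P2 = K @ np.hstack([R90, [[0.0], [0.0], [1000.0]]])
print(monte_carlo_error(P1, P2, X_true=np.array([10.0, -5.0, 20.0])))
```

Sweeping `rot_deg` and `trans_mm` separately in such a simulation is one simple way to compare rotational versus translational sensitivity channels, as the abstract discusses.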
Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
Simultaneous multi-slice (SMS) imaging with in-plane undersampling enables highly accelerated MRI but yields a strongly coupled inverse problem with deterministic inter-slice interference and missing k-space data. Most diffusion-based reconstructions are formulated around Gaussian-noise corruption and rely on additional consistency steps to incorporate SMS physics, which can be mismatched to the operator-governed degradations in SMS acquisition. We propose an operator-guided framework that models the degradation trajectory using known acquisition operators and inverts this process via deterministic updates. Within this framework, we introduce an operator-conditional dual-stream interaction network (OCDI-Net) that explicitly disentangles target-slice content from inter-slice interference and predicts structured degradations for operator-aligned inversion, and we instantiate reconstruction as a two-stage chained inference procedure that performs SMS slice separation followed by in-plane completion. Experiments on fastMRI brain data and prospectively acquired in vivo diffusion MRI data demonstrate improved fidelity and reduced slice leakage over conventional and learning-based SMS reconstructions.
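For readers unfamiliar with the acquisition physics, the following is a simplified single-coil sketch of the operator-governed degradation referred to above: simultaneously excited slices are phase-modulated in k-space (a CAIPI-style field-of-view shift), summed into one aliased measurement, and undersampled in-plane. Coil sensitivities and the paper's actual operator conditioning and inversion are omitted; the function and parameter names are illustrative.

```python
import numpy as np

def sms_forward(slices, mask, caipi_shifts):
    """Simplified single-coil SMS forward operator.

    slices:       (S, H, W) complex images of the simultaneously excited slices
    mask:         (H, W) binary in-plane undersampling mask in k-space
    caipi_shifts: per-slice field-of-view shift fraction along the phase-encode axis

    Returns the undersampled, slice-collapsed k-space measurement (H, W).
    """
    S, H, W = slices.shape
    ky = np.fft.fftfreq(H)[:, None]                 # normalized phase-encode frequencies
    k_collapsed = np.zeros((H, W), dtype=complex)
    for s in range(S):
        # CAIPI-style encoding: a linear phase ramp in k-space shifts slice s in image
        # space, which is what makes the collapsed slices separable downstream.
        phase = np.exp(-2j * np.pi * caipi_shifts[s] * H * ky)
        k_collapsed += np.fft.fft2(slices[s]) * phase
    return k_collapsed * mask

# Illustrative usage with random data (not real MRI)
rng = np.random.default_rng(0)
slices = rng.standard_normal((3, 64, 64)) + 1j * rng.standard_normal((3, 64, 64))
mask = (rng.random((64, 64)) < 0.5).astype(float)   # ~2x in-plane undersampling
y = sms_forward(slices, mask, caipi_shifts=[0.0, 1 / 3, 2 / 3])
```

The degradation here is fully deterministic and operator-defined (slice summation plus masking) rather than additive Gaussian noise, which is the mismatch with noise-centric diffusion formulations that the abstract highlights.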
Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
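As a rough sketch of the mechanism (my reading, not the paper's exact formulation): standard RoPE rotates feature pairs by angles whose frequencies decay geometrically, and the lowest-frequency pairs barely rotate within a clip, so overwriting their phases with camera-derived values perturbs the original positional signal very little. The function names, the number of replaced bands, and the assumption that `cam_phases` is provided externally are all illustrative choices.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: theta_{p,i} = p * base^(-2i/dim), ordered high -> low frequency."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)             # (num_pos, dim // 2)

def rope_rotate(x, angles):
    """Rotate consecutive feature pairs of x (num_pos, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_with_camera(x, positions, cam_phases, num_low_bands=4, base=10000.0):
    """Hypothetical injection of camera-derived phases into RoPE's low-frequency bands.

    cam_phases: (num_pos, num_low_bands) angles derived from relative camera pose;
    how they are computed from the camera geometry is the method's contribution and
    is treated as an external input here.
    """
    angles = rope_angles(positions, x.shape[1], base)
    # The last bands have the smallest frequencies and barely rotate within a clip,
    # so overwriting them sacrifices little of the original positional signal.
    angles[:, -num_low_bands:] = cam_phases
    return rope_rotate(x, angles)

# Illustrative usage with random features and made-up camera phases
x = np.random.randn(8, 64)                           # 8 tokens, one 64-dim attention head
cam_phases = np.random.uniform(-np.pi, np.pi, (8, 4))
out = rope_with_camera(x, positions=np.arange(8), cam_phases=cam_phases)
```

Because the rotation acts identically on queries and keys, phases defined per view pair remain relative rather than anchored to a fixed reference frame, which is the shift-invariance property the abstract emphasizes.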