Head pose estimation is the task of inferring the orientation of a person's head from images or videos.
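As a minimal, framework-agnostic sketch of what such systems output: orientation is typically reported as yaw, pitch, and roll angles recovered from a 3x3 rotation matrix. The convention below (aerospace Z-Y-X) is one common choice among several; axis conventions vary across papers.

```python
import numpy as np

def rotation_to_euler(R: np.ndarray):
    """Recover (yaw, pitch, roll) in degrees from a 3x3 rotation matrix.

    Assumes the Z-Y-X (yaw-pitch-roll) convention; head pose papers
    differ on axis conventions, so treat this as one illustrative
    choice, not a universal standard.
    """
    # Clip guards against numerical drift outside [-1, 1] before arcsin.
    pitch = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return tuple(np.degrees([yaw, pitch, roll]))

# Example: a pure 30-degree yaw rotation about the vertical axis.
t = np.radians(30.0)
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
print(rotation_to_euler(R))  # ~ (30.0, 0.0, 0.0)
```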
Cervical dystonia (CD) is the most common form of dystonia, yet current assessment relies on subjective clinical rating scales, such as the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which require expertise and show low inter-rater reliability on some items. To address the lack of established objective tools for monitoring disease severity and treatment response, this study validates an automated image-based head pose and shift estimation system for patients with CD. We developed an assessment tool that combines a pretrained head-pose estimation algorithm for rotational symptoms with a deep learning model trained exclusively on ~16,000 synthetic avatar images to evaluate rare translational symptoms, specifically lateral shift. This synthetic data approach overcomes the scarcity of clinical training examples. The system's performance was validated in a multicenter study by comparing its predicted scores against the consensus ratings of 20 clinical experts on a dataset of 100 real patient images and 100 labeled synthetic avatars. The automated system demonstrated strong agreement with expert clinical ratings for rotational symptoms, achieving high correlations for torticollis (r=0.91), laterocollis (r=0.81), and anteroretrocollis (r=0.78). For lateral shift, the tool achieved a moderate correlation (r=0.55) with clinical ratings and demonstrated higher accuracy than human raters in controlled benchmark tests on avatars. By leveraging synthetic training data to bridge the clinical data gap, this model generalizes to real-world patients, providing a validated, objective tool for CD postural assessment that can enable standardized clinical decision-making and trial evaluation.
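The reported agreement figures are Pearson's r between automated scores and expert consensus. A minimal sketch of that computation, with hypothetical toy ratings standing in for the study's data:

```python
import numpy as np

# Hypothetical example: agreement between automated severity scores and
# expert consensus ratings, quantified by Pearson's r as in the study.
auto_scores = np.array([0.0, 1.0, 2.0, 2.5, 3.0, 4.0])  # model output
expert_mean = np.array([0.5, 1.0, 1.5, 3.0, 3.0, 4.0])  # consensus rating

r = np.corrcoef(auto_scores, expert_mean)[0, 1]
print(f"Pearson r = {r:.2f}")
```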
Motion transfer from the driving to the source portrait remains a key challenge in portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free, geometry-driven motion correspondence computed directly from parametric 3D head models. To integrate this 3D prior into the diffusion model, we introduce a 3D flow encoding that queries potential 3D flows for each target pixel, indicating its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D point for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method in consistent driving motion transfer as well as faithful source identity preservation.
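A rough sketch of the geometric idea, under simplifying assumptions: given head-model vertices posed for the source and driving frames (correspondence comes for free from the shared mesh topology), projecting both sets yields a learning-free 2D displacement field. The depth-guided per-pixel selection step is omitted here, and all names are illustrative:

```python
import numpy as np

def project(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pinhole projection of Nx3 camera-space points to Nx2 pixels."""
    uv = (K @ points_3d.T).T
    return uv[:, :2] / uv[:, 2:3]

def flow_from_head_model(verts_src, verts_drv, K):
    """Geometry-driven 2D flow at driving-frame vertex locations.

    verts_src / verts_drv: Nx3 vertices of a parametric head model
    posed for the source and driving frames. The flow points from each
    driving-frame pixel back to its source location, mirroring the
    'displacement back to the source' idea in the abstract.
    """
    uv_src = project(verts_src, K)
    uv_drv = project(verts_drv, K)
    return uv_drv, uv_src - uv_drv  # anchor pixels and their flow vectors

# Toy usage with illustrative data: two corresponding vertices.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
v_src = np.array([[0.00, 0.0, 1.0], [0.05, 0.0, 1.0]])
v_drv = np.array([[0.02, 0.0, 1.0], [0.07, 0.0, 1.0]])
anchors, flow = flow_from_head_model(v_src, v_drv, K)
print(flow)  # per-vertex displacement back toward the source
```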
Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
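The rotational realignment step rests on a standard identity: for a pure rotation R with intrinsics K, the infinite homography H_inf = K R K^{-1} warps one view onto the other as if the scene were at infinity, so any residual misalignment is attributable to translation. A minimal sketch with hypothetical intrinsics:

```python
import numpy as np

def rotational_homography(K: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Infinite homography H_inf = K R K^{-1}.

    Compensates pure camera rotation between two views, which is the
    idea behind realigning feature maps before translation prediction.
    """
    return K @ R @ np.linalg.inv(K)

# Toy example: hypothetical intrinsics and a small 5-degree yaw.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
t = np.radians(5.0)
R = np.array([[ np.cos(t), 0.0, np.sin(t)],
              [ 0.0,       1.0, 0.0      ],
              [-np.sin(t), 0.0, np.cos(t)]])
H = rotational_homography(K, R)
p = np.array([320.0, 240.0, 1.0])  # principal point, homogeneous
q = H @ p
print(q[:2] / q[2])  # shifts to ~[363.7, 240.0] under the yaw
```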
In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods, which typically predict intermediate 2D keypoints and then apply a Perspective-n-Point solver, have shown strong performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. Moreover, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness for pose prediction. They also, in most cases, predict discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head that takes into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
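Covariance pooling itself is straightforward to sketch: flatten a C x H x W feature map into C-dimensional samples, form the C x C covariance, and regularize so the result is strictly SPD (hence Cholesky-factorizable). The toy feature map below stands in for a CNN backbone's output:

```python
import numpy as np

def covariance_pool(features: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Second-order (covariance) pooling of a C x H x W feature map.

    Returns a C x C symmetric matrix; the eps * I regularizer keeps it
    strictly positive definite so a Cholesky factor exists.
    """
    C = features.shape[0]
    X = features.reshape(C, -1)                 # C x (H*W) samples
    X = X - X.mean(axis=1, keepdims=True)
    cov = (X @ X.T) / (X.shape[1] - 1)
    return cov + eps * np.eye(C)

# Toy feature map; in the paper this would come from a CNN backbone.
feat = np.random.default_rng(0).standard_normal((8, 16, 16))
S = covariance_pool(feat)
L = np.linalg.cholesky(S)  # lower-triangular factor, S = L L^T
print(S.shape, np.allclose(S, L @ L.T))
```

The proposed pose encoding is analogous: pose is represented as an SPD matrix and regressed through its Cholesky factor, which keeps the representation continuous and on-manifold.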
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
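The headline tracking metric, ATE RMSE, is the root-mean-square of per-frame camera position errors after trajectory alignment. A minimal sketch, assuming alignment (e.g. a Sim(3)/Umeyama fit) has already been applied:

```python
import numpy as np

def ate_rmse(traj_est: np.ndarray, traj_gt: np.ndarray) -> float:
    """Absolute Trajectory Error (RMSE) over Nx3 camera positions.

    Assumes the estimated trajectory was already aligned to the
    ground truth, as is standard before reporting ATE RMSE.
    """
    err = np.linalg.norm(traj_est - traj_gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy check: a constant 1 cm offset per axis.
est = np.zeros((100, 3))
gt = np.full((100, 3), 0.01)
print(ate_rmse(est, gt))  # ~0.0173 for this synthetic offset
```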
Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, with a multiscale Spatial-Channel Self-Calibration Head (SC2Head). Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. SC2Head attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1,176 mounting instances; it follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex, cluttered environments. Dataset and code are available at https://github.com/elianafang/FSMC-Pose.
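As a hand-crafted illustration of the frequency-spatial idea (not the learned SFEBlock itself), a simple FFT high-pass suppresses the slowly varying background while preserving edges and texture:

```python
import numpy as np

def highpass_enhance(img: np.ndarray, cutoff: float = 0.05) -> np.ndarray:
    """Illustrative frequency-domain enhancement of a 2D float image.

    Zeroes out the lowest spatial frequencies (slowly varying
    background) and keeps edges/texture. This is a fixed stand-in for
    a learned frequency branch, not the paper's SFEBlock.
    """
    H, W = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.mgrid[0:H, 0:W]
    dist = np.hypot((yy - H / 2) / H, (xx - W / 2) / W)
    F[dist < cutoff] = 0.0  # remove low-frequency background content
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

# Toy usage on random data standing in for an image crop.
img = np.random.default_rng(0).random((64, 64))
print(highpass_enhance(img).shape)
```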
Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for ITW valence-arousal estimation. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
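The evaluation metric, CCC, rewards both correlation and agreement in scale and offset, which is why it is preferred over plain Pearson correlation for valence-arousal regression. A minimal sketch of the standard formula:

```python
import numpy as np

def ccc(pred: np.ndarray, target: np.ndarray) -> float:
    """Concordance Correlation Coefficient, the ABAW valence/arousal metric.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2);
    unlike Pearson's r, it penalizes bias in scale and offset.
    """
    mx, my = pred.mean(), target.mean()
    vx, vy = pred.var(), target.var()
    cov = ((pred - mx) * (target - my)).mean()
    return float(2 * cov / (vx + vy + (mx - my) ** 2))

# Hypothetical per-frame valence predictions vs. annotations.
print(ccc(np.array([0.1, 0.3, 0.5, 0.2]), np.array([0.0, 0.4, 0.6, 0.1])))
```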
Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations required for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. On MS COCO and CrowdPose, ER-Pose-n achieves AP improvements of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training, respectively, compared with the baseline YOLO-Pose. These improvements are achieved with fewer parameters and higher inference efficiency.
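For reference, the OKS underlying both the evaluation metric and the proposed loss is COCO's keypoint analogue of IoU: a per-keypoint Gaussian similarity in distance, normalized by object scale and keypoint-specific constants. A sketch follows; a smooth loss can then be built as, e.g., 1 - OKS (the exact smoothing used in ER-Pose is not reproduced here):

```python
import numpy as np

def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity between predicted and GT keypoints.

    pred, gt: Kx2 keypoint coordinates; visible: K booleans; area:
    object scale (s^2); sigmas: K per-keypoint falloff constants
    (COCO defines these per keypoint type).
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)
    k2 = (2 * sigmas) ** 2
    e = d2 / (2 * area * k2 + np.finfo(float).eps)
    return float(np.sum(np.exp(-e) * visible) / max(visible.sum(), 1))

# Toy usage with two keypoints and COCO-style sigmas.
sig = np.array([0.026, 0.025])
p = np.array([[100.0, 100.0], [110.0, 100.0]])
g = np.array([[102.0, 101.0], [111.0, 99.0]])
print(oks(p, g, np.array([True, True]), area=900.0, sigmas=sig))
```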
Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and in-browser operation. We therefore target a deployment-oriented operating point rather than the large-backbone, image-based regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
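The closed-form ridge calibrator is a small linear solve, which is what makes it cheap to fit per session and differentiable for episodic meta-training. A sketch with hypothetical shapes (bias handling and the meta-training loop are omitted):

```python
import numpy as np

def fit_ridge_head(Z: np.ndarray, Y: np.ndarray, lam: float = 1e-2):
    """Closed-form ridge calibrator: W = (Z^T Z + lam I)^{-1} Z^T Y.

    Z: N x D session embeddings from the shared encoder; Y: N x 2
    on-screen targets from the brief calibration. Returns the
    per-session linear head.
    """
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ Y)

# Hypothetical 9-point calibration: 9 embeddings -> 9 gaze targets.
rng = np.random.default_rng(0)
Z = rng.standard_normal((9, 16))
Y = rng.standard_normal((9, 2))
W = fit_ridge_head(Z, Y)
print((Z @ W).shape)  # predicted point-of-regard for the 9 samples
```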
Cardiac and lung ultrasound are technically demanding because operators must identify patient-specific intercostal acoustic windows and then navigate between standard views by adjusting probe position, rotation, and force across different imaging planes. These challenges are amplified in teleultrasound, where a novice or robot faces the difficult task of first placing the probe on the patient without in-person expert assistance. We present a framework for automating Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) using only RGB images from a calibrated camera. The novice first captures the patient using the camera on a mixed reality (MR) head-mounted display (HMD). An edge server then infers a patient-specific body-surface and skeleton model, with spatial smoothing across multiple views. Using bony landmarks from the predicted skeleton, we estimate the intercostal region and project the guidance back onto the reconstructed body surface. To validate the framework, we overlaid the reconstructed body mesh and the virtual probe pose guidance across multiple transthoracic echocardiography scan planes in situ and measured the quantitative placement error. Pilot experiments with healthy volunteers suggest that the proposed probe placement prediction and MR guidance yield consistent initial placement within anatomical variability acceptable for teleultrasound setup.
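The in-situ overlay reduces to standard calibrated-camera projection: transform a world-space landmark into the camera frame and apply the intrinsics. A minimal sketch with illustrative names:

```python
import numpy as np

def project_landmark(p_world, T_cam_world, K):
    """Project a 3D landmark into a calibrated camera image.

    p_world: 3-vector in world coordinates; T_cam_world: 4x4 extrinsic
    mapping world -> camera; K: 3x3 intrinsics. This is the basic
    operation behind overlaying skeleton landmarks and the virtual
    probe pose on the HMD view (names here are illustrative).
    """
    p_cam = (T_cam_world @ np.append(p_world, 1.0))[:3]
    uv = K @ p_cam
    return uv[:2] / uv[2]

# Toy usage: a camera looking at the world origin from 2 m away.
K = np.array([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])
T = np.eye(4)
T[2, 3] = 2.0  # world origin ends up at depth 2 m in the camera frame
print(project_landmark(np.zeros(3), T, K))  # -> image center [320, 240]
```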