Detecting deepfakes has become increasingly challenging as forged faces synthesized by AI generation methods, particularly diffusion models, achieve unprecedented quality and resolution. Existing forgery detection approaches relying on spatial and frequency features demonstrate limited efficacy against high-quality, entirely synthesized forgeries. In this paper, we propose a novel detection method grounded in the observation that facial attributes governed by complex physical laws and multiple parameters are inherently difficult to replicate. Specifically, we focus on illumination, particularly the specular reflection component in the Phong illumination model, which poses the greatest replication challenge due to its parametric complexity and nonlinear formulation. We introduce a fast and accurate face texture estimation method based on Retinex theory to enable precise specular reflection separation. Furthermore, drawing from the mathematical formulation of specular reflection, we posit that forgery evidence manifests not only in the specular reflection itself but also in its relationship with the corresponding face texture and direct light. To capture these correlations, we design the Specular-Reflection-Inconsistency-Network (SRI-Net), which incorporates a two-stage cross-attention mechanism and integrates specular-reflection-related features with image features for robust forgery detection. Experimental results demonstrate that our method achieves superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly those containing diffusion-generated forgery faces.
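For reference, the standard textbook form of the Phong illumination model underlying the specular component discussed above is sketched below; the abstract does not give the paper's exact notation or decomposition, so this is only the conventional formulation.

```latex
% Standard Phong model (textbook form); the specular term is the nonlinear,
% multi-parameter component (coefficient $k_s$, shininess $\alpha$, light and
% view geometry) that the abstract argues is hardest for generators to replicate.
I \;=\; \underbrace{k_a\, i_a}_{\text{ambient}}
  \;+\; \underbrace{k_d\,(\hat{L}\cdot\hat{N})\, i_d}_{\text{diffuse}}
  \;+\; \underbrace{k_s\,(\hat{R}\cdot\hat{V})^{\alpha}\, i_s}_{\text{specular}},
\qquad
\hat{R} \;=\; 2(\hat{L}\cdot\hat{N})\hat{N} - \hat{L}.
```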
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM, a conversational unified model that tracks context robustly to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via ``distractor'' turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
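A hypothetical sketch of serializing a multi-turn, interleaved text-image dialogue into a single training stream, in the spirit of the interleaved multi-turn strategy described above. The `Turn` dataclass, role tags, and the `<image>` placeholder token are illustrative assumptions, not ChatUMM's actual API or data format.

```python
from __future__ import annotations
from dataclasses import dataclass, field

IMG_TOKEN = "<image>"  # placeholder later replaced by visual embeddings

@dataclass
class Turn:
    role: str                                        # "user" or "assistant"
    text: str
    images: list[str] = field(default_factory=list)  # image ids attached to this turn

def serialize_dialogue(turns: list[Turn]) -> str:
    """Flatten a conversation into one interleaved stream so that every
    assistant turn is conditioned on the full multimodal history."""
    parts = []
    for t in turns:
        content = " ".join([IMG_TOKEN] * len(t.images) + [t.text])
        parts.append(f"<{t.role}> {content} </{t.role}>")
    return "\n".join(parts)

if __name__ == "__main__":
    demo = [
        Turn("user", "Draw a red bicycle."),
        Turn("assistant", "Here is the bicycle you asked for.", ["gen_0.png"]),
        Turn("user", "Now make the frame blue."),    # history-dependent follow-up
    ]
    print(serialize_dialogue(demo))
```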
Surgical navigation based on multimodal image registration plays a significant role in providing intraoperative guidance to surgeons by showing the position of the target area relative to critical anatomical structures during surgery. However, appearance differences between modalities and intraoperative deformation caused by tissue displacement and removal make effective registration of preoperative and intraoperative multimodal images challenging. To address the multimodal image registration challenges in Learn2Reg 2025, we design an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO). First, the features of each modality are extracted using the modality independent neighborhood descriptor (MIND), mapping the multimodal images into a common feature space. Second, a multilevel pyramidal fusion optimization mechanism achieves global optimization and local detail complementation of the displacement field through dense correlation analysis and weight-balanced coupled convex optimization over input features at different scales. Our method focuses on the ReMIND2Reg task in Learn2Reg 2025, where it achieved first place in both the validation and test phases. MCPO is also validated on the RESECT dataset, achieving an average TRE of 1.798 mm, which demonstrates the broad applicability of our method to preoperative-to-intraoperative image registration. The code is available at https://github.com/wjiazheng/MCPO.
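A highly simplified, hypothetical sketch of a coarse-to-fine (pyramidal) displacement estimation loop in 2-D NumPy, to illustrate the general pyramidal idea only. The actual MCPO method operates on MIND features with dense correlation volumes and weight-balanced coupled convex optimization, none of which is reproduced here; the sketch assumes image dimensions divisible by all pyramid factors and omits warping between levels.

```python
import numpy as np

def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    return img[::factor, ::factor]

def local_best_shift(fix: np.ndarray, mov: np.ndarray, radius: int = 2) -> np.ndarray:
    """Per pixel, pick the integer shift minimizing squared difference within a
    small search window (a crude stand-in for dense correlation analysis)."""
    best = np.zeros(fix.shape + (2,), dtype=np.float32)
    best_cost = np.full(fix.shape, np.inf, dtype=np.float32)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cost = (fix - np.roll(mov, (dy, dx), axis=(0, 1))) ** 2
            mask = cost < best_cost
            best_cost[mask] = cost[mask]
            best[mask] = (dy, dx)
    return best

def pyramidal_register(fixed: np.ndarray, moving: np.ndarray, levels=(4, 2, 1)) -> np.ndarray:
    disp = np.zeros(fixed.shape + (2,), dtype=np.float32)
    for f in levels:
        upd = local_best_shift(downsample(fixed, f), downsample(moving, f))
        # upsample the coarse update to full resolution and accumulate
        disp += np.kron(upd, np.ones((f, f, 1), dtype=np.float32)) * f
    return disp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fixed = rng.random((64, 64))
    moving = np.roll(fixed, (3, -2), axis=(0, 1))    # known rigid shift
    print(pyramidal_register(fixed, moving).mean(axis=(0, 1)))
```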
Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance on three binary classification tasks, where the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83. Beyond classification, we apply attention-based visualizations to identify the fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early detection of underdiagnosed genetic syndromes.
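A minimal, hypothetical sketch of fine-tuning a vision transformer for one of the binary tasks (e.g. control vs. KS), using a torchvision backbone as a stand-in. The paper's exact architecture, preprocessing, and attention-based visualization method are not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

def build_model(num_classes: int = 2) -> nn.Module:
    # weights=None keeps the sketch offline-runnable; in practice a
    # pretrained checkpoint would typically be loaded here.
    model = vit_b_16(weights=None)
    model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)
    return model

def train_step(model, images, labels, optimizer, criterion):
    optimizer.zero_grad()
    logits = model(images)              # (B, 2)
    loss = criterion(logits, labels)    # cross-entropy over the two classes
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = build_model()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    crit = nn.CrossEntropyLoss()
    dummy_imgs = torch.randn(4, 3, 224, 224)        # stand-in fingerprint crops
    dummy_lbls = torch.randint(0, 2, (4,))
    print(train_step(model, dummy_imgs, dummy_lbls, opt, crit))
```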
Image-based patient-specific simulation of left ventricular (LV) mechanics is valuable for understanding cardiac function and supporting clinical intervention planning, but conventional finite-element analysis (FEA) is computationally intensive. Current graph-based surrogates lack full-cycle prediction capability, and physics-informed neural networks often struggle to converge on complex cardiac geometries. We present CardioGraphFENet (CGFENet), a unified graph-based surrogate for rapid full-cycle estimation of LV myocardial biomechanics, supervised by a large FEA simulation dataset. The proposed model integrates (i) a global--local graph encoder to capture mesh features with weak-form-inspired global coupling, (ii) a gated recurrent unit-based temporal encoder conditioned on the target volume-time signal to model cycle-coherent dynamics, and (iii) a cycle-consistent bidirectional formulation for both loading and inverse unloading within a single framework. These strategies enable high fidelity with respect to traditional FEA ground truth and produce physiologically plausible pressure-volume loops that match FEA results when coupled with a lumped-parameter model. In particular, the cycle-consistency strategy enables a significant reduction in FEA supervision with only minimal loss in accuracy.
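An illustrative, hypothetical sketch of the two ingredients named above: a simple mean-aggregation message-passing encoder over mesh nodes and a GRU temporal head conditioned on the volume-time signal. Module names, dimensions, and the aggregation scheme are assumptions; this is not the CGFENet architecture.

```python
import torch
import torch.nn as nn

class SimpleGraphEncoder(nn.Module):
    """One round of mean-aggregation message passing over mesh edges."""
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, hid_dim)

    def forward(self, x, edge_index):               # x: (N, F), edge_index: (2, E)
        src, dst = edge_index
        agg = torch.zeros_like(x).index_add_(0, dst, x[src])
        deg = torch.zeros(x.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
        agg = agg / deg.clamp(min=1)                # mean over incoming neighbours
        return torch.relu(self.lin(torch.cat([x, agg], dim=-1)))

class TemporalHead(nn.Module):
    """GRU over the cardiac cycle, conditioned on the target volume at each step."""
    def __init__(self, hid_dim: int):
        super().__init__()
        self.gru = nn.GRU(hid_dim + 1, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, 3)            # per-node displacement (x, y, z)

    def forward(self, node_feat, volumes):          # node_feat: (N, H), volumes: (T,)
        T = volumes.numel()
        seq = torch.cat([node_feat.unsqueeze(1).expand(-1, T, -1),
                         volumes.view(1, T, 1).expand(node_feat.size(0), -1, -1)], dim=-1)
        h, _ = self.gru(seq)                        # (N, T, H)
        return self.out(h)                          # (N, T, 3)

if __name__ == "__main__":
    x = torch.randn(4, 8)                           # 4 mesh nodes, 8 features each
    edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
    feats = SimpleGraphEncoder(8, 16)(x, edges)
    disp = TemporalHead(16)(feats, torch.linspace(120.0, 50.0, steps=10))
    print(disp.shape)                               # torch.Size([4, 10, 3])
```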
Ultrasound imaging is widely used in noninvasive medical diagnostics due to its efficiency, portability, and avoidance of ionizing radiation. However, its utility is limited by the quality of the signal. Signal-dependent speckle noise, signal-independent sensor noise, and non-uniform spatial blurring caused by the transducer and modeled by the point spread function (PSF) degrade the image quality. These degradations challenge conventional image restoration methods, which assume simplified noise models, and highlight the need for specialized algorithms capable of effectively reducing the degradations while preserving fine structural details. We propose DEMIX, a novel dual-encoder denoising framework with a masked gated fusion mechanism, for denoising ultrasound images degraded by mixed noise and further degraded by PSF-induced distortions. DEMIX is inspired by diffusion models and is characterized by a forward process and a deterministic reverse process. DEMIX adaptively assesses the different noise components, disentangles them in the latent space, and suppresses these components while compensating for PSF degradations. Extensive experiments on two ultrasound datasets, along with a downstream segmentation task, demonstrate that DEMIX consistently outperforms state-of-the-art baselines, achieving superior noise suppression and preserving structural details. The code will be made publicly available.
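A hypothetical sketch of a masked gated fusion of two encoder branches, the mechanism the DEMIX abstract names; the actual encoders, diffusion-style forward/reverse processes, and PSF compensation are not shown, and all module and tensor names here are assumptions.

```python
import torch
import torch.nn as nn

class MaskedGatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # per-pixel gate deciding how much each branch contributes
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_speckle: torch.Tensor, feat_sensor: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # feat_*: (B, C, H, W) features from the two noise-specific encoders,
        # mask:   (B, 1, H, W) spatial weighting (e.g. tissue vs. background)
        g = self.gate(torch.cat([feat_speckle, feat_sensor], dim=1))
        fused = g * feat_speckle + (1 - g) * feat_sensor
        return mask * fused

if __name__ == "__main__":
    fusion = MaskedGatedFusion(channels=32)
    a, b = torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64)
    m = torch.ones(2, 1, 64, 64)
    print(fusion(a, b, m).shape)                    # torch.Size([2, 32, 64, 64])
```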
Techniques for feedforward networks (FFNs) and convolutional networks (CNNs) are frequently reused across families, but the relationship between the underlying model classes is rarely made explicit. We introduce a unified node-level formalization with tensor-valued activations and show that generalized feedforward networks form a strict subset of generalized convolutional networks. Motivated by the mismatch in per-input parameterization between the two families, we propose model projection, a parameter-efficient transfer learning method for CNNs that freezes pretrained per-input-channel filters and learns a single scalar gate for each (output channel, input channel) contribution. Projection keeps all convolutional layers adaptable to downstream tasks while substantially reducing the number of trained parameters in convolutional layers. We prove that projected nodes take the generalized FFN form, enabling projected CNNs to inherit feedforward techniques that do not rely on homogeneous layer inputs. Experiments across multiple ImageNet-pretrained backbones and several downstream image classification datasets show that model projection is a strong transfer learning baseline under simple training recipes.
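As a concrete illustration of the model projection idea, the following hypothetical PyTorch module freezes a pretrained convolution's per-input-channel filters and trains only one scalar gate per (output channel, input channel) pair. Bias handling and the groups=1 assumption are simplifications of my own, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedConv2d(nn.Module):
    """Wraps a pretrained Conv2d (groups=1 assumed): filters are frozen, and
    only one scalar gate per (output, input) channel pair is trainable."""
    def __init__(self, pretrained: nn.Conv2d):
        super().__init__()
        # frozen pretrained weight: (C_out, C_in, k, k)
        self.register_buffer("weight", pretrained.weight.detach().clone())
        self.register_buffer(
            "bias",
            pretrained.bias.detach().clone() if pretrained.bias is not None else None,
        )
        self.stride, self.padding = pretrained.stride, pretrained.padding
        c_out, c_in = self.weight.shape[:2]
        # the only trained parameters: one gate per (output, input) channel pair
        self.gates = nn.Parameter(torch.ones(c_out, c_in))

    def forward(self, x):
        w = self.weight * self.gates[:, :, None, None]   # rescale each frozen filter
        return F.conv2d(x, w, self.bias, stride=self.stride, padding=self.padding)

if __name__ == "__main__":
    base = nn.Conv2d(16, 32, kernel_size=3, padding=1)
    proj = ProjectedConv2d(base)
    out = proj(torch.randn(1, 16, 28, 28))
    trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
    print(out.shape, trainable)                     # 32*16 = 512 trained parameters
```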
Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single drone models, and collected in constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves a rotation MAE of 17.34 deg and MedAE of 17.1 deg, and a translation MAE of 0.135 m and MedAE of 0.242 m, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
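A minimal sketch of the ray-based step the abstract mentions: back-projecting detected 2-D keypoints to unit viewing rays with the camera intrinsics. How DroneKey++ combines these rays with class embeddings to regress the 6-DoF pose is not reproduced here, and the intrinsics values below are placeholders.

```python
import numpy as np

def keypoints_to_rays(kps_px: np.ndarray, K: np.ndarray) -> np.ndarray:
    """kps_px: (N, 2) pixel coordinates, K: (3, 3) intrinsics -> (N, 3) unit rays."""
    ones = np.ones((kps_px.shape[0], 1))
    homog = np.hstack([kps_px, ones])               # (N, 3) homogeneous pixels
    rays = (np.linalg.inv(K) @ homog.T).T           # directions in the camera frame
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

if __name__ == "__main__":
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    kps = np.array([[320.0, 240.0], [400.0, 200.0]])
    print(keypoints_to_rays(kps, K))
```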
Wireless capsule endoscopy (WCE) has transformed gastrointestinal (GI) diagnostics by enabling noninvasive visualization of the digestive tract, yet its diagnostic yield remains constrained by the absence of biopsy capability, as histological analysis is still the gold standard for confirming disease. Conventional biopsy using forceps, needles, or rotating blades is invasive, limited in reach, and carries risks of perforation or mucosal trauma, while fluid- or microbiota-sampling capsules cannot provide structured tissue for pathology, leaving a critical gap in swallowable biopsy solutions. Here we present the Kiri-Capsule, a kirigami-inspired capsule robot that integrates deployable polyimide (PI) film flaps actuated by a compact dual-cam mechanism to achieve minimally invasive and repeatable tissue collection. The kirigami surface remains flat during locomotion but transforms into sharp protrusions upon cam-driven stretching, enabling controlled penetration followed by rotary scraping, with specimens retained in internal fan-shaped cavities. Bench tests confirmed that the PI films exhibit a Young's modulus of approximately 20 MPa and stable deployment angles (about 34$^\circ$ at 15% strain), while ex vivo porcine studies demonstrated shallow penetration depths (median $\sim$0.61 mm, range 0.46--0.66 mm) and biopsy yields comparable to standard forceps (mean $\sim$10.9 mg for stomach and $\sim$18.9 mg for intestine), with forces within safe ranges reported for GI biopsy. These findings demonstrate that the Kiri-Capsule bridges passive imaging and functional biopsy, providing a swallowable, depth-controlled, and histology-ready solution that advances capsule-based diagnostics toward safe and effective clinical application.
Pelvic diseases in women of reproductive age represent a major global health burden, with diagnosis frequently delayed due to high anatomical variability, which complicates MRI interpretation. Existing AI approaches are largely disease-specific and lack real-time compatibility, limiting generalizability and clinical integration. To address these challenges, we establish a benchmark framework for disease- and parameter-agnostic, real-time-compatible unsupervised anomaly detection in pelvic MRI. The method uses a residual variational autoencoder trained exclusively on healthy sagittal T2-weighted scans acquired across diverse imaging protocols to model normal pelvic anatomy. During inference, reconstruction error heatmaps indicate deviations from the learned healthy structure, enabling detection of pathological regions without labeled abnormal data. The model is trained on 294 healthy scans and augmented with diffusion-generated synthetic data to improve robustness. Quantitative evaluation on the publicly available Uterine Myoma MRI Dataset yields an average area-under-the-curve (AUC) value of 0.736, with a sensitivity of 0.828 and a specificity of 0.692. Additional inter-observer clinical evaluation extends the analysis to endometrial cancer, endometriosis, and adenomyosis, revealing the influence of anatomical heterogeneity and inter-observer variability on performance interpretation. With a reconstruction speed of approximately 92.6 frames per second, the proposed framework establishes a baseline for unsupervised anomaly detection in the female pelvis and supports future integration into real-time MRI. Code is available upon request (https://github.com/AniKnu/UADPelvis), and prospective datasets are available for academic collaboration.
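An illustrative sketch of the inference step described above: a VAE-style model trained only on healthy scans reconstructs a slice, and the per-pixel reconstruction error serves as the anomaly heatmap. The tiny network below is a stand-in for demonstration, not the paper's residual variational autoencoder.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal fully-connected VAE over 128x128 single-channel slices."""
    def __init__(self, latent: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, latent), nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 128 * 128), nn.Sigmoid())

    def forward(self, x):                            # x: (B, 1, 128, 128)
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z).view_as(x), mu, logvar

@torch.no_grad()
def anomaly_heatmap(model: nn.Module, slice_: torch.Tensor) -> torch.Tensor:
    recon, _, _ = model(slice_)
    return (slice_ - recon).abs()                    # high error = deviation from "healthy"

if __name__ == "__main__":
    model = TinyVAE().eval()                         # untrained stand-in
    heat = anomaly_heatmap(model, torch.rand(1, 1, 128, 128))
    print(heat.shape, float(heat.mean()))
```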