Abstract: Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equation (ODE)-based RGB-LiDAR fusion with Vectors of Locally Aggregated Queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of the proposed MAG-VLAQ. Notably, on KITTI360-AG, MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting compared with 34.5 from the closest competing approach.
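The ODE-conditioned coupling can be pictured with a short sketch. The minimal PyTorch code below is a hypothetical illustration, not the paper's implementation: the class name, dimensions, fixed-step Euler solver, and mean-pooled initial state are all our assumptions. It shows the pattern the abstract describes, where a neural ODE evolves a fused RGB-LiDAR state and that state shifts the learnable query centers before VLAD-style residual aggregation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ODEConditionedVLAQ(nn.Module):
    """Sketch only: hypothetical shapes/names, not the authors' released code."""
    def __init__(self, dim=256, num_queries=64, ode_steps=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # global retrieval prototypes
        self.ode_func = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.offset = nn.Linear(dim, num_queries * dim)  # fused state -> per-query shifts
        self.ode_steps = ode_steps

    def forward(self, rgb_tokens, lidar_tokens):
        # Both token sets are assumed already projected to a shared D-dim space.
        tokens = torch.cat([rgb_tokens, lidar_tokens], dim=1)   # (B, N, D)
        state = tokens.mean(dim=1)                              # (B, D) fused initial state
        dt = 1.0 / self.ode_steps
        for _ in range(self.ode_steps):                         # fixed-step Euler: dz/dt = f(z)
            state = state + dt * self.ode_func(state)
        B, D = state.shape
        q = self.queries + self.offset(state).view(B, -1, D)    # scene-adapted query centers
        attn = torch.softmax(tokens @ q.transpose(1, 2), dim=-1)  # (B, N, K) soft assignment
        # VLAD-style residual aggregation: sum_n a_nk * (x_n - c_k) per query k.
        desc = torch.einsum('bnk,bnd->bkd', attn, tokens) - attn.sum(1).unsqueeze(-1) * q
        return F.normalize(desc.flatten(1), dim=-1)             # global descriptor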
Abstract: Hypertrophic Cardiomyopathy (HCM) is a genetic heart disease affecting approximately 1 in 500 people and is the leading cause of sudden cardiac death in young athletes. Current diagnostic methods -- cardiovascular magnetic resonance (CMR), echocardiography, and genetic testing -- are limited by high costs, operator dependency, or insufficient accuracy, while standard electrocardiogram (ECG) analysis cannot reliably distinguish HCM from acquired left ventricular hypertrophy (LVH). This paper presents a wearable ECG device paired with a classification algorithm that differentiates HCM from acquired LVH using ECG signals alone. The portable device integrates a 3-lead electrode system, an AD8232 signal conditioning module, an Arduino Nano 33 BLE microcontroller, and a lithium polymer battery. The algorithm extracts two quantitative indices -- HCM Index~1 and HCM Index~2 -- from each heartbeat and classifies patients via dual statistical thresholds. Validation on 483 LVH patients (PhysioNet) and 29 HCM patients (digitized clinical records) yields 75.86\% sensitivity, 99.17\% specificity, and an F1-score of 80.00\%. Leave-one-out cross-validation confirms generalizability, with cross-validated sensitivity of 72.41\%, specificity of 98.96\%, and F1-score of 76.36\% (95\% confidence intervals reported). A digitization confound analysis demonstrates that the classification is driven by physiological cardiac features rather than data-source artifacts. A simulated device acquisition chain analysis confirms that the wearable hardware's signal characteristics are compatible with the classification algorithm. The system offers a promising tool for affordable HCM screening in resource-limited settings.
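As a rough illustration of the dual-threshold decision rule, consider the sketch below. The per-index summary statistic and the threshold values are placeholders; the paper's actual HCM Index 1/2 definitions and calibrated cut-offs are not reproduced here. It only shows the patient-level logical AND of two per-index tests that the abstract describes.

import numpy as np

def classify_patient(index1_per_beat, index2_per_beat, thresh1=0.5, thresh2=0.5):
    """Flag HCM when the patient-level statistics of both indices exceed
    their thresholds (AND of two tests). Thresholds here are hypothetical."""
    m1 = np.median(index1_per_beat)  # robust per-patient summary of HCM Index 1
    m2 = np.median(index2_per_beat)  # robust per-patient summary of HCM Index 2
    return bool(m1 > thresh1 and m2 > thresh2)  # True -> HCM, False -> acquired LVH

# Example: per-beat index values for one record (made-up numbers)
beats_idx1 = np.array([0.62, 0.58, 0.71, 0.64])
beats_idx2 = np.array([0.55, 0.60, 0.52, 0.57])
print("HCM" if classify_patient(beats_idx1, beats_idx2) else "LVH")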
Abstract: One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs with robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Locally Aggregated Queries (VLAQ), a query-residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, improving stability while preserving fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.
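A minimal sketch of the residual-guided complementary fusion, under our own assumptions: the dimensions, module names, and scalar tanh gate are illustrative rather than the released DC-VLAQ code, and the DINOv2 and CLIP token grids are assumed resampled to a common layout. DINOv2 tokens form the anchor path, while CLIP tokens contribute a learned residual correction that starts at zero, so training begins from pure DINOv2 features.

import torch
import torch.nn as nn

class ResidualComplementaryFusion(nn.Module):
    """Sketch only: hypothetical dimensions and gating, not the paper's code."""
    def __init__(self, dino_dim=768, clip_dim=512, out_dim=768):
        super().__init__()
        self.anchor = nn.Linear(dino_dim, out_dim)   # anchor path in DINOv2 space
        self.residual = nn.Sequential(               # residual correction from CLIP
            nn.Linear(clip_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))
        self.gate = nn.Parameter(torch.zeros(1))     # zero-init: pure DINOv2 at start

    def forward(self, dino_tokens, clip_tokens):
        # dino_tokens: (B, N, dino_dim); clip_tokens: (B, N, clip_dim),
        # assumed spatially aligned to the same N-token grid.
        return self.anchor(dino_tokens) + torch.tanh(self.gate) * self.residual(clip_tokens)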




Abstract: Humans watch more than a billion hours of video per day. Most of this video was edited manually, which is a tedious process. However, AI-enabled video generation and editing are on the rise. Building on text-to-image models such as Stable Diffusion and Imagen, generative AI has improved dramatically on video tasks. Yet progress on these tasks is hard to evaluate because there is no standard benchmark. We therefore propose a new dataset for text-guided video editing (TGVE), and we ran a competition at CVPR to evaluate models on our TGVE dataset. In this paper we present a retrospective on the competition and describe the winning method. The competition dataset is available at https://sites.google.com/view/loveucvpr23/track4.