Abstract:Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view feature of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model feature, STFTrack introduces a acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
Abstract:Bin picking is a challenging robotic task due to occlusions and physical constraints that limit visual information for object recognition and grasping. Existing approaches often rely on known CAD models or prior object geometries, restricting generalization to novel or unknown objects. Other methods directly regress grasp poses from RGB-D data without object priors, but the inherent noise in depth sensing and the lack of object understanding make grasp synthesis and evaluation more difficult. Superquadrics (SQ) offer a compact, interpretable shape representation that captures the physical and graspability understanding of objects. However, recovering them from limited viewpoints is challenging, as existing methods rely on multiple perspectives for near-complete point cloud reconstruction, limiting their effectiveness in bin-picking. To address these challenges, we propose \textbf{RGBSQGrasp}, a grasping framework that leverages superquadric shape primitives and foundation metric depth estimation models to infer grasp poses from a monocular RGB camera -- eliminating the need for depth sensors. Our framework integrates a universal, cross-platform dataset generation pipeline, a foundation model-based object point cloud estimation module, a global-local superquadric fitting network, and an SQ-guided grasp pose sampling module. By integrating these components, RGBSQGrasp reliably infers grasp poses through geometric reasoning, enhancing grasp stability and adaptability to unseen objects. Real-world robotic experiments demonstrate a 92\% grasp success rate, highlighting the effectiveness of RGBSQGrasp in packed bin-picking environments.
Abstract:Despite great progress in multimodal tracking, these trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of template-search area integrated feature with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through module fine-tuning paradigm. Comprehensive experiments show that our LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in SR and PR on the LasHeR benchmark, which it achieves 2.6x reduction in parameters and 2.7x speedup. It runs in real-time on the CPU at a speed of 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.
Abstract:We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models. While such sanitized conditions simplify evaluation, they fail to capture the unpredictable, noisy complexities of real-world environments. Dynamic motion, sensor imperfections, and synchronization perturbations lead to sharp performance declines when these models are deployed in practice, revealing an urgent need for frameworks that embrace and excel under real-world noise. To bridge this gap, we tackle three core challenges: scalable data generation, comprehensive benchmarking, and model robustness enhancement. First, we introduce a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors. Second, we leverage this pipeline to create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation, highlighting the limitations of current learning-based methods in ego-motion accuracy and 3D reconstruction quality. Third, we propose Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with rendered RGB-D frames from clean 3D map, enhancing geometric alignment and appearance restoration through visual correspondence. Extensive experiments on synthetic and real-world data demonstrate that CorrGS consistently outperforms prior state-of-the-art methods, particularly in scenarios involving rapid motion and dynamic illumination.
Abstract:As large language models (LLMs) continue to advance, the demand for higher quality and faster processing of long contexts across various applications is growing. KV cache is widely adopted as it stores previously generated key and value tokens, effectively reducing redundant computations during inference. However, as memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. Most existing methods perform compression from two perspectives: identifying important tokens and designing compression strategies. However, these approaches often produce biased distributions of important tokens due to the influence of accumulated attention scores or positional encoding. Furthermore, they overlook the sparsity and redundancy across different heads, which leads to difficulties in preserving the most effective information at the head level. To this end, we propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. Specifically, we introduce a Global-Local score that combines accumulated attention scores from both global and local KV tokens to better identify the token importance. For the compression strategy, we design an adaptive and unified Evict-then-Merge framework that accounts for the sparsity and redundancy of KV tokens across different heads. Additionally, we implement the head-wise parallel compression through a zero-class mechanism to enhance efficiency. Extensive experiments demonstrate our SOTA performance even under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
Abstract:Neural operators are aiming at approximating operators mapping between Banach spaces of functions, achieving much success in the field of scientific computing. Compared to certain deep learning-based solvers, such as Physics-Informed Neural Networks (PINNs), Deep Ritz Method (DRM), neural operators can solve a class of Partial Differential Equations (PDEs). Although much work has been done to analyze the approximation and generalization error of neural operators, there is still a lack of analysis on their training error. In this work, we conduct the convergence analysis of gradient descent for the wide shallow neural operators within the framework of Neural Tangent Kernel (NTK). The core idea lies on the fact that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, yielding the linear convergence of gradient descent. In this work, we demonstrate that under the setting of over-parametrization, gradient descent can find the global minimum regardless of whether it is in continuous time or discrete time.
Abstract:Vision-centric autonomous driving has demonstrated excellent performance with economical sensors. As the fundamental step, 3D perception aims to infer 3D information from 2D images based on 3D-2D projection. This makes driving perception models susceptible to sensor configuration (e.g., camera intrinsics and extrinsics) variations. However, generalizing across camera configurations is important for deploying autonomous driving models on different car models. In this paper, we present UniDrive, a novel framework for vision-centric autonomous driving to achieve universal perception across camera configurations. We deploy a set of unified virtual cameras and propose a ground-aware projection method to effectively transform the original images into these unified virtual views. We further propose a virtual configuration optimization method by minimizing the expected projection error between original cameras and virtual cameras. The proposed virtual camera projection can be applied to existing 3D perception methods as a plug-and-play module to mitigate the challenges posed by camera parameter variability, resulting in more adaptable and reliable driving perception models. To evaluate the effectiveness of our framework, we collect a dataset on Carla by driving the same routes while only modifying the camera configurations. Experimental results demonstrate that our method trained on one specific camera configuration can generalize to varying configurations with minor performance degradation.
Abstract:Emotion Recognition in Conversations (ERCs) is a vital area within multimodal interaction research, dedicated to accurately identifying and classifying the emotions expressed by speakers throughout a conversation. Traditional ERC approaches predominantly rely on unimodal cues\-such as text, audio, or visual data\-leading to limitations in their effectiveness. These methods encounter two significant challenges: 1) Consistency in multimodal information. Before integrating various modalities, it is crucial to ensure that the data from different sources is aligned and coherent. 2) Contextual information capture. Successfully fusing multimodal features requires a keen understanding of the evolving emotional tone, especially in lengthy dialogues where emotions may shift and develop over time. To address these limitations, we propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the ERC task. MaTAV is with the advantages of aligning unimodal features to ensure consistency across different modalities and handling long input sequences to better capture contextual multimodal information. The extensive experiments on the MELD and IEMOCAP datasets demonstrate that MaTAV significantly outperforms existing state-of-the-art methods on the ERC task with a big margin.
Abstract:Continuous blood pressure (BP) monitoring is essential for timely diagnosis and intervention in critical care settings. However, BP varies significantly across individuals, this inter-patient variability motivates the development of personalized models tailored to each patient's physiology. In this work, we propose a personalized BP forecasting model mainly using electrocardiogram (ECG) and photoplethysmogram (PPG) signals. This time-series model incorporates 2D representation learning to capture complex physiological relationships. Experiments are conducted on datasets collected from three diverse scenarios with BP measurements from 60 subjects total. Results demonstrate that the model achieves accurate and robust BP forecasts across scenarios within the Association for the Advancement of Medical Instrumentation (AAMI) standard criteria. This reliable early detection of abnormal fluctuations in BP is crucial for at-risk patients undergoing surgery or intensive care. The proposed model provides a valuable addition for continuous BP tracking to reduce mortality and improve prognosis.
Abstract:Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this manuscript, we introduce Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon Fourier Neural Operator (FNO), while simultaneously incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as Deep Operator Network (DeepONet), leading to potential similar SPDEs solvers. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations.