Abstract:Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this manuscript, we introduce Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon Fourier Neural Operator (FNO), while simultaneously incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as Deep Operator Network (DeepONet), leading to potential similar SPDEs solvers. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations.
Abstract:Physics-informed deep learning (PIDL)-based models have recently garnered remarkable success in traffic state estimation (TSE). However, the prior knowledge used to guide regularization training in current mainstream architectures is based on deterministic physical models. The drawback is that a solely deterministic model fails to capture the universally observed traffic flow dynamic scattering effect, thereby yielding unreliable outcomes for traffic control. This study, for the first time, proposes stochastic physics-informed deep learning (SPIDL) for traffic state estimation. The idea behind such SPIDL is simple and is based on the fact that a stochastic fundamental diagram provides the entire range of possible speeds for any given density with associated probabilities. Specifically, we select percentile-based fundamental diagram and distribution-based fundamental diagram as stochastic physics knowledge, and design corresponding physics-uninformed neural networks for effective fusion, thereby realizing two specific SPIDL models, namely \text{$\alpha$}-SPIDL and \text{$\cal B$}-SPIDL. The main contribution of SPIDL lies in addressing the "overly centralized guidance" caused by the one-to-one speed-density relationship in deterministic models during neural network training, enabling the network to digest more reliable knowledge-based constraints.Experiments on the real-world dataset indicate that proposed SPIDL models achieve accurate traffic state estimation in sparse data scenarios. More importantly, as expected, SPIDL models reproduce well the scattering effect of field observations, demonstrating the effectiveness of fusing stochastic physics model knowledge with deep learning frameworks.
Abstract:Full waveform inversion (FWI) plays a crucial role in the field of geophysics. There has been lots of research about applying deep learning (DL) methods to FWI. The success of DL-FWI relies significantly on the quantity and diversity of the datasets. Nevertheless, existing FWI datasets, like OpenFWI, where sources have fixed locations or identical frequencies, provide limited information and do not represent the complex real-world scene. For instance, low frequencies help in resolving larger-scale structures. High frequencies allow for a more detailed subsurface features. %A single source frequency is insufficient to describe subsurface structural properties. We consider that simultaneously using sources with different frequencies, instead of performing inversion using low frequencies data and then gradually introducing higher frequencies data, has rationale and potential advantages. Hence, we develop three enhanced datasets based on OpenFWI where each source have varying locations, frequencies or both. Moreover, we propose a novel deep operator network (DeepONet) architecture Inversion-DeepONet for FWI. We utilize convolutional neural network (CNN) to extract the features from seismic data in branch net. Source parameters, such as locations and frequencies, are fed to trunk net. Then another CNN is employed as the decoder of DeepONet to reconstruct the velocity models more effectively. Through experiments, we confirm the superior performance on accuracy and generalization ability of our network, compared with existing data-driven FWI methods.
Abstract:Underwater object detection has higher requirements of running speed and deployment efficiency for the detector due to its specific environmental challenges. NMS of two- or one-stage object detectors and transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. As for the detrimental effect of underwater color cast noise, recent underwater object detectors make network architecture or training complex, which also hinders their application and deployment on underwater vehicle platforms. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), the query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Separate Contrastive DeNoising Forward and Deformable Convolution in SIM. To address the underwater color cast noise issue, we propose an underwater color denoising query to improve the generalization of the model for the biased object feature information by different color cast noise. Our U-DECN, with ResNet-50 backbone, achieves 61.4 AP (50 epochs), 63.3 AP (72 epochs), 64.0 AP (100 epochs) on DUO, and 21 FPS (5 times faster than Deformable DETR and DINO 4 FPS) on NVIDIA AGX Orin by TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at https://github.com/LEFTeyex/U-DECN.
Abstract:First-order methods, such as gradient descent (GD) and stochastic gradient descent (SGD), have been proven effective in training neural networks. In the context of over-parameterization, there is a line of work demonstrating that randomly initialized (stochastic) gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to a slow training process. In this paper, we show that for the $L^2$ regression problems, the learning rate can be improved from $\mathcal{O}(\lambda_0/n^2)$ to $\mathcal{O}(1/\|\bm{H}^{\infty}\|_2)$, which implies that GD actually enjoys a faster convergence rate. Furthermore, we generalize the method to GD in training two-layer Physics-Informed Neural Networks (PINNs), showing a similar improvement for the learning rate. Although the improved learning rate has a mild dependence on the Gram matrix, we still need to set it small enough in practice due to the unknown eigenvalues of the Gram matrix. More importantly, the convergence rate is tied to the least eigenvalue of the Gram matrix, which can lead to slow convergence. In this work, we provide the convergence analysis of natural gradient descent (NGD) in training two-layer PINNs, demonstrating that the learning rate can be $\mathcal{O}(1)$, and at this rate, the convergence rate is independent of the Gram matrix.
Abstract:We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE~ leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses challenges to both architectural and decision-making aspects. Firstly, while ViTs inherently support variable-token inference, they do not facilitate dynamic computations for variable channels. To overcome this limitation, we propose a meta-network using weight-sharing techniques to support arbitrary channels of the Multi-head Self-Attention and Multi-layer Perceptron layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the structure of the meta-network and input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around $10^{14}$, making supervised learning infeasible. To this end, we design a lightweight selector employing Proximal Policy Optimization for efficient decision-making. Furthermore, we introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a Markov decision process, significantly reducing action space and mitigating delayed-reward issues during training. Extensive experiments demonstrate the effectiveness of PRANCE~ in reducing FLOPs by approximately 50\%, retaining only about 10\% of tokens while achieving lossless Top-1 accuracy. Additionally, our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and sequential pruning-merging strategies. The code is available at \href{https://github.com/ChildTang/PRANCE}{https://github.com/ChildTang/PRANCE}.
Abstract:Optimization algorithms is crucial in training physics-informed neural networks (PINNs), unsuitable methods may lead to poor solutions. Compared to the common gradient descent algorithm, implicit gradient descent (IGD) outperforms it in handling some multi-scale problems. In this paper, we provide convergence analysis for the implicit gradient descent for training over-parametrized two-layer PINNs. We first demonstrate the positive definiteness of Gram matrices for general smooth activation functions, like sigmoidal function, softplus function, tanh function and so on. Then the over-parameterization allows us to show that the randomly initialized IGD converges a globally optimal solution at a linear convergence rate. Moreover, due to the different training dynamics, the learning rate of IGD can be chosen independent of the sample size and the least eigenvalue of the Gram matrix.
Abstract:Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through fine-tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
Abstract:Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Mapping (SLAM) models critical to embodied agent autonomy. While real-world datasets are invaluable, simulation-based benchmarks offer a scalable approach for robustness evaluations. However, the creation of a challenging and controllable noisy world with diverse perturbations remains under-explored. To this end, we propose a novel, customizable pipeline for noisy data synthesis, aimed at assessing the resilience of multi-modal SLAM models against various perturbations. The pipeline comprises a comprehensive taxonomy of sensor and motion perturbations for embodied multi-modal (specifically RGB-D) sensing, categorized by their sources and propagation order, allowing for procedural composition. We also provide a toolbox for synthesizing these perturbations, enabling the transformation of clean environments into challenging noisy simulations. Utilizing the pipeline, we instantiate the large-scale Noisy-Replica benchmark, which includes diverse perturbation types, to evaluate the risk tolerance of existing advanced RGB-D SLAM models. Our extensive analysis uncovers the susceptibilities of both neural (NeRF and Gaussian Splatting -based) and non-neural SLAM models to disturbances, despite their demonstrated accuracy in standard benchmarks. Our code is publicly available at https://github.com/Xiaohao-Xu/SLAM-under-Perturbation.
Abstract:Vision camera and sonar are naturally complementary in the underwater environment. Combining the information from two modalities will promote better observation of underwater targets. However, this problem has not received sufficient attention in previous research. Therefore, this paper introduces a new challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses a challenge to currently popular SOT trackers. Second, we propose an RGB-S tracker called SCANet, which includes a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer and two independent global integration modules. The spatial cross-attention is used to overcome the problem of spatial misalignment of between RGB and sonar images. Third, we propose a SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S-like data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between RGB and sonar modalities and SCANet achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.