Parameter-efficient fine-tuning for pre-trained Vision Transformers aims to adeptly tailor a model to downstream tasks by learning a minimal set of new adaptation parameters while preserving the frozen majority of pre-trained parameters. Striking a balance between retaining the generalizable representation capacity of the pre-trained model and acquiring task-specific features poses a key challenge. Currently, there is a lack of focus on guiding this delicate trade-off. In this study, we approach the problem from the perspective of Singular Value Decomposition (SVD) of pre-trained parameter matrices, providing insights into the tuning dynamics of existing methods. Building upon this understanding, we propose a Residual-based Low-Rank Rescaling (RLRR) fine-tuning strategy. This strategy not only enhances flexibility in parameter tuning but also ensures that new parameters do not deviate excessively from the pre-trained model through a residual design. Extensive experiments demonstrate that our method achieves competitive performance across various downstream image classification tasks, all while maintaining comparable new parameters. We believe this work takes a step forward in offering a unified perspective for interpreting existing methods and serves as motivation for the development of new approaches that move closer to effectively considering the crucial trade-off mentioned above. Our code is available at \href{https://github.com/zstarN70/RLRR.git}{https://github.com/zstarN70/RLRR.git}.
Comprehensive perception of human beings is the prerequisite to ensure the safety of human-robot interaction. Currently, prevailing visual sensing approach typically involves a single static camera, resulting in a restricted and occluded field of view. In our work, we develop an active vision system using multiple cameras to dynamically capture multi-source RGB-D data. An integrated human sensing strategy based on a hierarchically connected tree structure is proposed to fuse localized visual information. Constituting the tree model are the nodes representing keypoints and the edges representing keyparts, which are consistently interconnected to preserve the structural constraints during multi-source fusion. Utilizing RGB-D data and HRNet, the 3D positions of keypoints are analytically estimated, and their presence is inferred through a sliding widow of confidence scores. Subsequently, the point clouds of reliable keyparts are extracted by drawing occlusion-resistant masks, enabling fine registration between data clouds and cylindrical model following the hierarchical order. Experimental results demonstrate that our method enhances keypart recognition recall from 69.20% to 90.10%, compared to employing a single static camera. Furthermore, in overcoming challenges related to localized and occluded perception, the robotic arm's obstacle avoidance capabilities are effectively improved.
Collision detection via visual fences can significantly enhance the safety of collaborative robotic arms. Existing work typically performs such detection based on pre-deployed stationary cameras outside the robotic arm's workspace. These stationary cameras can only provide a restricted detection range and constrain the mobility of the robotic system. To cope with this issue, we propose an active sense method enabling a wide range of collision risk evaluation in dynamic scenarios. First, an active vision mechanism is implemented by equipping cameras with additional degrees of rotation. Considering the uncertainty in the active sense, we design a state confidence envelope to uniformly characterize both known and potential dynamic obstacles. Subsequently, using the observation-based uncertainty evolution, collision risk is evaluated by the prediction of obstacle envelopes. On this basis, a Markov decision process was employed to search for an optimal observation sequence of the active sense system, which enlarges the field of observation and reduces uncertainties in the state estimation of surrounding obstacles. Simulation and real-world experiments consistently demonstrate a 168% increase in the observation time coverage of typical dynamic humanoid obstacles compared to the method using stationary cameras, which underscores our system's effectiveness in collision risk tracking and enhancing the safety of robotic arms.
Leveraging multiple cameras on Unmanned Aerial Vehicles (UAVs) to form a variable-baseline stereo camera for collaborative perception is highly promising. The critical steps include high-rate cross-camera feature association and frame-rate relative pose estimation of multiple UAVs. To accelerate the feature association rate to match the frame rate, we propose a dual-channel structure to decouple the time-consuming feature detection and match from the high-rate image stream. The novel design of periodic guidance and fast prediction effectively utilizes each image frame to achieve a frame-rate feature association. Real-world experiments are executed using SuperPoint and SuperGlue on the NVIDIA NX 8G platform with a 30 Hz image stream. Using single-channel SuperPoint and SuperGlue can only achieve 13 Hz feature association. The proposed dual-channel method can improve the rate of feature association from 13 Hz to 30 Hz, supporting the frame-rate requirement. To accommodate the proposed feature association, we develop a Multi-State Constrained Kalman Filter (MSCKF)-based relative pose estimator in the back-end by fusing the local odometry from two UAVs together with the measurements of common features. Experiments show that the dual-channel feature association improves the rate of visual observation and enhances the real-time performance of back-end estimator compared to the existing methods. Video - https://youtu.be/UBAR1iP0GPk Supplementary video - https://youtu.be/nPq8EpVzJZM
The aspiration of the next generation's autonomous driving (AD) technology relies on the dedicated integration and interaction among intelligent perception, prediction, planning, and low-level control. There has been a huge bottleneck regarding the upper bound of autonomous driving algorithm performance, a consensus from academia and industry believes that the key to surmount the bottleneck lies in data-centric autonomous driving technology. Recent advancement in AD simulation, closed-loop model training, and AD big data engine have gained some valuable experience. However, there is a lack of systematic knowledge and deep understanding regarding how to build efficient data-centric AD technology for AD algorithm self-evolution and better AD big data accumulation. To fill in the identified research gaps, this article will closely focus on reviewing the state-of-the-art data-driven autonomous driving technologies, with an emphasis on the comprehensive taxonomy of autonomous driving datasets characterized by milestone generations, key features, data acquisition settings, etc. Furthermore, we provide a systematic review of the existing benchmark closed-loop AD big data pipelines from the industrial frontier, including the procedure of closed-loop frameworks, key technologies, and empirical studies. Finally, the future directions, potential applications, limitations and concerns are discussed to arouse efforts from both academia and industry for promoting the further development of autonomous driving. The project repository is available at: https://github.com/LincanLi98/Awesome-Data-Centric-Autonomous-Driving.
High Dynamic Range (HDR) images can be recovered from several Low Dynamic Range (LDR) images by existing Deep Neural Networks (DNNs) techniques. Despite the remarkable progress, DNN-based methods still generate ghosting artifacts when LDR images have saturation and large motion, which hinders potential applications in real-world scenarios. To address this challenge, we formulate the HDR deghosting problem as an image generation that leverages LDR features as the diffusion model's condition, consisting of the feature condition generator and the noise predictor. Feature condition generator employs attention and Domain Feature Alignment (DFA) layer to transform the intermediate features to avoid ghosting artifacts. With the learned features as conditions, the noise predictor leverages a stochastic iterative denoising process for diffusion models to generate an HDR image by steering the sampling process. Furthermore, to mitigate semantic confusion caused by the saturation problem of LDR images, we design a sliding window noise estimator to sample smooth noise in a patch-based manner. In addition, an image space loss is proposed to avoid the color distortion of the estimated HDR results. We empirically evaluate our model on benchmark datasets for HDR imaging. The results demonstrate that our approach achieves state-of-the-art performances and well generalization to real-world images.
Structural Health Monitoring (SHM) is crucial for the safety and maintenance of various infrastructures. Due to the large amount of data generated by numerous sensors and the high real-time requirements of many applications, SHM poses significant challenges. Although the cloud-centric stream computing paradigm opens new opportunities for real-time data processing, it consumes too much network bandwidth. In this paper, we propose ECStream, an Edge Cloud collaborative fine-grained stream operator scheduling framework for SHM. We collectively consider atomic and composite operators together with their iterative computability to model and formalize the problem of minimizing bandwidth usage and end-to-end operator processing latency. Preliminary evaluation results show that ECStream can effectively balance bandwidth usage and end-to-end operator computation latency, reducing bandwidth usage by 73.01% and latency by 34.08% on average compared to the cloud-centric approach.
The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to significantly reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at \href{https://github.com/DavidYanAnDe/ARC}{https://github.com/DavidYanAnDe/ARC}.
Decoupled parallel $xyz$ positioning stages with large stroke have been desired in high-speed and precise positioning fields. However, currently such stages are either short in stroke or unqualified in parasitic motion and coupling rate. This paper proposes a novel flexure-based decoupled parallel $xyz$ positioning stage (FlexDelta) and conducts its conceptual design, modeling, and experimental study. Firstly, the working principle of FlexDelta is introduced, followed by its mechanism design with flexure. Secondly, the stiffness model of flexure is established via matrix-based Castigliano's second theorem, and the influence of its lateral stiffness on the stiffness model of FlexDelta is comprehensively investigated and then optimally designed. Finally, experimental study was carried out based on the prototype fabricated. The results reveal that the positioning stage features centimeter-stroke in three axes, with coupling rate less than 0.53%, parasitic motion less than 1.72 mrad over full range. And its natural frequencies are 20.8 Hz, 20.8 Hz, and 22.4 Hz for $x$, $y$, and $z$ axis respectively. Multi-axis path tracking tests were also carried out, which validates its dynamic performance with micrometer error.
Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLP), however, significantly limits speed in training and rendering. In this work, we propose to directly use signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces' spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Continuous Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.