Unsupervised depth completion aims to recover dense depth from sparse depth without using ground-truth annotations. Although depth measurements obtained from LiDAR are usually sparse, they contain valid and real distance information, i.e., scale-consistent absolute depth values. Meanwhile, scale-agnostic counterparts seek to estimate relative depth and have achieved impressive performance. To leverage the inherent characteristics of both, we propose to model scale-consistent depth upon unsupervised scale-agnostic frameworks. Specifically, we propose the decomposed scale-consistent learning (DSCL) strategy, which decomposes the absolute depth into relative depth prediction and global scale estimation, so that each component benefits from being learned individually. Unfortunately, most existing unsupervised scale-agnostic frameworks suffer heavily from depth holes due to the extremely sparse depth input and the weak supervisory signal. To tackle this issue, we introduce the global depth guidance (GDG) module, which attentively propagates a dense depth reference into the sparse target via a novel dense-to-sparse attention. Extensive experiments show the superiority of our method on the outdoor KITTI benchmark, where it ranks 1st and outperforms the best previous method, KBNet, by more than 12% in RMSE. In addition, our approach achieves state-of-the-art performance on the indoor NYUv2 dataset.
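The decomposition of absolute depth into relative depth and a global scale can be illustrated with a minimal sketch. This is not the DSCL implementation: the function names, the median-ratio scale estimate, and the toy values are assumptions made for illustration only, and the abstract does not describe how the scale is actually estimated.

```python
from statistics import median


def estimate_global_scale(relative_depth, sparse_depth):
    """Assumed estimator: median ratio of sparse absolute depth to relative depth.

    Only pixels with a valid sparse measurement (> 0) contribute.
    """
    ratios = [s / r for r, s in zip(relative_depth, sparse_depth) if s > 0 and r > 0]
    if not ratios:
        raise ValueError("no valid sparse depth measurements")
    return median(ratios)


def compose_absolute_depth(relative_depth, scale):
    """Absolute depth = global scale * relative depth, applied per pixel."""
    return [scale * r for r in relative_depth]


# Toy data (hypothetical): four pixels, two of which have a LiDAR return.
relative = [0.125, 0.25, 0.5, 1.0]   # scale-agnostic prediction
sparse = [0.0, 5.0, 0.0, 20.0]       # metres; 0.0 marks a missing measurement

scale = estimate_global_scale(relative, sparse)
absolute = compose_absolute_depth(relative, scale)
```

In this toy case both valid pixels agree on the same ratio, so the dense output reproduces the sparse measurements exactly and fills the two missing pixels with consistently scaled values.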
Tensegrity robots, composed of rigid rods and flexible cables, exhibit high strength-to-weight ratios and extreme deformations, enabling them to navigate unstructured terrain and even survive harsh impacts. However, they are hard to control due to their high dimensionality, complex dynamics, and coupled architecture. Physics-based simulation is one avenue for developing locomotion policies that can then be transferred to real robots, but modeling tensegrity robots is a complex task, so simulations experience a substantial sim2real gap. To address this issue, this paper describes a Real2Sim2Real strategy for tensegrity robots. This strategy is based on a differentiable physics engine that can be trained given limited data from a real robot (i.e., offline measurements and one random trajectory) and achieve a high enough accuracy to discover transferable locomotion policies. Beyond the overall pipeline, key contributions of this work include computing non-zero gradients at contact points, a loss function, and a trajectory segmentation technique that avoids conflicts in gradient evaluation during training. The proposed pipeline is demonstrated and evaluated on a real 3-bar tensegrity robot.
Modern LiDAR sensors find increasing use in safety-critical applications. Therefore, highly accurate modeling of the system's behavior under demanding environmental conditions is necessary. In this paper, we present a modular structure to accurately simulate the amplified raw detector signal of a direct time-of-flight LiDAR system for coaxial transmitter-receiver optics. Our model describes a measurement system based on standard optical components and a detector capable of converting single photons to an electrical signal. To verify the model's predictions, single-point measurements for targets of different reflectivity at defined distances were performed. Statistical analysis shows an R-squared value greater than 0.990 for simulated and measured signal amplitude levels. Noise modeling shows good agreement with the performed measurements for different target irradiance levels. The presented results have a guiding significance in the modeling of the complex signal processing chain of LiDAR systems, as they enable the prediction of key parameters of the system early in the development process. Hence, unnecessary costs caused by design flaws can be mitigated. The modular structure allows easy adaptation to arbitrary LiDAR systems.
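The R-squared figure quoted above can be made concrete with a small sketch. The abstract does not say which definition of R-squared was used; the code below assumes the common coefficient of determination, 1 - SS_res / SS_tot, with the measured amplitudes as reference. The function name is hypothetical and the numbers in the checks are not the paper's data.

```python
def r_squared(measured, simulated):
    """Coefficient of determination of `simulated` against `measured`.

    Assumed definition: 1 - SS_res / SS_tot, where SS_tot is taken
    around the mean of the measured values.
    """
    if len(measured) != len(simulated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values have zero variance")
    return 1.0 - ss_res / ss_tot
```

Under this definition a perfect simulation yields exactly 1, and a simulation that only predicts the mean of the measurements yields 0.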
Tensegrity robots, which are composed of rigid compressive elements (rods) and flexible tensile elements (e.g., cables), have a variety of advantages, including flexibility, light weight, and resistance to mechanical impact. Nevertheless, the hybrid soft-rigid nature of these robots also makes it harder to localize them and track their state. This work aims to address what has been recognized as a grand challenge in this domain, i.e., the pose tracking of tensegrity robots through a markerless, vision-based method, as well as novel, onboard sensors that can measure the length of the robot's cables. In particular, an iterative optimization process is proposed to estimate the 6-DoF poses of each rigid element of a tensegrity robot from an RGB-D video as well as endcap distance measurements from the cable sensors. To ensure the pose estimates of rigid elements are physically feasible, i.e., they do not result in collisions between rods or with the environment, physical constraints are introduced during the optimization. Real-world experiments are performed with a 3-bar tensegrity robot, which performs locomotion gaits. Given ground truth data from a motion capture system, the proposed method achieves less than 1 cm translation error and 3 degrees rotation error, which significantly outperforms alternatives. At the same time, the approach can provide pose estimates throughout the robot's motion, while motion capture often fails due to occlusions.
To address the problem that traditional network traffic anomaly detection algorithms do not sufficiently mine potential features in the long time domain, an anomaly detection method based on multi-scale residual features of network traffic is proposed. The original traffic is divided into subsequences of different time spans using sliding windows, and each subsequence is decomposed and reconstructed into data sequences of different levels using the wavelet transform; a stacked autoencoder (SAE) constructs a similar feature space from normal network traffic and generates a reconstructed error vector from the difference between the reconstructed samples and the input samples in that feature space; a multi-path residual group is used to learn the reconstructed error. The traffic classification is completed by a lightweight classifier. The experimental results show that the detection performance of the proposed method for anomalous network traffic is significantly improved compared with traditional methods; they also confirm that a longer time span and more S transformation scales have positive effects on discovering potential diversity information in the original network traffic.
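The first two steps described above (sliding-window segmentation, then wavelet decomposition and reconstruction) can be sketched as follows. The abstract does not name the wavelet family, window widths, or step sizes; the Haar wavelet, the single decomposition level, and all names and values below are assumptions chosen for illustration.

```python
import math


def sliding_windows(series, width, step):
    """Split a sequence into overlapping subsequences of a fixed width."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]


def haar_decompose(window):
    """One level of the (assumed) Haar wavelet transform.

    Returns the approximation and detail coefficients of an even-length window.
    """
    if len(window) % 2 != 0:
        raise ValueError("window length must be even")
    s = math.sqrt(2.0)
    approx = [(window[i] + window[i + 1]) / s for i in range(0, len(window), 2)]
    detail = [(window[i] - window[i + 1]) / s for i in range(0, len(window), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert one Haar level, recovering the original window."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


# Hypothetical per-interval traffic volumes.
traffic = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
windows = sliding_windows(traffic, width=4, step=2)
approx, detail = haar_decompose(windows[0])
restored = haar_reconstruct(approx, detail)
```

Running the segmentation with several widths would give the subsequences of different time spans the abstract mentions, and applying the decomposition recursively to the approximation coefficients would give further levels.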
Document-level Event Causality Identification (DECI) aims to identify causal relations between event pairs in a document. It poses the great challenge of cross-sentence reasoning without clear causal indicators. In this paper, we propose a novel Event Relational Graph TransfOrmer (ERGO) framework for DECI, which improves on existing state-of-the-art (SOTA) methods in two aspects. First, we formulate DECI as a node classification problem by constructing an event relational graph, without the need for prior knowledge or tools. Second, ERGO seamlessly integrates event-pair relation classification and global inference, leveraging a Relational Graph Transformer (RGT) to capture the potential causal chain. In addition, we introduce edge-building strategies and an adaptive focal loss to deal with the massive false positives caused by common spurious correlations. Extensive experiments on two benchmark datasets show that ERGO significantly outperforms previous SOTA methods (13.1% F1 gains on average). We have conducted extensive quantitative analysis and case studies to provide insights for future research directions (Section 4.8).
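To show why a focal-style loss helps against confident false positives, here is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). This is the plain, non-adaptive form; the abstract does not specify how ERGO makes it adaptive, and the function name and default values of `gamma` and `alpha` are assumptions.

```python
import math


def focal_loss(prob_positive, label, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one example (not ERGO's adaptive variant).

    prob_positive: predicted probability of the positive (causal) class.
    label: 1 for a causal pair, 0 for a non-causal pair.
    """
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A non-causal pair predicted as causal with high confidence (a false positive)
# is penalized far more than a non-causal pair that is already handled well.
hard_false_positive = focal_loss(0.9, label=0)
easy_negative = focal_loss(0.1, label=0)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the role of `gamma` easy to isolate.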
Realistic visual media synthesis is becoming a critical societal issue with the surge of face manipulation models; new forgery approaches emerge at an unprecedented pace. Unfortunately, existing forgery detection methods suffer significant performance drops when applied to novel forgery approaches. In this work, we address the few-shot forgery detection problem by designing a comprehensive benchmark based on coverage analysis among various forgery approaches, and by proposing Guided Adversarial Interpolation (GAI). Our key insight is that there exist transferable distribution characteristics among different forgery approaches with the majority and minority classes. Specifically, we enhance the discriminative ability against novel forgery approaches by adversarially interpolating the artifacts of the minority samples into the majority samples under the guidance of a teacher network. Unlike the standard re-balancing method, which usually results in over-fitting to minority classes, our method simultaneously takes into account the diversity of majority information and the significance of minority information. Extensive experiments demonstrate that our GAI achieves state-of-the-art performance on the established few-shot forgery detection benchmark. Notably, our method is also validated to be robust to the choice of majority and minority forgery approaches.
In this paper, we propose an effective yet efficient model, PAIE, for both sentence-level and document-level Event Argument Extraction (EAE), which also generalizes well when there is a lack of training data. On the one hand, PAIE utilizes prompt tuning for extractive objectives to take full advantage of Pre-trained Language Models (PLMs). It introduces two span selectors based on the prompt to select start/end tokens among input texts for each role. On the other hand, it captures argument interactions via multi-role prompts and conducts joint optimization with optimal span assignments via a bipartite matching loss. Also, with a flexible prompt design, PAIE can extract multiple arguments with the same role instead of relying on conventional heuristic threshold tuning. We have conducted extensive experiments on three benchmarks, including both sentence- and document-level EAE. The results present promising improvements from PAIE (3.5% and 2.3% F1 gains on average on three benchmarks, for PAIE-base and PAIE-large respectively). Further analysis demonstrates the efficiency, generalization to few-shot settings, and effectiveness of different extractive prompt tuning strategies. Our code is available at https://github.com/mayubo2333/PAIE.
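The idea of optimal span assignment via bipartite matching can be sketched on a toy case: when several predicted spans share a role, each is paired with a distinct gold span so that the total cost is minimal. The cost function (L1 distance between start and end indices), the brute-force search, and all names below are assumptions for illustration, not PAIE's actual loss; a real implementation would typically use the Hungarian algorithm instead of enumerating permutations.

```python
from itertools import permutations


def span_cost(pred, gold):
    """Assumed cost: L1 distance between (start, end) token indices."""
    return abs(pred[0] - gold[0]) + abs(pred[1] - gold[1])


def best_assignment(pred_spans, gold_spans):
    """Brute-force minimum-cost one-to-one matching.

    Assumes both lists have the same length; only practical for a handful of spans.
    Returns (assignment, cost), where assignment[i] is the gold index for prediction i.
    """
    if len(pred_spans) != len(gold_spans):
        raise ValueError("this sketch assumes equally many predicted and gold spans")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(gold_spans))):
        cost = sum(span_cost(pred_spans[i], gold_spans[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Two predictions for the same role, listed in a different order than the gold spans.
pred = [(10, 12), (3, 4)]
gold = [(3, 5), (10, 12)]
assignment, cost = best_assignment(pred, gold)
```

Matching before computing the loss means the model is not penalized for emitting correct spans in a different order than the annotation.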
In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task, as panoramic 3D cameras often produce 360° depth with missing data in complex scenes. Its goal is to recover dense panoramic depths from raw sparse ones and panoramic RGB images. To deal with the PDC task, we train a deep network that takes both depth and image as inputs for dense panoramic depth recovery. However, training faces a challenging optimization problem over the network parameters due to the non-convex objective function. To address this problem, we propose a simple yet effective approach termed M³PT: multi-modal masked pre-training. Specifically, during pre-training, we simultaneously cover up patches of the panoramic RGB image and the sparse depth with a shared random mask, then reconstruct the sparse depth in the masked regions. To the best of our knowledge, this is the first time that the effectiveness of masked pre-training has been shown in a multi-modal vision task, instead of the single-modal task resolved by masked autoencoders (MAE). Different from MAE, where fine-tuning completely discards the decoder part of pre-training, there is no architectural difference between the pre-training and fine-tuning stages in our M³PT, as they differ only in the prediction density, which potentially makes transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M³PT on three panoramic datasets. Notably, we improve the state-of-the-art baselines by an average of 26.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog on three benchmark datasets. Code and pre-trained models are available at https://github.com/anonymoustbd/MMMPT.
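The shared random mask described above can be sketched as follows: one mask is drawn per sample and applied to the RGB patches and the sparse-depth patches alike, so both modalities are hidden at the same locations. The patch representation (flat lists), the 0.75 mask ratio, the zero fill value, and all function names are assumptions for illustration and are not taken from the M³PT code.

```python
import random


def shared_patch_mask(num_patches, mask_ratio, seed=None):
    """Draw one boolean mask (True = masked) to be reused by every modality."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def apply_mask(patches, mask, fill=0.0):
    """Replace every masked patch by a patch of the same size filled with `fill`."""
    return [[fill] * len(p) if m else list(p) for p, m in zip(patches, mask)]


# Hypothetical toy input: 8 patches, 3 values per RGB patch, 1 value per depth patch.
rgb_patches = [[i + 1.0] * 3 for i in range(8)]
depth_patches = [[i + 1.0] for i in range(8)]

mask = shared_patch_mask(num_patches=8, mask_ratio=0.75, seed=0)
rgb_masked = apply_mask(rgb_patches, mask)
depth_masked = apply_mask(depth_patches, mask)
```

During pre-training, the reconstruction target would then be the sparse depth at the positions where `mask` is True.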
2D convolution (Conv2d), which is responsible for extracting features from the input image, is one of the key modules of a convolutional neural network (CNN). However, Conv2d is vulnerable to image corruptions and adversarial samples. Whether we can design a more robust alternative to Conv2d for more reliable feature extraction is an important yet rarely investigated problem. In this paper, inspired by the recently developed learnable sparse transform, which learns to convert CNN features into a compact and sparse latent space, we design a novel building block, denoted by RConv-MK, to strengthen the robustness of extracted convolutional features. Our method leverages a set of learnable kernels of different sizes to extract features at different frequencies and employs a normalized soft thresholding operator to adaptively remove noise and trivial features at different corruption levels. Extensive experiments on clean images, corrupted images, and adversarial samples validate the effectiveness of the proposed robust module for reliable visual recognition. The source code is enclosed in the submission.
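The soft thresholding operator mentioned above has a simple standard form, S(x, tau) = sign(x) * max(|x| - tau, 0), which zeroes small responses and shrinks large ones toward zero. The sketch below shows only this standard operator with a fixed threshold; the normalization and the way the threshold adapts in RConv-MK are not specified in the abstract, and the names and values are assumptions.

```python
def soft_threshold(x, tau):
    """Standard soft thresholding: shrink x toward zero by tau, clamping at zero."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def denoise_features(features, tau):
    """Apply soft thresholding elementwise to a flat list of feature responses."""
    return [soft_threshold(v, tau) for v in features]


# Hypothetical feature responses: small values are treated as noise and removed.
features = [2.0, -1.5, 0.25, -0.125]
cleaned = denoise_features(features, tau=0.5)
```

A larger `tau` removes more of the weak responses, which is the intuition behind adapting the threshold to the corruption level.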