Abstract:Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fast inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.




Abstract:This paper investigates the optimization of the Vehicle Routing Problem for Security Dispatch (VRPSD). VRPSD focuses on security and patrolling applications which involve challenging constraints including precise timing and strict time windows. We propose three algorithms based on different metaheuristics, which are Adaptive Large Neighborhood Search (ALNS), Tabu Search (TS), and Threshold Accepting (TA). The first algorithm combines single-phase ALNS with TA, the second employs a multiphase ALNS with TA, and the third integrates multiphase ALNS, TS, and TA. Experiments are conducted on an instance comprising 251 customer requests. The results demonstrate that the third algorithm, the hybrid multiphase ALNS-TS-TA algorithm, delivers the best performance. This approach simultaneously leverages the large-area search capabilities of ALNS for exploration and effectively escapes local optima when the multiphase ALNS is coupled with TS and TA. Furthermore, in our experiments, the hybrid multiphase ALNS-TS-TA algorithm is the only one that shows potential for improving results with increased computation time across all attempts.
Abstract:Intelligent transportation systems (ITS) use advanced technologies such as artificial intelligence to significantly improve traffic flow management efficiency, and promote the intelligent development of the transportation industry. However, if the data in ITS is attacked, such as tampering or forgery, it will endanger public safety and cause social losses. Therefore, this paper proposes a watermarking that can verify the integrity of copyright in response to the needs of ITS, termed ITSmark. ITSmark focuses on functions such as extracting watermarks, verifying permission, and tracing tampered locations. The scheme uses the copyright information to build the multi-bit space and divides this space into multiple segments. These segments will be assigned to tokens. Thus, the next token is determined by its segment which contains the copyright. In this way, the obtained data contains the custom watermark. To ensure the authorization, key parameters are encrypted during copyright embedding to obtain cipher data. Only by possessing the correct cipher data and private key, can the user entirely extract the watermark. Experiments show that ITSmark surpasses baseline performances in data quality, extraction accuracy, and unforgeability. It also shows unique capabilities of permission verification and tampered location tracing, which ensures the security of extraction and the reliability of copyright verification. Furthermore, ITSmark can also customize the watermark embedding position and proportion according to user needs, making embedding more flexible.




Abstract:With the widespread use of social media, user-generated content has surged on online platforms. When such content includes hateful, abusive, offensive, or cyberbullying behavior, it is classified as toxic speech, posing a significant threat to the online ecosystem's integrity and safety. While manual content moderation is still prevalent, the overwhelming volume of content and the psychological strain on human moderators underscore the need for automated toxic speech detection. Previously proposed detection methods often rely on large annotated datasets; however, acquiring such datasets is both costly and challenging in practice. To address this issue, we propose an uncertainty-guided firewall for toxic speech in few-shot scenarios, U-GIFT, that utilizes self-training to enhance detection performance even when labeled data is limited. Specifically, U-GIFT combines active learning with Bayesian Neural Networks (BNNs) to automatically identify high-quality samples from unlabeled data, prioritizing the selection of pseudo-labels with higher confidence for training based on uncertainty estimates derived from model predictions. Extensive experiments demonstrate that U-GIFT significantly outperforms competitive baselines in few-shot detection scenarios. In the 5-shot setting, it achieves a 14.92\% performance improvement over the basic model. Importantly, U-GIFT is user-friendly and adaptable to various pre-trained language models (PLMs). It also exhibits robust performance in scenarios with sample imbalance and cross-domain settings, while showcasing strong generalization across various language applications. We believe that U-GIFT provides an efficient solution for few-shot toxic speech detection, offering substantial support for automated content moderation in cyberspace, thereby acting as a firewall to promote advancements in cybersecurity.
Abstract:With the evolution of generative linguistic steganography techniques, conventional steganalysis falls short in robustly quantifying the alterations induced by steganography, thereby complicating detection. Consequently, the research paradigm has pivoted towards deep-learning-based linguistic steganalysis. This study offers a comprehensive review of existing contributions and evaluates prevailing developmental trajectories. Specifically, we first provided a formalized exposition of the general formulas for linguistic steganalysis, while comparing the differences between this field and the domain of text classification. Subsequently, we classified the existing work into two levels based on vector space mapping and feature extraction models, thereby comparing the research motivations, model advantages, and other details. A comparative analysis of the experiments is conducted to assess the performances. Finally, the challenges faced by this field are discussed, and several directions for future development and key issues that urgently need to be addressed are proposed.




Abstract:With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a "black box." This paper introduces a novel type of loss function, termed "Entropy Loss," along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47\% compared to models without Entropy Loss, underscoring the method's efficacy. The implementation code is available at \url{https://github.com/yhbcode000/Eloss-Interpretability}.
Abstract:Most existing robotic datasets capture static scene data and thus are limited in evaluating robots' dynamic performance. To address this, we present a mobile robot oriented large-scale indoor dataset, denoted as THUD (Tsinghua University Dynamic) robotic dataset, for training and evaluating their dynamic scene understanding algorithms. Specifically, the THUD dataset construction is first detailed, including organization, acquisition, and annotation methods. It comprises both real-world and synthetic data, collected with a real robot platform and a physical simulation platform, respectively. Our current dataset includes 13 larges-scale dynamic scenarios, 90K image frames, 20M 2D/3D bounding boxes of static and dynamic objects, camera poses, and IMU. The dataset is still continuously expanding. Then, the performance of mainstream indoor scene understanding tasks, e.g. 3D object detection, semantic segmentation, and robot relocalization, is evaluated on our THUD dataset. These experiments reveal serious challenges for some robot scene understanding tasks in dynamic scenes. By sharing this dataset, we aim to foster and iterate new mobile robot algorithms quickly for robot actual working dynamic environment, i.e. complex crowded dynamic scenes.




Abstract:To detect stego (steganographic text) in complex scenarios, linguistic steganalysis (LS) with various motivations has been proposed and achieved excellent performance. However, with the development of generative steganography, some stegos have strong concealment, especially after the emergence of LLMs-based steganography, the existing LS has low detection or even cannot detect them. We designed a novel LS with two modes called LSGC. In the generation mode, we created an LS-task "description" and used the generation ability of LLM to explain whether texts to be detected are stegos. On this basis, we rethought the principle of LS and LLMs, and proposed the classification mode. In this mode, LSGC deleted the LS-task "description" and changed the "causalLM" LLMs to the "sequenceClassification" architecture. The LS features can be extracted by only one pass of the model, and a linear layer with initialization weights is added to obtain the classification probability. Experiments on strongly concealed stegos show that LSGC significantly improves detection and reaches SOTA performance. Additionally, LSGC in classification mode greatly reduces training time while maintaining high performance.
Abstract:Considering data insufficiency in metal additive manufacturing (AM), transfer learning (TL) has been adopted to extract knowledge from source domains (e.g., completed printings) to improve the modeling performance in target domains (e.g., new printings). Current applications use all accessible source data directly in TL with no regard to the similarity between source and target data. This paper proposes a systematic method to find appropriate subsets of source data based on similarities between the source and target datasets for a given set of limited target domain data. Such similarity is characterized by the spatial and model distance metrics. A Pareto frontier-based source data selection method is developed, where the source data located on the Pareto frontier defined by two similarity distance metrics are selected iteratively. The method is integrated into an instance-based TL method (decision tree regression model) and a model-based TL method (fine-tuned artificial neural network). Both models are then tested on several regression tasks in metal AM. Comparison results demonstrate that 1) the source data selection method is general and supports integration with various TL methods and distance metrics, 2) compared with using all source data, the proposed method can find a small subset of source data from the same domain with better TL performance in metal AM regression tasks involving different processes and machines, and 3) when multiple source domains exist, the source data selection method could find the subset from one source domain to obtain comparable or better TL performance than the model constructed using data from all source domains.
Abstract:Accurately predicting the temperature field in metal additive manufacturing (AM) processes is critical to preventing overheating, adjusting process parameters, and ensuring process stability. While physics-based computational models offer precision, they are often time-consuming and unsuitable for real-time predictions and online control in iterative design scenarios. Conversely, machine learning models rely heavily on high-quality datasets, which can be costly and challenging to obtain within the metal AM domain. Our work addresses this by introducing a physics-informed neural network framework specifically designed for temperature field prediction in metal AM. This framework incorporates a physics-informed input, physics-informed loss function, and a Convolutional Long Short-Term Memory (ConvLSTM) architecture. Utilizing real-time temperature data from the process, our model predicts 2D temperature fields for future timestamps across diverse geometries, deposition patterns, and process parameters. We validate the proposed framework in two scenarios: full-field temperature prediction for a thin wall and 2D temperature field prediction for cylinder and cubic parts, demonstrating errors below 3% and 1%, respectively. Our proposed framework exhibits the flexibility to be applied across diverse scenarios with varying process parameters, geometries, and deposition patterns.