The Transformer is a highly successful deep learning model that has revolutionised the world of artificial neural networks, first in natural language processing and later in computer vision. This model is based on the attention mechanism and is able to capture complex semantic relationships between a variety of patterns present in the input data. Precisely because of these characteristics, the Transformer has recently been exploited for time series forecasting problems, assuming a natural adaptability to the domain of continuous numerical series. Despite the acclaimed results in the literature, some works have raised doubts about the robustness and effectiveness of this approach. In this paper, we further investigate the effectiveness of Transformer-based models applied to the domain of time series forecasting, demonstrate their limitations, and propose a set of alternative models that are better performing and significantly less complex. In particular, we empirically show how simplifying Transformer-based forecasting models almost always leads to an improvement, reaching state of the art performance. We also propose shallow models without the attention mechanism, which compete with the overall state of the art in long time series forecasting, and demonstrate their ability to accurately predict time series over extremely long windows. From a methodological perspective, we show how it is always necessary to use a simple baseline to verify the effectiveness of proposed models, and finally, we conclude the paper with a reflection on recent research paths and the opportunity to follow trends and hypes even where it may not be necessary.
A kriging-random forest hybrid model is developed for real-time ground property prediction ahead of the earth pressure balanced shield by integrating Kriging extrapolation and random forest, which can guide shield operating parameter selection thereby mitigate construction risks. The proposed KRF algorithm synergizes two types of information: prior information and real-time information. The previously predicted ground properties with EPB operating parameters are extrapolated via the Kriging algorithm to provide prior information for the prediction of currently being excavated ground properties. The real-time information refers to the real-time operating parameters of the EPB shield, which are input into random forest to provide a real-time prediction of ground properties. The integration of these two predictions is achieved by assigning weights to each prediction according to their uncertainties, ensuring the prediction of KRF with minimum uncertainty. The performance of the KRF algorithm is assessed via a case study of the Changsha Metro Line 4 project. It reveals that the proposed KRF algorithm can predict ground properties with an accuracy of 93%, overperforming the existing algorithms of LightGBM, AdaBoost-CART, and DNN by 29%, 8%, and 12%, respectively. Another dataset from Shenzhen Metro Line 13 project is utilized to further evaluate the model generalization performance, revealing that the model can transfer its learned knowledge from one region to another with an accuracy of 89%.
Relative localization is crucial for multi-robot systems to perform cooperative tasks, especially in GPS-denied environments. Current techniques for multi-robot relative localization rely on expensive or short-range sensors such as cameras and LIDARs. As a result, these algorithms face challenges such as high computational complexity, dependencies on well-structured environments, etc. To overcome these limitations, we propose a new distributed approach to perform relative localization using a Gaussian Processes map of the Radio Signal Strength Indicator (RSSI) values from a single wireless Access Point (AP) to which the robots are connected. Our approach, Gaussian Processes-based Relative Localization (GPRL), combines two pillars. First, the robots locate the AP w.r.t. their local reference frames using novel hierarchical inferencing that significantly reduces computational complexity. Secondly, the robots obtain relative positions of neighbor robots with an AP-oriented vector transformation. The approach readily applies to resource-constrained devices and relies only on the ubiquitously-available RSSI measurement. We extensively validate the performance of the two pillars of the proposed GRPL in Robotarium simulations. We also demonstrate the applicability of GPRL through a multi-robot rendezvous task with a team of three real-world robots. The results demonstrate that GPRL outperformed state-of-the-art approaches regarding accuracy, computation, and real-time performance.
This work addresses human intention identification during physical Human-Robot Interaction (pHRI) tasks to include this information in an assistive controller. To this purpose, human intention is defined as the desired trajectory that the human wants to follow over a finite rolling prediction horizon so that the robot can assist in pursuing it. This work investigates a Recurrent Neural Network (RNN), specifically, Long-Short Term Memory (LSTM) cascaded with a Fully Connected layer. In particular, we propose an iterative training procedure to adapt the model. Such an iterative procedure is powerful in reducing the prediction error. Still, it has the drawback that it is time-consuming and does not generalize to different users or different co-manipulated objects. To overcome this issue, Transfer Learning (TL) adapts the pre-trained model to new trajectories, users, and co-manipulated objects by freezing the LSTM layer and fine-tuning the last FC layer, which makes the procedure faster. Experiments show that the iterative procedure adapts the model and reduces prediction error. Experiments also show that TL adapts to different users and to the co-manipulation of a large object. Finally, to check the utility of adopting the proposed method, we compare the proposed controller enhanced by the intention prediction with the other two standard controllers of pHRI.
Deep neural networks (DNN) have become a common sensing modality in autonomous systems as they allow for semantically perceiving the ambient environment given input images. Nevertheless, DNN models have proven to be vulnerable to adversarial digital and physical attacks. To mitigate this issue, several detection frameworks have been proposed to detect whether a single input image has been manipulated by adversarial digital noise or not. In our prior work, we proposed a real-time detector, called VisionGuard (VG), for adversarial physical attacks against single input images to DNN models. Building upon that work, we propose VisionGuard* (VG), which couples VG with majority-vote methods, to detect adversarial physical attacks in time-series image data, e.g., videos. This is motivated by autonomous systems applications where images are collected over time using onboard sensors for decision-making purposes. We emphasize that majority-vote mechanisms are quite common in autonomous system applications (among many other applications), as e.g., in autonomous driving stacks for object detection. In this paper, we investigate, both theoretically and experimentally, how this widely used mechanism can be leveraged to enhance the performance of adversarial detectors. We have evaluated VG* on videos of both clean and physically attacked traffic signs generated by a state-of-the-art robust physical attack. We provide extensive comparative experiments against detectors that have been designed originally for out-of-distribution data and digitally attacked images.
Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance student's OOD generalization: (1) by better imitating teacher's visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher's language representations with informative and finegrained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate their techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Code released at https://github.com/xuanlinli17/large_vlm_distillation_ood
Procedurally generating cities in Minecraft provides players more diverse scenarios and could help understand and improve the design of cities in other digital worlds and the real world. This paper presents a city generator that was submitted as an entry to the 2023 Edition of Minecraft Settlement Generation Competition for Minecraft. The generation procedure is composed of six main steps, namely vegetation clearing, terrain reshaping, building layout generation, route planning, streetlight placement, and wall construction. Three algorithms, including a heuristic-based algorithm, an evolving layout algorithm, and a random one are applied to generate the building layout, thus determining where to place different redstone style buildings, and tested by generating cities on random maps in limited time. Experimental results show that the heuristic-based algorithm is capable of finding an acceptable building layout faster for flat maps, while the evolving layout algorithm performs better in evolving layout for rugged maps. A user study is conducted to compare our generator with outstanding entries of the competition's 2022 edition using the competition's evaluation criteria and shows that our generator performs well in the adaptation and functionality criteria
Visual (re)localization is critical for various applications in computer vision and robotics. Its goal is to estimate the 6 degrees of freedom (DoF) camera pose for each query image, based on a set of posed database images. Currently, all leading solutions are structure-based that either explicitly construct 3D metric maps from the database with structure-from-motion, or implicitly encode the 3D information with scene coordinate regression models. On the contrary, visual localization without reconstructing the scene in 3D offers clear benefits. It makes deployment more convenient by reducing database pre-processing time, releasing storage requirements, and remaining unaffected by imperfect reconstruction, etc. In this technical report, we demonstrate that it is possible to achieve high localization accuracy without reconstructing the scene from the database. The key to achieving this owes to a tailored motion averaging over database-query pairs. Experiments show that our visual localization proposal, LazyLoc, achieves comparable performance against state-of-the-art structure-based methods. Furthermore, we showcase the versatility of LazyLoc, which can be easily extended to handle complex configurations such as multi-query co-localization and camera rigs.
Vision systems mounted on home robots need to interact with unseen classes in changing environments. Robots have limited computational resources, labelled data and storage capability. These requirements pose some unique challenges: models should adapt without forgetting past knowledge in a data- and parameter-efficient way. We characterize the problem as few-shot (FS) online continual learning (OCL), where robotic agents learn from a non-repeated stream of few-shot data updating only a few model parameters. Additionally, such models experience variable conditions at test time, where objects may appear in different poses (e.g., horizontal or vertical) and environments (e.g., day or night). To improve robustness of CL agents, we propose RobOCLe, which; 1) constructs an enriched feature space computing high order statistical moments from the embedded features of samples; and 2) computes similarity between high order statistics of the samples on the enriched feature space, and predicts their class labels. We evaluate robustness of CL models to train/test augmentations in various cases. We show that different moments allow RobOCLe to capture different properties of deformations, providing higher robustness with no decrease of inference speed.
The research field of Information Retrieval (IR) has evolved significantly, expanding beyond traditional search to meet diverse user information needs. Recently, Large Language Models (LLMs) have demonstrated exceptional capabilities in text understanding, generation, and knowledge inference, opening up exciting avenues for IR research. LLMs not only facilitate generative retrieval but also offer improved solutions for user understanding, model evaluation, and user-system interactions. More importantly, the synergistic relationship among IR models, LLMs, and humans forms a new technical paradigm that is more powerful for information seeking. IR models provide real-time and relevant information, LLMs contribute internal knowledge, and humans play a central role of demanders and evaluators to the reliability of information services. Nevertheless, significant challenges exist, including computational costs, credibility concerns, domain-specific limitations, and ethical considerations. To thoroughly discuss the transformative impact of LLMs on IR research, the Chinese IR community conducted a strategic workshop in April 2023, yielding valuable insights. This paper provides a summary of the workshop's outcomes, including the rethinking of IR's core values, the mutual enhancement of LLMs and IR, the proposal of a novel IR technical paradigm, and open challenges.