We present a method for deadlock-free and collision-free navigation in a multi-robot system with nonholonomic robots. The problem is solved by quadratic programming and is applicable to most wheeled mobile robots with linear kinematic constraints. We introduce masked velocity and Masked Cooperative Collision Avoidance (MCCA) algorithm to encourage a fully decentralized deadlock avoidance behavior. To verify the method, we provide a detailed implementation and introduce heading oscillation avoidance for differential-drive robots. To the best of our knowledge, it is the first method to give very promising and stable results for deadlock avoidance even in situations with a large number of robots and narrow passages.
Interactive Image Segmentation (IIS) has emerged as a promising technique for decreasing annotation time. Substantial progress has been made in pre- and post-processing for IIS, but the critical issue of interaction ambiguity notably hindering segmentation quality, has been under-researched. To address this, we introduce AdaptiveClick -- a clicks-aware transformer incorporating an adaptive focal loss, which tackles annotation inconsistencies with tools for mask- and pixel-level ambiguity resolution. To the best of our knowledge, AdaptiveClick is the first transformer-based, mask-adaptive segmentation framework for IIS. The key ingredient of our method is the Clicks-aware Mask-adaptive Transformer Decoder (CAMD), which enhances the interaction between clicks and image features. Additionally, AdaptiveClick enables pixel-adaptive differentiation of hard and easy samples in the decision space, independent of their varying distributions. This is primarily achieved by optimizing a generalized Adaptive Focal Loss (AFL) with a theoretical guarantee, where two adaptive coefficients control the ratio of gradient values for hard and easy pixels. Our analysis reveals that the commonly used Focal and BCE losses can be considered special cases of the proposed AFL loss. With a plain ViT backbone, extensive experimental results on nine datasets demonstrate the superiority of AdaptiveClick compared to state-of-the-art methods. Code will be publicly available at https://github.com/lab206/AdaptiveClick.
A semantic map of the road scene, covering fundamental road elements, is an essential ingredient in autonomous driving systems. It provides important perception foundations for positioning and planning when rendered in the Bird's-Eye-View (BEV). Currently, the prior knowledge of hypothetical depth can guide the learning of translating front perspective views into BEV directly with the help of calibration parameters. However, it suffers from geometric distortions in the representation of distant objects. In addition, another stream of methods without prior knowledge can learn the transformation between front perspective views and BEV implicitly with a global view. Considering that the fusion of different learning methods may bring surprising beneficial effects, we propose a Bi-Mapper framework for top-down road-scene semantic understanding, which incorporates a global view and local prior knowledge. To enhance reliable interaction between them, an asynchronous mutual learning strategy is proposed. At the same time, an Across-Space Loss (ASL) is designed to mitigate the negative impact of geometric distortions. Extensive results on nuScenes and Cam2BEV datasets verify the consistent effectiveness of each module in the proposed Bi-Mapper framework. Compared with exiting road mapping networks, the proposed Bi-Mapper achieves 5.0 higher IoU on the nuScenes dataset. Moreover, we verify the generalization performance of Bi-Mapper in a real-world driving scenario. Code will be available at https://github.com/lynn-yu/Bi-Mapper.
This paper proposes a neural network-based user simulator that can provide a multimodal interactive environment for training Reinforcement Learning (RL) agents in collaborative tasks involving multiple modes of communication. The simulator is trained on the existing ELDERLY-AT-HOME corpus and accommodates multiple modalities such as language, pointing gestures, and haptic-ostensive actions. The paper also presents a novel multimodal data augmentation approach, which addresses the challenge of using a limited dataset due to the expensive and time-consuming nature of collecting human demonstrations. Overall, the study highlights the potential for using RL and multimodal user simulators in developing and improving domestic assistive robots.
Robot assistants for older adults and people with disabilities need to interact with their users in collaborative tasks. The core component of these systems is an interaction manager whose job is to observe and assess the task, and infer the state of the human and their intent to choose the best course of action for the robot. Due to the sparseness of the data in this domain, the policy for such multi-modal systems is often crafted by hand; as the complexity of interactions grows this process is not scalable. In this paper, we propose a reinforcement learning (RL) approach to learn the robot policy. In contrast to the dialog systems, our agent is trained with a simulator developed by using human data and can deal with multiple modalities such as language and physical actions. We conducted a human study to evaluate the performance of the system in the interaction with a user. Our designed system shows promising preliminary results when it is used by a real user.
Recently, ChatGPT, along with DALL-E-2 and Codex,has been gaining significant attention from society. As a result, many individuals have become interested in related resources and are seeking to uncover the background and secrets behind its impressive performance. In fact, ChatGPT and other Generative AI (GAI) techniques belong to the category of Artificial Intelligence Generated Content (AIGC), which involves the creation of digital content, such as images, music, and natural language, through AI models. The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace. AIGC is achieved by extracting and understanding intent information from instructions provided by human, and generating the content according to its knowledge and the intent information. In recent years, large-scale models have become increasingly important in AIGC as they provide better intent extraction and thus, improved generation results. With the growth of data and the size of the models, the distribution that the model can learn becomes more comprehensive and closer to reality, leading to more realistic and high-quality content generation. This survey provides a comprehensive review on the history of generative models, and basic components, recent advances in AIGC from unimodal interaction and multimodal interaction. From the perspective of unimodality, we introduce the generation tasks and relative models of text and image. From the perspective of multimodality, we introduce the cross-application between the modalities mentioned above. Finally, we discuss the existing open problems and future challenges in AIGC.
Oriented normals are common pre-requisites for many geometric algorithms based on point clouds, such as Poisson surface reconstruction. However, it is not trivial to obtain a consistent orientation. In this work, we bridge orientation and reconstruction in implicit space and propose a novel approach to orient point clouds by incorporating isovalue constraints to the Poisson equation. Feeding a well-oriented point cloud into a reconstruction approach, the indicator function values of the sample points should be close to the isovalue. Based on this observation and the Poisson equation, we propose an optimization formulation that combines isovalue constraints with local consistency requirements for normals. We optimize normals and implicit functions simultaneously and solve for a globally consistent orientation. Owing to the sparsity of the linear system, an average laptop can be used to run our method within reasonable time. Experiments show that our method can achieve high performance in non-uniform and noisy data and manage varying sampling densities, artifacts, multiple connected components, and nested surfaces.
An important problem in microrobotics is how to control a large group of microrobots with a global control signal. This paper focuses on controlling a large-scale swarm of MicroStressBots with on-board physical finite-state machines. We introduce the concept of group-based control, which makes it possible to scale up the swarm size while reducing the complexity both of robot fabrication as well as swarm control. We prove that the group-based control system is locally accessible in terms of the robot positions. We further hypothesize based on extensive simulations that the system is globally controllable. A nonlinear optimization strategy is proposed to control the swarm by minimizing control effort. We also propose a probabilistically complete collision avoidance method that is suitable for online use. The paper concludes with an evaluation of the proposed methods in simulations.
Building footprints data is of importance in several urban applications and natural disaster management. In contrast to traditional surveying and mapping, using high spatial resolution aerial images, deep learning-based building footprints extraction methods can extract building footprints accurately and efficiently. With rapidly development of deep learning methods, it is hard for novice to harness the powerful tools in building footprints extraction. The paper aims at providing the whole process of building footprints extraction from high spatial resolution images using deep learning-based methods. In addition, we also compare the commonly used methods, including Fully Convolutional Networks (FCN)-8s, U-Net and DeepLabv3+. At the end of the work, we change the data size used in models training to explore the influence of data size to the performance of the algorithms. The experiments show that, in different data size, DeepLabv3+ is the best algorithm among them with the highest accuracy and moderate efficiency; FCN-8s has the worst accuracy and highest efficiency; U-Net shows the moderate accuracy and lowest efficiency. In addition, with more training data, algorithms converged faster with higher accuracy in extraction results.