The visible capability is critical in many robot applications, such as inspection and surveillance, etc. Without the assurance of the visibility to targets, some tasks end up not being complete or even failing. In this paper, we propose a visibility guaranteed planner by star-convex constrained optimization. The visible space is modeled as star convex polytope (SCP) by nature and is generated by finding the visible points directly on point cloud. By exploiting the properties of the SCP, the visibility constraint is formulated for trajectory optimization. The trajectory is confined in the safe and visible flight corridor which consists of convex polytopes and SCPs. We further make a relaxation to the visibility constraints and transform the constrained trajectory optimization problem into an unconstrained one that can be reliably and efficiently solved. To validate the capability of the proposed planner, we present the practical application in site inspection. The experimental results show that the method is efficient, scalable, and visibility guaranteed, presenting the prospect of application to various other applications in the future.
This paper presents a novel Region-Aware Face Swapping (RAFSwap) network to achieve identity-consistent harmonious high-resolution face generation in a local-global manner: \textbf{1)} Local Facial Region-Aware (FRA) branch augments local identity-relevant features by introducing the Transformer to effectively model misaligned cross-scale semantic interaction. \textbf{2)} Global Source Feature-Adaptive (SFA) branch further complements global identity-relevant cues for generating identity-consistent swapped faces. Besides, we propose a \textit{Face Mask Predictor} (FMP) module incorporated with StyleGAN2 to predict identity-relevant soft facial masks in an unsupervised manner that is more practical for generating harmonious high-resolution faces. Abundant experiments qualitatively and quantitatively demonstrate the superiority of our method for generating more identity-consistent high-resolution swapped faces over SOTA methods, \eg, obtaining 96.70 ID retrieval that outperforms SOTA MegaFS by 5.87$\uparrow$.
Mutual localization is essential for coordination and cooperation in multi-robot systems. Previous works have tackled this problem by assuming available correspondences between measurements and received odometry estimations, which are difficult to acquire, especially for unified robot teams. Furthermore, most local optimization methods ask for initial guesses and are sensitive to their quality. In this paper, we present a certifiably optimal algorithm that uses only anonymous bearing measurements to formulate a novel mixed-integer quadratically constrained quadratic problem (MIQCQP). Then, we relax the original nonconvex problem into a semidefinite programming (SDP) problem and obtain a certifiably global optimum using with off-the-shelf solvers. As a result, our method can determine bearing-pose correspondences and furthermore recover the initial relative poses between robots under a certain condition. We compare the performance with local optimization methods on extensive simulations under different noise levels to show our advantage in global optimality and robustness. Real-world experiments are conducted to show the practicality and robustness.
In this work, we propose an interoceptive-only state estimation system for a quadrotor with deep neural network processing, where the quadrotor dynamics is considered as a perceptive supplement of the inertial kinematics. To improve the precision of multi-sensor fusion, we train cascaded networks on real-world quadrotor flight data to learn IMU kinematic properties, quadrotor dynamic characteristics, and motion states of the quadrotor along with their uncertainty information, respectively. This encoded information empowers us to address the issues of IMU bias stability, dynamic constraints, and multi-sensor calibration during sensor fusion. The above multi-source information is fused into a two-stage Extended Kalman Filter (EKF) framework for better estimation. Experiments have demonstrated the advantages of our proposed work over several conventional and learning-based methods.
This paper presents a novel trajectory planning method for aerial perching. Compared with the existing work, the terminal states and the trajectory durations can be adjusted adaptively, instead of being determined in advance. Furthermore, our planner is able to minimize the tangential relative speed on the premise of safety and dynamic feasibility. This feature is especially notable on micro aerial robots with low maneuverability or scenarios where the space is not enough. Moreover, we design a flexible transformation strategy to eliminate terminal constraints along with reducing optimization variables. Besides, we take precise SE(3) motion planning into account to ensure that the drone would not touch the landing platform until the last moment. The proposed method is validated onboard by a palm-sized micro aerial robot with quite limited thrust and moment (thrust-to-weight ratio 1.7) perching on a mobile inclined surface. Sufficient experimental results show that our planner generates an optimal trajectory within 20ms, and replans with warm start in 2ms.
Understanding what objects could furnish for humans-namely, learning object affordance-is the crux to bridge perception and action. In the vision community, prior work primarily focuses on learning object affordance with dense (e.g., at a per-pixel level) supervision. In stark contrast, we humans learn the object affordance without dense labels. As such, the fundamental question to devise a computational model is: What is the natural way to learn the object affordance from visual appearance and geometry with humanlike sparse supervision? In this work, we present a new task of part-level affordance discovery (PartAfford): Given only the affordance labels per object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category. We propose a novel learning framework for PartAfford, which discovers part-level representations by leveraging only the affordance set supervision and geometric primitive regularization, without dense supervision. The proposed approach consists of two main components: (i) an abstraction encoder with slot attention for unsupervised clustering and abstraction, and (ii) an affordance decoder with branches for part reconstruction, affordance prediction, and cuboidal primitive regularization. To learn and evaluate PartAfford, we construct a part-level, cross-category 3D object affordance dataset, annotated with 24 affordance categories shared among >25, 000 objects. We demonstrate that our method enables both the abstraction of 3D objects and part-level affordance discovery, with generalizability to difficult and cross-category examples. Further ablations reveal the contribution of each component.
In the practical application of restoring low-resolution gray-scale images, we generally need to run three separate processes of image colorization, super-resolution, and dows-sampling operation for the target device. However, this pipeline is redundant and inefficient for the independent processes, and some inner features could have been shared. Therefore, we present an efficient paradigm to perform {S}imultaneously Image {C}olorization and {S}uper-resolution (SCS) and propose an end-to-end SCSNet to achieve this goal. The proposed method consists of two parts: colorization branch for learning color information that employs the proposed plug-and-play \emph{Pyramid Valve Cross Attention} (PVCAttn) module to aggregate feature maps between source and reference images; and super-resolution branch for integrating color and texture information to predict target images, which uses the designed \emph{Continuous Pixel Mapping} (CPM) module to predict high-resolution images at continuous magnification. Furthermore, our SCSNet supports both automatic and referential modes that is more flexible for practical application. Abundant experiments demonstrate the superiority of our method for generating authentic images over state-of-the-art methods, e.g., averagely decreasing FID by 1.8$\downarrow$ and 5.1 $\downarrow$ compared with current best scores for automatic and referential modes, respectively, while owning fewer parameters (more than $\times$2$\downarrow$) and faster running speed (more than $\times$3$\uparrow$).
Different from traditional convolutional neural network (CNN) and vision transformer, the multilayer perceptron (MLP) is a new kind of vision model with extremely simple architecture that only stacked by fully-connected layers. An input image of vision MLP is usually split into multiple tokens (patches), while the existing MLP models directly aggregate them with fixed weights, neglecting the varying semantic information of tokens from different images. To dynamically aggregate tokens, we propose to represent each token as a wave function with two parts, amplitude and phase. Amplitude is the original feature and the phase term is a complex value changing according to the semantic contents of input images. Introducing the phase term can dynamically modulate the relationship between tokens and fixed weights in MLP. Based on the wave-like token representation, we establish a novel Wave-MLP architecture for vision tasks. Extensive experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.