We present a novel method, called NeuralUDF, for reconstructing surfaces with arbitrary topologies from 2D images via volume rendering. Recent advances in neural-rendering-based reconstruction have achieved compelling results. However, these methods are limited to objects with closed surfaces, since they adopt the Signed Distance Function (SDF) as the surface representation, which requires the target shape to be divided into inside and outside. In this paper, we propose to represent surfaces as the Unsigned Distance Function (UDF) and develop a new volume rendering scheme to learn the neural UDF representation. Specifically, a new density function that correlates the property of UDF with the volume rendering scheme is introduced for robust optimization of the UDF fields. Experiments on the DTU and DeepFashion3D datasets show that our method not only enables high-quality reconstruction of non-closed shapes with complex topologies, but also achieves comparable performance to the SDF-based methods on the reconstruction of closed surfaces.
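To illustrate the SDF/UDF distinction drawn in the abstract above, here is a minimal sketch using analytic shapes rather than a learned network. The function names and the choice of a sphere and a flat disk are assumptions made for illustration; they are not part of NeuralUDF.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Signed distance to a closed sphere: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

def sphere_udf(points, radius=1.0):
    """Unsigned distance to the same sphere: no inside/outside distinction."""
    return np.abs(sphere_sdf(points, radius))

def open_disk_udf(points, radius=1.0):
    """Unsigned distance to an open flat disk lying in the z = 0 plane.

    The disk encloses no volume, so there is no inside/outside split to
    attach a sign to, but the unsigned distance is still well defined.
    """
    radial = np.linalg.norm(points[..., :2], axis=-1)
    beyond_rim = np.maximum(radial - radius, 0.0)  # in-plane distance past the rim
    return np.sqrt(beyond_rim**2 + points[..., 2] ** 2)

# Three query points: the origin, a point above the disk, a point outside both shapes.
pts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [2.0, 0.0, 0.0]])
sdf_values = sphere_sdf(pts)
udf_values = sphere_udf(pts)
disk_values = open_disk_udf(pts)
```

The sphere admits both representations, while the disk only admits the unsigned one, which is the kind of non-closed shape the abstract targets.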
Neural Radiance Fields (NeRF) have received considerable attention recently, due to their impressive capability in photo-realistic 3D reconstruction and novel view synthesis, given a set of posed camera images. Earlier work usually assumes the input images are of good quality. However, image degradation (e.g., motion blur in low-light conditions) can easily occur in real-world scenarios and degrades the rendering quality of NeRF. In this paper, we present a novel bundle-adjusted deblur Neural Radiance Fields (BAD-NeRF), which is robust to severely motion-blurred images and inaccurate camera poses. Our approach models the physical image formation process of a motion-blurred image, and jointly learns the parameters of NeRF and recovers the camera motion trajectories during exposure time. In experiments, we show that by directly modeling the real physical image formation process, BAD-NeRF achieves superior performance over prior works on both synthetic and real datasets.
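The image formation process mentioned in the abstract above can be sketched as an average of sharp images rendered along the camera trajectory during the exposure. The toy below is only an illustration under stated assumptions: `render_sharp` is a hypothetical stand-in for a renderer (the "pose" is an integer pixel shift), and linear interpolation between a start and end pose is assumed.

```python
import numpy as np

def render_sharp(scene, shift):
    """Hypothetical renderer: the camera 'pose' is a horizontal pixel shift."""
    return np.roll(scene, shift, axis=1)

def synthesize_blur(scene, start_shift, end_shift, n_samples):
    """Average sharp renders at poses interpolated across the exposure time."""
    # Assumption: poses are interpolated linearly between exposure start and end.
    shifts = np.linspace(start_shift, end_shift, n_samples).round().astype(int)
    frames = [render_sharp(scene, int(s)) for s in shifts]
    return np.mean(frames, axis=0)

# A single bright pixel smeared over four positions by the camera motion.
scene = np.zeros((1, 8))
scene[0, 2] = 1.0
blurred = synthesize_blur(scene, start_shift=0, end_shift=3, n_samples=4)
```

In a joint optimization, the start and end poses would be free variables updated together with the scene parameters so that the synthesized blur matches the observed blurry image.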
Hierarchical classification (HC) assigns each object multiple labels organized into a hierarchical structure. Existing deep-learning-based HC methods usually predict an instance starting from the root node until a leaf node is reached. However, in the real world, images degraded by noise, occlusion, blur, or low resolution may not provide sufficient information for classification at subordinate levels. To address this issue, we propose a novel semantic guided level-category hybrid prediction network (SGLCHPN) that can jointly perform the level and category prediction in an end-to-end manner. SGLCHPN comprises two modules: a visual transformer that extracts feature vectors from the input images, and a semantic guided cross-attention module that uses category word embeddings as queries to guide the learning of category-specific representations. To evaluate the proposed method, we construct two new datasets in which images span a broad range of quality and are thus labeled to different levels (depths) in the hierarchy according to their individual quality. Experimental results demonstrate the effectiveness of our proposed HC method.
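The cross-attention step described in the abstract above, with category word embeddings as queries over visual features, can be sketched as plain scaled dot-product attention. This is a minimal single-head sketch without learned projections; the shapes, the random inputs, and the function name are assumptions for illustration and do not reproduce the SGLCHPN architecture.

```python
import numpy as np

def cross_attention(category_queries, visual_tokens):
    """Scaled dot-product attention: categories attend over visual tokens."""
    dim = category_queries.shape[-1]
    scores = category_queries @ visual_tokens.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a category-specific mixture of the visual tokens.
    return weights @ visual_tokens, weights

rng = np.random.default_rng(0)
category_queries = rng.normal(size=(5, 16))   # 5 hypothetical category embeddings
visual_tokens = rng.normal(size=(49, 16))     # 49 hypothetical image patch features
category_features, attn = cross_attention(category_queries, visual_tokens)
```

Each category receives its own feature vector, which a downstream head could then score.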
Grasping with anthropomorphic robotic hands involves many more hand-object interactions than grasping with parallel-jaw grippers. Modeling hand-object interactions is essential to the study of dexterous manipulation with multi-finger hands. This work presents DVGG, an efficient grasp generation network that takes a single-view observation as input and predicts high-quality grasp configurations for unknown objects. Our generative model consists of three components: 1) point cloud completion for the target object based on the partial observation; 2) generation of diverse sets of grasps given the complete point cloud; 3) iterative grasp pose refinement for physically plausible grasp optimization. To train our model, we build a large-scale grasping dataset that contains about 300 common object models with 1.5M annotated grasps in simulation. Experiments in simulation show that our model can predict robust grasp poses with wide variety and a high success rate. Experiments on a real robot platform demonstrate that the model trained on our dataset performs well in the real world. Remarkably, our method achieves a grasp success rate of 70.7\% for novel objects on the real robot platform, which is a significant improvement over the baseline methods.
While multi-robot systems have been broadly researched and deployed, their success depends chiefly on network infrastructure, whether wired or wireless. As a first step toward decoupling the application of multi-robot systems from this reliance on network infrastructure, this paper proposes a human-friendly verbal communication platform for multi-robot systems, designed to be adaptable, transparent, and secure. The platform is network independent and can therefore function in environments that lack network infrastructure, from underwater to planetary exploration. A series of experiments was conducted to demonstrate the platform's capability in multi-robot communication and task coordination, showing its potential in infrastructure-free applications. To benefit the community, we have made the code open source at https://github.com/jynxmagic/MSc_AI_project
Security and safety are of paramount importance to human-robot interaction, whether for autonomous robots or human-robot collaborative manufacturing. The intertwined relationship between security and safety has imposed new challenges on the emerging digital-twin systems of various types of robots. Specifically, an attack on either the cyber-physical system or the digital-twin system could cause severe consequences for the other. In particular, an attack on a digital-twin system that is synchronized with a cyber-physical system could cause collateral damage to humans and surrounding facilities. This paper demonstrates that for Robot Operating System (ROS) driven systems, attacks such as a person-in-the-middle attack on the digital-twin system could eventually lead to a collapse of the cyber-physical system, whether it is an industrial robot or an autonomous mobile robot, causing unexpected consequences. We also discuss potential solutions to mitigate such attacks.
Sherds, as the most common artifacts uncovered during archaeological excavations, carry rich information about past human societies and so need to be accurately reconstructed and recorded digitally for analysis and preservation. Often hundreds of fragments are uncovered in a day at an archaeological excavation site, far beyond the scanning capacity of existing imaging systems. Hence, there is high demand for an image acquisition system capable of imaging hundreds of fragments per day. In response to this demand, we developed a new system, dubbed FIRES, for Fast Imaging and 3D REconstruction of Sherds. The FIRES system consists of two main components. The first is an optimally designed fast image acquisition device capable of capturing over 700 sherds per day (in 8 working hours) in actual tests at an excavation site, which is one order of magnitude faster than existing systems. The second component is an automatic pipeline for 3D reconstruction of the sherds from the images captured by the image acquisition system, achieving a reconstruction accuracy of 0.16 millimeters. The pipeline includes a novel batch matching algorithm that matches partial 3D scans of the front and back sides of the sherds and a new ICP-type method that registers the front and back sides, which share only very narrow overlapping regions. Extensive validation in labs and testing at excavation sites demonstrated that our FIRES system provides the first fast, accurate, portable, and cost-effective solution for the task of imaging and 3D reconstruction of sherds in archaeological excavations.
For dimensionality reduction on datasets with outliers, $\ell_1$-norm principal component analysis (L1-PCA), a typical robust alternative to conventional PCA, has enjoyed great popularity over the past years. In this work, we consider a rotationally invariant L1-PCA, which has hardly been studied in the literature. To tackle it, we propose a proximal alternating linearized minimization method with a nonlinear extrapolation for solving its two-block reformulation. Moreover, we show that the proposed method converges at least linearly to a limiting critical point of the reformulated problem. Such a point is proved to be a critical point of the original problem under a condition imposed on the step size. Finally, we conduct numerical experiments on both synthetic and real datasets to support our theoretical developments and demonstrate the efficacy of our approach.
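The abstract above does not spell out the objective, so the sketch below assumes the commonly used rotationally invariant form: the sum over samples of the (unsquared) $\ell_2$ norm of each residual after projection onto an orthonormal basis. The code only evaluates this assumed objective and checks its rotational invariance numerically; it does not implement the proposed proximal alternating linearized minimization method.

```python
import numpy as np

def ri_l1_pca_objective(data, basis):
    """Sum of per-sample l2 residual norms after projecting onto span(basis).

    data:  (n_samples, dim) array, one sample per row.
    basis: (dim, k) array with orthonormal columns.
    """
    residual = data - data @ basis @ basis.T
    return np.linalg.norm(residual, axis=1).sum()

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))      # orthonormal 2D subspace
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # random orthogonal matrix

# Rotating every sample and the basis by the same orthogonal matrix
# should leave the objective unchanged.
original = ri_l1_pca_objective(data, basis)
rotated = ri_l1_pca_objective(data @ rotation.T, rotation @ basis)
```

Because each residual enters through its unsquared norm, a single outlier contributes linearly rather than quadratically, which is the usual motivation for robustness.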
When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, for each class the within-class features converge to their means, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's classifier. As feature normalization in the last layer becomes a common practice in modern representation learning, in this work we theoretically justify the neural collapse phenomenon for normalized features. Based on an unconstrained feature model, we simplify the empirical loss function in a multi-class classification task into a nonconvex optimization problem over the Riemannian manifold by constraining all features and classifiers over the sphere. In this context, we analyze the nonconvex landscape of the Riemannian optimization problem over the product of spheres, showing a benign global landscape in the sense that the only global minimizers are the neural collapse solutions while all other critical points are strict saddles with negative curvature. Experimental results on practical deep networks corroborate our theory and demonstrate that better representations can be learned faster via feature normalization.
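The tight frame structure of the class means referred to in the abstract above is usually described in the neural collapse literature as a simplex equiangular tight frame (ETF). A minimal sketch, assuming that standard construction, builds one and checks that all vectors have unit norm and equal pairwise inner products of $-1/(K-1)$. The function name is illustrative.

```python
import numpy as np

def simplex_etf(num_classes):
    """Columns form a simplex ETF: unit norm, pairwise inner product -1/(K-1)."""
    k = num_classes
    centered = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * centered

class_means = simplex_etf(4)
gram = class_means.T @ class_means          # pairwise inner products
off_diagonal = gram[~np.eye(4, dtype=bool)]
```

With normalized features, the class means lie on the sphere by construction, so the equal-angle condition is the remaining property to check.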
A novel scene text recognizer based on the Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer in the area of NLP, the proposed method (named Levenshtein OCR, or LevOCR for short) explores an alternative way of automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, so as to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations: deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change, and good interpretability. The quantitative experiments demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and the qualitative analyses verify the effectiveness and advantage of the proposed LevOCR algorithm. Code will be released soon.
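The deletion and insertion operations named in the abstract above can be sketched as one refinement pass over a draft character sequence. In the sketch below the edit decisions are hard-coded for illustration; in a learned system they would be predicted by the model. The function names and the example strings are hypothetical.

```python
def apply_deletion(chars, keep_mask):
    """Drop every character whose keep flag is False (all positions at once)."""
    return [c for c, keep in zip(chars, keep_mask) if keep]

def apply_insertion(chars, insertions):
    """Insert insertions[i] before chars[i]; the final entry goes at the end.

    insertions must have len(chars) + 1 entries, one per gap.
    """
    out = []
    for i, c in enumerate(chars):
        out.extend(insertions[i])
        out.append(c)
    out.extend(insertions[len(chars)])
    return out

# Hypothetical draft from a vision-only model, with one wrong character.
draft = list("hcllo")
after_delete = apply_deletion(draft, [True, False, True, True, True])
refined = apply_insertion(after_delete, ["", "e", "", "", ""])
result = "".join(refined)
```

Because each operation acts on all positions independently, both steps can be decoded in parallel, and the sequence length is free to change between passes.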