Clouds play a critical role in the Earth's energy budget and their potential changes are one of the largest uncertainties in future climate projections. However, the use of satellite observations to understand cloud feedbacks in a warming climate has been hampered by the simplicity of existing cloud classification schemes, which are based on single-pixel cloud properties rather than utilizing spatial structures and textures. Recent advances in computer vision enable the grouping of different patterns of images without using human-predefined labels, providing a novel means of automated cloud classification. This unsupervised learning approach allows discovery of unknown climate-relevant cloud patterns, and the automated processing of large datasets. We describe here the use of such methods to generate a new AI-driven Cloud Classification Atlas (AICCA), which leverages 22 years and 800 terabytes of MODIS satellite observations over the global ocean. We use a rotation-invariant cloud clustering (RICC) method to classify those observations into 42 AI-generated cloud class labels at ~100 km spatial resolution. As a case study, we use AICCA to examine a recent finding of decreasing cloudiness in a critical part of the subtropical stratocumulus deck, and show that the change is accompanied by strong trends in cloud classes.
Object packing by autonomous robots is an important challenge in warehouses and the logistics industry. Most conventional data-driven packing planning approaches focus on regular cuboid packing and are usually heuristic, which limits their practical use in realistic applications with everyday objects. In this paper, we propose a deep hierarchical reinforcement learning approach to simultaneously plan packing sequence and placement for irregular object packing. Specifically, the top manager network infers the packing sequence from six principal view heightmaps of all objects, and then the bottom worker network receives heightmaps of the next object to predict the placement position and orientation. The two networks are trained hierarchically in a self-supervised Q-Learning framework, where the rewards are provided by the packing results based on the top height, object volume and placement stability in the box. The framework repeats sequence and placement planning iteratively until all objects have been packed into the box or no space remains for unpacked items. We compare our approach with existing robotic packing methods for irregular objects in a physics simulator. Experiments show that our approach can pack more objects with less time cost than the state-of-the-art packing methods for irregular objects. We also implement our packing plan with a robotic manipulator to show the generalization ability in the real world.
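As a rough illustration of how a heightmap can drive placement and yield the top-height quantity mentioned above, the following minimal sketch drops a flat-bottomed object onto a box heightmap. The function name `place_object`, the flat-bottom and full-rectangular-footprint assumptions, and the grid representation are all hypothetical and are not taken from the paper.

```python
import numpy as np

def place_object(box_heightmap, object_top, row, col):
    """Drop an object onto the box heightmap and return the updated map and top height.

    Assumptions (hypothetical, for illustration only): the object has a flat
    bottom and fills its whole rectangular footprint; object_top[i, j] is the
    object's height above its bottom at footprint cell (i, j).
    """
    h, w = object_top.shape
    rows, cols = box_heightmap.shape
    if row < 0 or col < 0 or row + h > rows or col + w > cols:
        raise ValueError("object does not fit inside the box at this position")
    # A flat bottom comes to rest on the highest cell under the footprint.
    rest_height = box_heightmap[row:row + h, col:col + w].max()
    updated = box_heightmap.copy()
    updated[row:row + h, col:col + w] = rest_height + object_top
    return updated, float(updated.max())
```

A lower returned top height after placement indicates a more compact packing, which is one of the three reward ingredients the abstract names; how the paper actually combines them is not stated here.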
Object classification using LiDAR 3D point cloud data is critical for modern applications such as autonomous driving. However, labeling point cloud data is labor-intensive as it requires human annotators to visualize and inspect the 3D data from different perspectives. In this paper, we propose a semi-supervised cross-domain learning approach that does not rely on manual annotations of point clouds and performs similarly to fully-supervised approaches. We utilize available 3D object models to train classifiers that can generalize to real-world point clouds. We simulate the acquisition of point clouds by sampling 3D object models from multiple viewpoints and with arbitrary partial occlusions. We then augment the resulting set of point clouds through random rotations and added Gaussian noise to better emulate real-world scenarios. We then train point cloud encoding models, e.g., DGCNN and PointNet++, on the synthesized and augmented datasets and evaluate their cross-domain classification performance on corresponding real-world datasets. We also introduce Point-Syn2Real, a new benchmark dataset for cross-domain learning on point clouds. The results of our extensive experiments with this dataset demonstrate that the proposed cross-domain learning approach for point clouds outperforms the related baseline and state-of-the-art approaches in both indoor and outdoor settings in terms of cross-domain generalizability. The code and data will be available upon publishing.
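The augmentation step described above (random rotation plus Gaussian noise) could look roughly like the sketch below. The choice of the z-axis as rotation axis, the default `noise_std`, and the function name `augment_point_cloud` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def augment_point_cloud(points, noise_std=0.01, rng=None):
    """Rotate an (N, 3) point cloud by a random angle and add Gaussian jitter.

    Assumptions (not from the paper): rotation is about the z-axis only,
    and the noise is isotropic with standard deviation noise_std.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s, 0.0],
                         [s, c, 0.0],
                         [0.0, 0.0, 1.0]])
    rotated = points @ rotation.T  # apply the rotation to every point
    noise = rng.normal(0.0, noise_std, size=points.shape)
    return rotated + noise
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, makes the augmentation reproducible.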
After surveying privacy concerns induced by person-tracking systems, we propose InvisibiliTee, a black-box adversarial attack method on state-of-the-art human detection models. The method learns printable adversarial patterns for T-shirts that cloak wearers in the physical world in front of person-tracking systems. We design an angle-agnostic learning scheme that utilizes segmentation of a fashion dataset and a geometric warping process, so that the generated adversarial patterns are effective in fooling person detectors from all camera angles and for unseen black-box detection models. Empirical results in both digital and physical environments show that with the InvisibiliTee on, person-tracking systems' ability to detect the wearer drops significantly.
Explaining deep convolutional neural networks has recently drawn increasing attention, since it helps to understand the networks' internal operations and why they make certain decisions. Saliency maps, which emphasize salient regions largely connected to the network's decision-making, are one of the most common ways of visualizing and analyzing deep networks in the computer vision community. However, saliency maps generated by existing methods cannot represent authentic information in images, because they rely on unproven proposals about the weights of activation maps, which lack a solid theoretical foundation and fail to consider the relations between pixels. In this paper, we develop a novel post-hoc visual explanation method called Shap-CAM based on class activation mapping. Unlike previous gradient-based approaches, Shap-CAM removes the dependence on gradients by obtaining the importance of each pixel through the Shapley value. We demonstrate that Shap-CAM achieves better visual performance and fairness for interpreting the decision-making process. Our approach outperforms previous methods on both recognition and localization tasks.
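For readers unfamiliar with the Shapley value, the sketch below estimates it by sampling random orderings of "players" and averaging each player's marginal contribution. This is a generic Monte Carlo estimator, not the Shap-CAM procedure itself; the names `shapley_values` and `value_fn` are hypothetical, and in a Shap-CAM-like setting the players would presumably correspond to pixels of an activation map and `value_fn` to a class score.

```python
import random

def shapley_values(players, value_fn, num_permutations=200, seed=0):
    """Estimate Shapley values by permutation sampling.

    players: list of hashable player identifiers.
    value_fn: maps a frozenset of players to a real-valued payoff.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = value_fn(frozenset(coalition))
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}
```

For an additive payoff, where the value of a coalition is the sum of fixed per-player weights, every marginal contribution equals the player's weight, so the estimate recovers the weights regardless of how many permutations are sampled.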
Recognizing objects in dense clutter accurately plays an important role in a wide variety of robotic manipulation tasks, including grasping, packing and rearranging. However, conventional visual recognition models usually miss objects because of the significant occlusion among instances and make incorrect predictions because of the visual ambiguity caused by high object crowdedness. In this paper, we propose an interactive exploration framework called Smart Explorer for recognizing all objects in dense clutter. Our Smart Explorer physically interacts with the clutter to maximize the recognition performance while minimizing the number of motions, so that false positives and negatives are alleviated effectively with an optimal accuracy-efficiency trade-off. Specifically, we first collect multi-view RGB-D images of the clutter and reconstruct the corresponding point cloud. By aggregating the instance segmentation of RGB images across views, we acquire the instance-wise point cloud partition of the clutter, through which the existing classes and the number of objects for each class are predicted. Pushing actions for effective physical interaction are generated to sizably reduce the recognition uncertainty, which consists of the instance segmentation entropy and multi-view object disagreement. The optimal accuracy-efficiency trade-off of object recognition in dense clutter is thus achieved via iterative instance prediction and physical interaction. Extensive experiments demonstrate that our Smart Explorer achieves promising recognition accuracy with only a few actions and outperforms random pushing by a large margin.
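One plausible way to compute the two uncertainty terms named above is sketched here: Shannon entropy over an instance's class probabilities, and the fraction of views whose label disagrees with the majority. The function names, the majority-vote definition of disagreement, and the additive combination with a `weight` parameter are assumptions for illustration, not the paper's formulation.

```python
import math
from collections import Counter

def prediction_entropy(class_probs):
    """Shannon entropy (in nats) of one instance's class probability vector."""
    return -sum(p * math.log(p) for p in class_probs if p > 0.0)

def view_disagreement(labels_per_view):
    """Fraction of views whose predicted label differs from the majority label."""
    counts = Counter(labels_per_view)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(labels_per_view)

def recognition_uncertainty(class_probs, labels_per_view, weight=1.0):
    """Hypothetical combined score: entropy plus weighted disagreement."""
    return prediction_entropy(class_probs) + weight * view_disagreement(labels_per_view)
```

Under this sketch, an instance that is predicted confidently and labeled identically in every view scores zero, so a pushing policy would favor regions where the score is high.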
Event cameras are bio-inspired dynamic vision sensors that respond to changes in image intensity with a high temporal resolution, high dynamic range and low latency. These sensor characteristics are ideally suited to enable visual target tracking in concert with a broadcast visual communication channel for smart visual beacons with applications in distributed robotics. Visual beacons can be constructed by high-frequency modulation of Light Emitting Diodes (LEDs) such as vehicle headlights, Internet of Things (IoT) LEDs, smart building lights, etc., that are already present in many real-world scenarios. The high temporal resolution characteristic of the event cameras allows them to capture visual signals at far higher data rates compared to classical frame-based cameras. In this paper, we propose a novel smart visual beacon architecture with both LED modulation and event camera demodulation algorithms. We quantitatively evaluate the relationship between LED transmission rate, communication distance and the message transmission accuracy for the smart visual beacon communication system that we prototyped. The proposed method achieves up to 4 kbps in an indoor environment and lossless transmission over a distance of 100 meters, at a transmission rate of 500 bps, in full sunlight, demonstrating the potential of the technology in an outdoor environment.
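To make the link between LED modulation and event-based demodulation concrete, the sketch below uses simple on-off keying: an idealized event camera reports only brightness changes, so the decoder tracks the LED state between events and samples it at the middle of each bit period. The modulation scheme, the noise-free event model, and the function names are assumptions for illustration; the abstract does not state which algorithms the authors use.

```python
def led_transitions(bits, bit_rate_bps):
    """Return (time_s, polarity) events for an on-off keyed LED.

    Polarity +1 means the LED turns on, -1 means it turns off.
    The LED is assumed to be off before transmission starts.
    """
    period = 1.0 / bit_rate_bps
    events = []
    state = 0
    for i, bit in enumerate(bits):
        if bit != state:  # an idealized event camera fires only on changes
            events.append((i * period, 1 if bit == 1 else -1))
            state = bit
    return events

def decode_events(events, bit_rate_bps, num_bits):
    """Recover bits by sampling the tracked LED state at mid-bit times."""
    period = 1.0 / bit_rate_bps
    bits = []
    state = 0
    idx = 0
    for i in range(num_bits):
        sample_time = (i + 0.5) * period
        while idx < len(events) and events[idx][0] <= sample_time:
            state = 1 if events[idx][1] > 0 else 0
            idx += 1
        bits.append(state)
    return bits
```

At the 500 bps rate quoted above, each bit would last 2 ms under this scheme, far shorter than the frame interval of a typical 30 fps frame-based camera (about 33 ms).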
Grasping in dense clutter is a fundamental skill for autonomous robots. However, the crowdedness and occlusions in cluttered scenarios make it difficult to generate valid grasp poses without collisions, which results in low efficiency and high failure rates. To address these difficulties, we present a generic framework called GE-Grasp for robotic motion planning in dense clutter, where we leverage diverse action primitives for occluded object removal and present a generator-evaluator architecture to avoid spatial collisions. As a result, GE-Grasp is capable of grasping objects in dense clutter efficiently with promising success rates. Specifically, we define three action primitives: target-oriented grasping for target capturing, pushing, and nontarget-oriented grasping to reduce the crowdedness and occlusions. The generators provide various action candidates based on the spatial information. Meanwhile, the evaluators assess the selected action primitive candidates, and the optimal action is executed by the robot. Extensive experiments in simulated and real-world environments show that our approach outperforms the state-of-the-art methods for grasping in clutter with respect to motion efficiency and success rates. Moreover, we achieve performance in the real world comparable to that in the simulation environment, which indicates the strong generalization ability of our GE-Grasp. Supplementary material is available at: https://github.com/CaptainWuDaoKou/GE-Grasp.