Autonomous vehicles have been actively investigated over the past few decades. Several recent works show the potential of autonomous driving transportation services in urban environments with impressive experimental results. However, these works note that autonomous vehicles are still occasionally inferior to expert drivers in complex scenarios. Furthermore, they do not focus on the possibilities of autonomous driving transportation services in other areas beyond urban environments. This paper presents the research results and lessons learned from autonomous driving transportation services in airfield, crowded indoor, and urban environments. We discuss how we address several unique challenges in these diverse environments. We also offer an overview of remaining challenges that have not received much attention but must be addressed. This paper aims to share our unique experience to support researchers who are interested in realizing the potential of autonomous vehicles in various real-world environments.
Whole slide image (WSI) classification requires repetitive zoom-in and out for pathologists, as only small portions of the slide may be relevant to detecting cancer. Due to the lack of patch-level labels, multiple instance learning (MIL) is a common practice for training a WSI classifier. One of the challenges in MIL for WSIs is the weak supervision coming only from the slide-level labels, often resulting in severe overfitting. In response, researchers have considered adopting patch-level augmentation or applying mixup augmentation, but their applicability remains unverified. Our approach augments the training dataset by sampling a subset of patches in the WSI without significantly altering the underlying semantics of the original slides. Additionally, we introduce an efficient model (Slot-MIL) that organizes patches into a fixed number of slots, the abstract representation of patches, using an attention mechanism. We empirically demonstrate that the subsampling augmentation helps to make more informative slots by restricting the over-concentration of attention and to improve interpretability. Finally, we illustrate that combining our attention-based aggregation model with subsampling and mixup, which has shown limited compatibility in existing MIL methods, can enhance both generalization and calibration. Our proposed methods achieve the state-of-the-art performance across various benchmark datasets including class imbalance and distribution shifts.
Physical human-robot interactions (pHRIs) can improve robot autonomy and reduce physical demands on humans. In this paper, we consider a collaborative task with a considerably long object and no prior knowledge of the object's parameters. An integrated control framework with an online object parameter estimator and a Cartesian object-aware impedance controller is proposed to realize complicated scenarios. During the transportation task, the object parameters are estimated online while a robot and human lift an object. The perturbation motion is incorporated into the null space of the desired trajectory to enhance the estimator accuracy. An object-aware impedance controller is designed using the real-time estimation results to effectively transmit the intended human motion to the robot through the object. Experimental demonstrations of collaborative tasks, including object transportation and assembly tasks, are implemented to show the effectiveness of our proposed method.
Large-scale image generation models, with impressive quality made possible by the vast amount of data available on the Internet, raise social concerns that these models may generate harmful or copyrighted content. The biases and harmfulness arise throughout the entire training process and are hard to completely remove, which have become significant hurdles to the safe deployment of these models. In this paper, we propose a method called SDD to prevent problematic content generation in text-to-image diffusion models. We self-distill the diffusion model to guide the noise estimate conditioned on the target removal concept to match the unconditional one. Compared to the previous methods, our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality. Furthermore, our method allows the removal of multiple concepts at once, whereas previous works are limited to removing a single concept at a time.
Planning multi-contact motions in a receding horizon fashion requires a value function to guide the planning with respect to the future, e.g., building momentum to traverse large obstacles. Traditionally, the value function is approximated by computing trajectories in a prediction horizon (never executed) that foresees the future beyond the execution horizon. However, given the non-convex dynamics of multi-contact motions, this approach is computationally expensive. To enable online Receding Horizon Planning (RHP) of multi-contact motions, we find efficient approximations of the value function. Specifically, we propose a trajectory-based and a learning-based approach. In the former, namely RHP with Multiple Levels of Model Fidelity, we approximate the value function by computing the prediction horizon with a convex relaxed model. In the latter, namely Locally-Guided RHP, we learn an oracle to predict local objectives for locomotion tasks, and we use these local objectives to construct local value functions for guiding a short-horizon RHP. We evaluate both approaches in simulation by planning centroidal trajectories of a humanoid robot walking on moderate slopes, and on large slopes where the robot cannot maintain static balance. Our results show that locally-guided RHP achieves the best computation efficiency (95\%-98.6\% cycles converge online). This computation advantage enables us to demonstrate online receding horizon planning of our real-world humanoid robot Talos walking in dynamic environments that change on-the-fly.
In urban areas, dense buildings frequently block and reflect global positioning system (GPS) signals, resulting in the reception of a few visible satellites with many multipath signals. This is a significant problem that results in unreliable positioning in urban areas. If a signal reception condition from a certain satellite can be detected, the positioning performance can be improved by excluding or de-weighting the multipath contaminated satellite signal. Thus, we developed a machine-learning-based method of classifying GPS signal reception conditions using a dual-polarized antenna. We employed a decision tree algorithm for classification using three features, one of which can be obtained only from a dual-polarized antenna. A machine-learning model was trained using GPS signals collected from various locations. When the features extracted from the GPS raw signal are input, the generated machine-learning model outputs one of the three signal reception conditions: non-line-of-sight (NLOS) only, line-of-sight (LOS) only, or LOS+NLOS. Multiple testing datasets were used to analyze the classification accuracy, which was then compared with an existing method using dual single-polarized antennas. Consequently, when the testing dataset was collected at different locations from the training dataset, a classification accuracy of 64.47% was obtained, which was slightly higher than the accuracy of the existing method using dual single-polarized antennas. Therefore, the dual-polarized antenna solution is more beneficial than the dual single-polarized antenna solution because it has a more compact form factor and its performance is similar to that of the other solution.
Recent state-of-the-art methods for HOI detection typically build on transformer architectures with two decoder branches, one for human-object pair detection and the other for interaction classification. Such disentangled transformers, however, may suffer from insufficient context exchange between the branches and lead to a lack of context information for relational reasoning, which is critical in discovering HOI instances. In this work, we propose the multiplex relation network (MUREN) that performs rich context exchange between three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens. The proposed method learns comprehensive relational contexts for discovering HOI instances, achieving state-of-the-art performance on two standard benchmarks for HOI detection, HICO-DET and V-COCO.
Scene graph generation aims to construct a semantic graph structure from an image such that its nodes and edges respectively represent objects and their relationships. One of the major challenges for the task lies in the presence of distracting objects and relationships in images; contextual reasoning is strongly distracted by irrelevant objects or backgrounds and, more importantly, a vast number of irrelevant candidate relations. To tackle the issue, we propose the Selective Quad Attention Network (SQUAT) that learns to select relevant object pairs and disambiguate them via diverse contextual interactions. SQUAT consists of two main components: edge selection and quad attention. The edge selection module selects relevant object pairs, i.e., edges in the scene graph, which helps contextual reasoning, and the quad attention module then updates the edge features using both edge-to-node and edge-to-edge cross-attentions to capture contextual information between objects and object pairs. Experiments demonstrate the strong performance and robustness of SQUAT, achieving the state of the art on the Visual Genome and Open Images v6 benchmarks.
In this study, the global positioning system (GPS) multipath detection was performed based on the carrier-to-noise-density ratio, C/N0, measured through a dual-polarized antenna. As the right hand circular polarization (RHCP) antenna is sensitive to the signals directly received from the GPS, and the left hand circular polarization (LHCP) antenna is sensitive to the singly reflected signals, the C/N0 difference between the RHCP and LHCP measurements is used for multipath detection. Once we collected the GPS signals in a low multipath location, we calculated the C/N0 difference to obtain a threshold value that can be used to detect the multipath GPS signal received from another location. The results were validated through a ray-tracing simulation.
The Global Positioning System (GPS) has become the most widely used positioning, navigation, and timing system. However, the vulnerability of GPS to radio frequency interference has attracted significant attention. After experiencing several incidents of intentional high-power GPS jamming trials by North Korea, South Korea decided to deploy the enhanced long-range navigation (eLoran) system, which is a high-power terrestrial radio-navigation system that can complement GPS. As the first phase of the South Korean eLoran program, an eLoran testbed system was recently developed and declared operational on June 1, 2021. Once its operational performance is determined to be satisfactory, South Korea plans to move to the second phase of the program, which is a nationwide eLoran system. For the optimal deployment of additional eLoran transmitters in a nationwide system, it is necessary to properly simulate the expected positioning accuracy of the said future system. In this study, we propose enhanced eLoran accuracy simulation methods based on a land cover map and transmitter jitter estimation. Using actual measurements over the country, the simulation accuracy of the proposed methods was confirmed to be approximately 10%-91% better than that of the existing Loran (i.e., Loran-C and eLoran) positioning accuracy simulators depending on the test locations.