Skeleton-based action recognition is widely used in varied areas, e.g., surveillance and human-machine interaction. Existing models are mainly learned in a supervised manner and thus depend heavily on large-scale labeled data, which becomes infeasible when labels are prohibitively expensive. In this paper, we propose a novel Contrast-Reconstruction Representation Learning network (CRRL) that simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition. It mainly consists of three parts: a Sequence Reconstructor, a Contrastive Motion Learner, and an Information Fuser. The Sequence Reconstructor learns a representation from the skeleton coordinate sequence via reconstruction, so the learned representation tends to focus on trivial postural coordinates and captures motion dynamics poorly. To enhance motion learning, the Contrastive Motion Learner performs contrastive learning between the representations learned from the coordinate sequence and an additional velocity sequence. Finally, in the Information Fuser, we explore varied strategies to combine the Sequence Reconstructor and the Contrastive Motion Learner, and propose to capture postures and motions simultaneously via a knowledge-distillation-based fusion strategy that transfers motion learning from the Contrastive Motion Learner to the Sequence Reconstructor. Experimental results on several benchmarks, i.e., NTU RGB+D 60, NTU RGB+D 120, CMU mocap, and NW-UCLA, demonstrate the promise of the proposed CRRL method, which outperforms state-of-the-art approaches by a large margin.
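A minimal sketch (not the authors' code) of the contrastive step described above: embeddings of a skeleton coordinate sequence and of its velocity sequence are aligned with an InfoNCE-style loss. The GRU encoders, joint count, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUEncoder(nn.Module):
    def __init__(self, in_dim=75, hid_dim=256):   # 25 joints x 3 coordinates, assumed
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True)

    def forward(self, x):                          # x: (batch, time, in_dim)
        _, h = self.rnn(x)
        return F.normalize(h[-1], dim=-1)          # unit-norm sequence embedding

def info_nce(z_coord, z_vel, temperature=0.07):
    """Contrast coordinate-sequence and velocity-sequence embeddings of the same clip."""
    logits = z_coord @ z_vel.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(z_coord.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

coord_enc, vel_enc = GRUEncoder(), GRUEncoder()
seq = torch.randn(8, 50, 75)                       # toy batch of skeleton sequences
vel = seq[:, 1:] - seq[:, :-1]                     # velocity sequence (frame differences)
loss = info_nce(coord_enc(seq), vel_enc(vel))
print(float(loss))
```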
Distributed resource allocation (RA) schemes have been introduced in the cellular vehicle-to-everything (C-V2X) standard for vehicle-to-vehicle (V2V) sidelink (SL) communications to share the limited sub-6 GHz spectrum efficiently. However, recent progress in connected and automated vehicles and mobility services requires a huge amount of spectrum resources. Millimeter-wave and sub-THz frequencies are therefore being considered, as they offer large unoccupied bandwidth, but they require beamforming techniques to compensate for the higher path-loss attenuation. The current fifth-generation (5G) RA standard for SL communication is inherited from the previous C-V2X standard and is not suited for beam-based communication, since it does not exploit the spatial dimension. In this context, we propose a novel RA scheme, namely cooperative three-dimensional RA, that addresses the directional component by adding this third spatial dimension to the bandwidth part structure and promotes cooperation between vehicles in resource selection. Numerical results show an average 10% improvement in packet delivery ratio, an average 50% decrease in collision probability, and a 30% better channel busy ratio compared to the current standard, confirming the validity of the proposed method.
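An illustrative sketch, not the standardized 3GPP procedure, of what sensing-based selection over a three-dimensional resource grid (time slots x sub-channels x beam directions) could look like. The grid sizes, occupancy threshold, and candidate ratio are assumptions; the shared RSRP measurements stand in for the cooperative sensing exchanged between vehicles.

```python
import numpy as np

rng = np.random.default_rng(0)
n_slots, n_subch, n_beams = 100, 4, 8
# RSRP sensed on each (slot, sub-channel, beam) resource, aggregated from neighbors.
sensed_rsrp_dbm = rng.uniform(-120, -60, size=(n_slots, n_subch, n_beams))

def select_resource(sensed, occupancy_thr_dbm=-90.0, candidate_ratio=0.2):
    """Keep the least-interfered 3D resources and pick one of them at random."""
    free = sensed < occupancy_thr_dbm                    # below threshold => usable
    order = np.argsort(sensed, axis=None)                # lowest interference first
    n_keep = max(1, int(candidate_ratio * sensed.size))
    candidates = [np.unravel_index(i, sensed.shape)
                  for i in order[:n_keep] if free[np.unravel_index(i, sensed.shape)]]
    if not candidates:                                   # fall back to the quietest resource
        candidates = [np.unravel_index(order[0], sensed.shape)]
    slot, subch, beam = candidates[rng.integers(len(candidates))]
    return slot, subch, beam

print("selected (slot, sub-channel, beam):", select_resource(sensed_rsrp_dbm))
```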
The evolution of connected and automated vehicle (CAV) technology is boosting the development of innovative solutions for the sixth generation (6G) of vehicle-to-everything (V2X) networks. Lower-frequency networks provide control for millimeter-wave (mmW) or sub-THz beam-based 6G communications. In CAVs, mmW/sub-THz bands guarantee a huge amount of bandwidth (>1 GHz) and a high data rate (>10 Gbit/s), enhancing the safety of CAV applications. However, high-frequency propagation is impaired by severe path loss, and line-of-sight (LoS) propagation can easily be blocked. Static and dynamic blockage (e.g., by non-connected vehicles) heavily affects V2X links; thus, in a multi-vehicular case, knowledge of the LoS (or visibility) map is mandatory for stable connections and proactive beam pointing, which may involve relays whenever necessary. In this paper, we design a criterion for dynamic LoS-map estimation, and we propose a novel framework for relay-of-opportunity selection to enable high-quality and stable V2X links. Relay selection is based on cooperative sensing to cope with LoS blockage conditions. The LoS map is dynamically estimated on top of the static map of the environment by merging the perceptive sensors' data to achieve cooperative awareness of the surrounding scenario. Multiple relay selection architectures are based on centralized and decentralized strategies. A 3GPP standard-compliant simulation framework is adopted herein to reproduce real-world urban vehicular environments and vehicles' mobility patterns.
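A minimal sketch, under heavily simplified assumptions, of the two ingredients described above: (i) an LoS map obtained by ray-casting over an occupancy grid that merges the static map with sensed dynamic obstacles, and (ii) relay selection that prefers a vehicle with LoS to both endpoints of a blocked link. Grid resolution, positions, and the ray-casting step are placeholders.

```python
import numpy as np

def line_of_sight(occ, a, b, n_samples=100):
    """True if the straight segment a->b crosses no occupied grid cell."""
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (1 - t) * np.asarray(a) + t * np.asarray(b)
        if occ[int(round(x)), int(round(y))]:
            return False
    return True

occ = np.zeros((50, 50), dtype=bool)       # static map of the environment
occ[20:30, 24:26] = True                   # sensed dynamic blocker (e.g., a truck)

tx, rx = (10, 25), (40, 25)                # V2V pair whose direct link is blocked
candidates = {"veh_A": (25, 10), "veh_B": (25, 40), "veh_C": (22, 25)}

relays = [v for v, pos in candidates.items()
          if line_of_sight(occ, tx, pos) and line_of_sight(occ, pos, rx)]
print("direct LoS:", line_of_sight(occ, tx, rx), "| candidate relays:", relays)
```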
Neural Architecture Search (NAS) has shown great potential in reducing manual effort in network design by automatically discovering optimal architectures. Notably, object detection has so far been relatively untouched by NAS algorithms despite its significant importance in computer vision. To the best of our knowledge, most recent NAS studies on object detection fail to strike a satisfactory balance between the performance and efficiency of the resulting models, let alone the excessive amount of computational resources consumed by those algorithms. Here we propose an efficient method to obtain better object detectors by searching for the feature pyramid network (FPN) as well as the prediction head of a simple anchor-free object detector, namely FCOS [36], using a tailored reinforcement learning paradigm. With carefully designed search space, search algorithms, and strategies for evaluating network quality, we are able to find top-performing detection architectures within 4 days using 8 V100 GPUs. The discovered architectures surpass state-of-the-art object detection models (such as Faster R-CNN, RetinaNet, and FCOS) by 1.0 to 5.4 points in AP on the COCO dataset, with comparable computational complexity and memory footprint, demonstrating the efficacy of the proposed NAS method for object detection. Code is available at https://github.com/Lausannen/NAS-FCOS.
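A toy sketch of a reinforcement-learning search loop of the kind described above: a small controller samples operation choices for FPN/head cells and is updated with REINFORCE against a proxy reward. The search space, reward, and sizes here are stand-ins, not the paper's actual ones.

```python
import torch
import torch.nn as nn

OPS = ["sep_conv_3x3", "dil_conv_3x3", "skip_connect", "none"]
N_DECISIONS = 6                                   # e.g., ops for a few FPN/head positions

controller_logits = nn.Parameter(torch.zeros(N_DECISIONS, len(OPS)))
optimizer = torch.optim.Adam([controller_logits], lr=0.1)

def proxy_reward(arch):
    """Stand-in for briefly training the sampled detector and reading its AP."""
    return sum(op != "none" for op in arch) / len(arch)

baseline = 0.0
for step in range(50):
    dist = torch.distributions.Categorical(logits=controller_logits)
    sample = dist.sample()                        # one op index per decision
    arch = [OPS[i] for i in sample.tolist()]
    reward = proxy_reward(arch)
    baseline = 0.9 * baseline + 0.1 * reward      # moving-average baseline reduces variance
    loss = -(reward - baseline) * dist.log_prob(sample).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print("most likely ops:", [OPS[i] for i in controller_logits.argmax(dim=-1).tolist()])
```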
We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. To address the first issue, we perform volume rendering only to produce a low-resolution feature map and progressively apply upsampling in 2D. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and different levels of styles, which can generalize to unseen views. It also supports challenging tasks, including zoom-in and zoom-out, style mixing, inversion, and semantic editing.
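A schematic sketch (not the released model) of the two-stage rendering idea: volume rendering is run only at low resolution to produce a feature map, which a 2D network then progressively upsamples to the output image. The shapes, layer choices, and sample counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

def volume_render_features(sigma, feat):
    """Alpha-composite per-sample features along each ray.
    sigma: (B, H, W, S) densities, feat: (B, H, W, S, C) per-sample features."""
    alpha = 1.0 - torch.exp(-torch.relu(sigma))
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans                              # (B, H, W, S)
    return (weights.unsqueeze(-1) * feat).sum(dim=-2)    # (B, H, W, C) feature map

class Upsampler2D(nn.Module):
    """Progressively upsample the low-resolution feature map to an RGB image."""
    def __init__(self, c_in=64, scale_steps=3):
        super().__init__()
        blocks = []
        for _ in range(scale_steps):                     # 32 -> 256 with 3 doublings
            blocks += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.Conv2d(c_in, c_in, 3, padding=1), nn.LeakyReLU(0.2)]
        self.body = nn.Sequential(*blocks, nn.Conv2d(c_in, 3, 1))

    def forward(self, x):
        return self.body(x)

B, H, W, S, C = 1, 32, 32, 16, 64                        # low-res rays, few samples per ray
sigma, feat = torch.randn(B, H, W, S), torch.randn(B, H, W, S, C)
feat_map = volume_render_features(sigma, feat).permute(0, 3, 1, 2)  # (B, C, 32, 32)
image = Upsampler2D(C)(feat_map)                         # (B, 3, 256, 256)
print(image.shape)
```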
Objectives: to propose a fully-automatic computer-aided diagnosis (CAD) solution for liver lesion characterization, with uncertainty estimation. Methods: we enrolled 400 patients who had either a liver resection or a biopsy and were diagnosed with either hepatocellular carcinoma (HCC), intrahepatic cholangiocarcinoma, or secondary metastasis, from 2006 to 2019. Each patient was scanned with T1WI, T2WI, T1WI venous phase (T1WI-V), T1WI arterial phase (T1WI-A), and DWI MRI sequences. We propose a fully-automatic deep CAD pipeline that localizes lesions from 3D MRI studies using key-slice parsing and provides a confidence measure for its diagnoses. We evaluate using five-fold cross validation and compare performance against three radiologists: a senior hepatology radiologist, a junior hepatology radiologist, and an abdominal radiologist. Results: the proposed CAD solution achieves a mean F1 score of 0.62, outperforming the abdominal radiologist (0.47), matching the junior hepatology radiologist (0.61), and underperforming the senior hepatology radiologist (0.68). The CAD system can informatively assess its diagnostic confidence: when evaluating only the 70% most confident cases, the mean F1 score and the sensitivity at 80% specificity for HCC vs. others are boosted from 0.62 to 0.71 and from 0.84 to 0.92, respectively. Conclusion: the proposed fully-automatic CAD solution provides good diagnostic performance with informative confidence assessments in finding and discriminating liver lesions from MRI studies.
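A hedged sketch of the confidence-gated evaluation described above: keep only the most confident 70% of cases (here ranked by maximum softmax probability, one simple choice of confidence measure) and recompute the metric on that subset. The synthetic data and the confidence criterion are placeholders, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
n_cases, n_classes = 400, 3                          # HCC, ICC, metastasis (illustrative)
probs = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=n_cases)   # per-case class probabilities
labels = rng.integers(0, n_classes, size=n_cases)    # toy ground-truth diagnoses

confidence = probs.max(axis=1)                       # maximum predicted probability
keep = confidence >= np.quantile(confidence, 0.30)   # retain the top 70% most confident

preds = probs.argmax(axis=1)
f1_all = f1_score(labels, preds, average="macro")
f1_confident = f1_score(labels[keep], preds[keep], average="macro")
print(f"macro F1 on all cases: {f1_all:.2f}, on the 70% most confident: {f1_confident:.2f}")
```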
We propose a semi-supervised network for wide-angle portrait correction. Wide-angle images often suffer from skew and distortion caused by perspective distortion, which is especially noticeable in face regions. Previous deep-learning-based approaches require ground-truth correction flow maps as training guidance. However, such labels are expensive and can only be obtained manually. In this work, we propose a semi-supervised scheme that can consume unlabeled data in addition to labeled data for further improvement. Specifically, our semi-supervised scheme takes advantage of a consistency mechanism, with several novel components such as direction and range consistency (DRC) and regression consistency (RC). Furthermore, our network, named Multi-Scale Swin-Unet (MS-Unet), is built upon the multi-scale Swin transformer block (MSTB), which can learn both local-scale and long-range semantic information effectively. In addition, we introduce a high-quality unlabeled dataset with rich scenarios for training. Extensive experiments demonstrate that the proposed method is superior to state-of-the-art methods and other representative baselines.
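A minimal sketch, not the authors' training code, of a regression-consistency term on unlabeled images: the correction flow predicted for a horizontally flipped image should match the flipped (and x-negated) flow of the original image. The tiny backbone is a placeholder for the MS-Unet described above.

```python
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):                        # stand-in for MS-Unet
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 2, 3, padding=1))   # 2-channel correction flow

    def forward(self, x):
        return self.net(x)

def flip_flow(flow):
    """Horizontally flip a flow map and negate its horizontal component."""
    flipped = torch.flip(flow, dims=[-1])
    flipped[:, 0] = -flipped[:, 0]
    return flipped

model = TinyFlowNet()
unlabeled = torch.randn(4, 3, 64, 64)                # toy unlabeled wide-angle crops
flow = model(unlabeled)
flow_of_flipped = model(torch.flip(unlabeled, dims=[-1]))
target = flip_flow(flow.detach())                    # stop-gradient consistency target
consistency_loss = nn.functional.l1_loss(flow_of_flipped, target)
print(float(consistency_loss))
```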
Recent advances have enabled a single neural network to serve as an implicit scene representation, establishing a mapping function between spatial coordinates and scene properties. In this paper, we take a further step towards continual learning of the implicit scene representation directly from sequential observations, namely Continual Neural Mapping. The proposed problem setting bridges the gap between batch-trained implicit neural representations and the streaming data commonly used in the robotics and vision communities. We introduce an experience replay approach to tackle an exemplary task of continual neural mapping: approximating a continuous signed distance function (SDF) from sequential depth images as a scene geometry representation. We show for the first time that a single network can continually represent scene geometry over time without catastrophic forgetting, while achieving promising trade-offs between accuracy and efficiency.
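A schematic sketch, under assumed shapes and sizes, of the experience-replay idea: an MLP regresses signed distance from 3D points, and each update mixes samples from the current observation with samples replayed from earlier frames so that previously mapped geometry is not forgotten.

```python
import random
import torch
import torch.nn as nn

sdf_net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(sdf_net.parameters(), lr=1e-3)

replay_buffer = []                                   # (point, sdf) pairs from past frames
BUFFER_CAP, REPLAY_RATIO = 20_000, 0.5

def train_on_frame(points, sdf_vals, steps=50, batch=256):
    """points: (N, 3) samples from the new depth frame, sdf_vals: (N, 1) supervision."""
    global replay_buffer
    for _ in range(steps):
        idx = torch.randint(0, points.size(0), (int(batch * (1 - REPLAY_RATIO)),))
        pts, tgt = points[idx], sdf_vals[idx]
        if replay_buffer:                            # mix in replayed past samples
            old = random.sample(replay_buffer, min(len(replay_buffer), batch // 2))
            pts = torch.cat([pts, torch.stack([p for p, _ in old])])
            tgt = torch.cat([tgt, torch.stack([t for _, t in old])])
        loss = nn.functional.l1_loss(sdf_net(pts), tgt)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    keep = torch.randperm(points.size(0))[:1000]     # store a subset for future replay
    replay_buffer.extend(zip(points[keep], sdf_vals[keep]))
    replay_buffer = replay_buffer[-BUFFER_CAP:]

train_on_frame(torch.randn(5000, 3), torch.randn(5000, 1))   # toy frame
```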
Advanced service robots require superior tactile intelligence to guarantee human-contact safety and to provide essential supplements to visual and auditory information for human-robot interaction, especially when a robot is in physical contact with a human. Tactile intelligence is an essential capability of perception and recognition from tactile information, based on learning from a large amount of tactile data and understanding the physical meaning behind the data. This report introduces a recently collected and organized dataset, "TacAct", that encloses real-time pressure distributions recorded when human subjects touched the arms of a nursing-care robot. The dataset consists of information from 50 subjects who performed a total of 24,000 touch actions. Furthermore, the details of the dataset are described, the data are preliminarily analyzed, and the validity of the collected information is tested with a LeNet-5 convolutional neural network that classifies different types of touch actions. We believe the TacAct dataset will be highly beneficial to the community of human-interactive robots for understanding tactile profiles under various circumstances.
Touching a robot to convey intentions or emotions is an essential communication pathway during physical Human-Robot Interaction (pHRI). Therefore, advanced service robots require superior tactile intelligence to guarantee naturalness and safety when making physical contact with human subjects. Tactile intelligence is the capability to perceive and recognize tactile information from touch behaviors, in which understanding the physical meaning of touch actions is crucial. For this purpose, this report introduces a recently collected and organized dataset, "TacAct", that encloses real-time tactile information recorded when human subjects touched a test device mimicking a robot forearm. The dataset contains 24,000 touch actions of 12 types, collected from 50 subjects. The dataset details are described, the data are preliminarily analyzed, and the validity of the dataset is tested with a LeNet-5 convolutional neural network that classifies different types of touch actions. We believe the TacAct dataset will be beneficial for the community to understand touch intentions under various circumstances and to develop learning-based intelligent algorithms for different applications.
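A hedged sketch of the validity check mentioned in the two TacAct abstracts above: a LeNet-5-style CNN classifying single-frame pressure maps into the 12 touch-action types. The input resolution (assumed 32x32 taxels) and all training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, n_classes=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),    # 32 -> 28 -> 14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2))   # 14 -> 10 -> 5
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(), nn.Linear(84, n_classes))

    def forward(self, x):                            # x: (batch, 1, 32, 32) pressure frames
        return self.classifier(self.features(x))

model = LeNet5()
pressure_frames = torch.rand(8, 1, 32, 32)           # toy batch of taxel readings
logits = model(pressure_frames)                      # (8, 12) touch-type scores
print(logits.shape)
```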