Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in appearance among different environments, adapting a pre-trained segmentation model to a new domain is required in hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-aware image stylization and (ii) consensus pseudo-labeling for domain adaptation of hand segmentation. We stylize source images independently for the foreground and background using target images as style. To resolve the domain shift that the stylization has not addressed, we apply careful pseudo-labeling by taking a consensus between the models trained on the source and stylized source images. We validated our method on domain adaptation of hand segmentation from real and simulation images. Our method achieved state-of-the-art performance in both settings. We also demonstrated promising results in challenging multi-target domain adaptation and domain generalization settings. Code is available at https://github.com/ut-vision/FgSty-CPL.
In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition. Leveraging multiple modalities has been proved to benefit the Unsupervised Domain Adaptation (UDA) task. In this work, we present Multi-Modal Mutual Enhancement Module (M3EM), a deep module for jointly considering information from multiple modalities to find the most transferable representations across domains. We achieve this by implementing two sub-modules for enhancing each modality using the context of other modalities. The first sub-module exchanges information across modalities through the semantic space, while the second sub-module finds the most transferable spatial region based on the consensus of all modalities.
People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of lost objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. This requirement imposes a cumbersome burden on the users, and the system cannot help remind them of unexpectedly lost objects. We propose GO-Finder ("Generic Object Finder"), a registration-free wearable camera based system for assisting people in finding an arbitrary number of objects based on two key features: automatic discovery of hand-held objects and image-based candidate selection. Given a video taken from a wearable camera, Go-Finder automatically detects and groups hand-held objects to form a visual timeline of the objects. Users can retrieve the last appearance of the object by browsing the timeline through a smartphone app. We conducted a user study to investigate how users benefit from using GO-Finder and confirmed improved accuracy and reduced mental load regarding the object search task by providing clear visual cues on object locations.
Identifying and visualizing regions that are significant for a given deep neural network model, i.e., attribution methods, is still a vital but challenging task, especially for spatio-temporal networks that process videos as input. Albeit some methods that have been proposed for video attribution, it is yet to be studied what types of network structures each video attribution method is suitable for. In this paper, we provide a comprehensive study of the existing video attribution methods of two categories, gradient-based and perturbation-based, for visual explanation of neural networks that take videos as the input (spatio-temporal networks). To perform this study, we extended a perturbation-based attribution method from 2D (images) to 3D (videos) and validated its effectiveness by mathematical analysis and experiments. For a more comprehensive analysis of existing video attribution methods, we introduce objective metrics that are complementary to existing subjective ones. Our experimental results indicate that attribution methods tend to show opposite performances on objective and subjective metrics.
[Objective] To develop a CAD system for hip fracture for plain frontal hip X-ray by CNN trained on a large dataset collected at multiple institutions. And, the possibility of the diagnosis rate improvement of the proximal femoral fracture by the resident using this CAD system as an aid of the diagnosis. [Materials and methods] In total, 4851 cases of hip fracture patients who visited each institution between 2009 and 2019 were included. 5242 plain pelvic X-rays were extracted from a DICOM server, and a total of 10484 images(5242 with fracture and 5242 without fracture) were used for machine learning. A CNN approach was used. We used the EffectiventNet-B4 framework with Pytorch 1.3 and this http URL 1.0. In the final evaluation, accuracy, sensitivity, specificity, F-value, and AUC were evaluated. Grad-CAM was used to conceptualize the basis of the diagnosis by the CAD system. For 31 residents and 4 orthopedic surgeons, the image diagnosis test was carried out for 600 photographs of hip fracture randomly extracted from test image data set. And, diagnosis rate in the situation with/without the diagnosis support by the CAD system were evaluated respectively. [Results] The diagnostic accuracy of the learning model was 96.1%, sensitivity 95.2%, specificity 96.9%, F value 0.961, and AUC 0.99. Grad-CAM was used to show the most accurate diagnosis. In the image diagnosis test, the resident acquired the diagnostic ability equivalent to that of the orthopedic surgeon by using the diagnostic aid of the CAD system. [Conclusions] The CAD system using AI for the hip fracture which we developed could offer the diagnosis reason, and it became an image diagnosis tool with the high diagnosis accuracy. And, the possibility of contributing to the diagnosis rate improvement was considered in the field of actual clinical environment such as emergency room.
[Objective] To develop a CAD system for proximal femoral fracture for plain frontal hip radiographs by CNN trained on a large dataset collected at multiple institutions. And, the possibility of the diagnosis rate improvement of the proximal femoral fracture by the resident using this CAD system as an aid of the diagnosis. [Materials and methods] In total, 4851 cases of proximal femoral fracture patients who visited each institution between 2009 and 2019 were included. 5242 plain pelvic radiographs were extracted from a DICOM server, and a total of 10484 images(5242 with fracture and 5242 without fracture) were used for machine learning. A CNN approach was used. We used the EffectiventNet-B4 framework with Pytorch 1.3 and Fast.ai 1.0. In the final evaluation, accuracy, sensitivity, specificity, F-value, and AUC were evaluated. Grad-CAM was used to conceptualize the basis of the diagnosis by the CAD system. For 31 residents and 4 orthopedic surgeons, the image diagnosis test was carried out for 600 photographs of proximal femoral fracture randomly extracted from test image data set. And, diagnosis rate in the situation with/without the diagnosis support by the CAD system were evaluated respectively. [Results] The diagnostic accuracy of the learning model was 96.1%, sensitivity 95.2%, specificity 96.9%, F value 0.961, and AUC 0.99. Grad-CAM was used to show the most accurate diagnosis. In the image diagnosis test, the resident acquired the diagnostic ability equivalent to that of the orthopedic surgeon by using the diagnostic aid of the CAD system. [Conclusions] The CAD system using AI for the proximal femoral fracture which we developed could offer the diagnosis reason, and it became an image diagnosis tool with the high diagnosis accuracy. And, the possibility of contributing to the diagnosis rate improvement was considered in the field of actual clinical environment such as emergency room.
Recent advances in computer vision have made it possible to automatically assess from videos the manipulation skills of humans in performing a task, which has many important applications in domains such as health rehabilitation and manufacturing. However, previous methods used all video appearance as input and did not consider the attention mechanism humans use in assessing videos, which may limit their performance since only a part of video regions is critical for skill assessment. Our motivation here is to model human attention in videos that helps to focus on most relevant video regions for better skill assessment. In particular, we propose a novel deep model that learns spatial attention automatically from videos in an end-to-end manner. We evaluate our approach on a newly collected dataset of infant grasping task and four existing datasets of hand manipulation tasks. Experiment results demonstrate that state-of-the-art performance can be achieved by considering attention in automatic skill assessment.