Offloading computationally heavy tasks from an unmanned aerial vehicle (UAV) to a remote server helps improve the battery life and can help reduce resource requirements. Deep learning based state-of-the-art computer vision tasks, such as object segmentation and object detection, are computationally heavy algorithms, requiring large memory and computing power. Many UAVs are using (pretrained) off-the-shelf versions of such algorithms. Offloading such power-hungry algorithms to a remote server could help UAVs save power significantly. However, deep learning based algorithms are susceptible to noise, and a wireless communication system, by its nature, introduces noise to the original signal. When the signal represents an image, noise affects the image. There has not been much work studying the effect of the noise introduced by the communication system on pretrained deep networks. In this work, we first analyze how reliable it is to offload deep learning based computer vision tasks (including both object segmentation and detection) by focusing on the effect of various parameters of a 5G wireless communication system on the transmitted image and demonstrate how the introduced noise of the used 5G wireless communication system reduces the performance of the offloaded deep learning task. Then solutions are introduced to eliminate (or reduce) the negative effect of the noise. The proposed framework starts with introducing many classical techniques as alternative solutions first, and then introduces a novel deep learning based solution to denoise the given noisy input image. The performance of various denoising algorithms on offloading both object segmentation and object detection tasks are compared. Our proposed deep transformer-based denoiser algorithm (NR-Net) yields the state-of-the-art results on reducing the negative effect of the noise in our experiments.
Taking full advantage of the excellent performance of StyleGAN, style transfer-based face swapping methods have been extensively investigated recently. However, these studies require separate face segmentation and blending modules for successful face swapping, and the fixed selection of the manipulated latent code in these works is reckless, thus degrading face swapping quality, generalizability, and practicability. This paper proposes a novel and end-to-end integrated framework for high resolution and attribute preservation face swapping via Adaptive Latent Representation Learning. Specifically, we first design a multi-task dual-space face encoder by sharing the underlying feature extraction network to simultaneously complete the facial region perception and face encoding. This encoder enables us to control the face pose and attribute individually, thus enhancing the face swapping quality. Next, we propose an adaptive latent codes swapping module to adaptively learn the mapping between the facial attributes and the latent codes and select effective latent codes for improved retention of facial attributes. Finally, the initial face swapping image generated by StyleGAN2 is blended with the facial region mask generated by our encoder to address the background blur problem. Our framework integrating facial perceiving and blending into the end-to-end training and testing process can achieve high realistic face-swapping on wild faces without segmentation masks. Experimental results demonstrate the superior performance of our approach over state-of-the-art methods.
Needle picking is a challenging surgical task in robot-assisted surgery due to the characteristics of small slender shapes of needles, needles' variations in shapes and sizes, and demands for millimeter-level control. Prior works, heavily relying on the prior of needles (e.g., geometric models), are hard to scale to unseen needles' variations. In addition, visual tracking errors can not be minimized online using their approaches. In this paper, we propose an end-to-end deep visual learning framework for needle-picking tasks where both visual and control components can be learned jointly online. Our proposed framework integrates a state-of-the-art reinforcement learning framework, Dreamer, with behavior cloning (BC). Besides, two novel techniques, i.e., Virtual Clutch and Dynamic Spotlight Adaptation (DSA), are introduced to our end-to-end visual controller for needle-picking tasks. We conducted extensive experiments in simulation to evaluate the performance, robustness, variation adaptation, and effectiveness of individual components of our method. Our approach, trained by 8k demonstration timesteps and 140k online policy timesteps, can achieve a remarkable success rate of 80%, a new state-of-the-art with end-to-end vision-based surgical robot learning for delicate operations tasks. Furthermore, our method effectively demonstrated its superiority in generalization to unseen dynamic scenarios with needle variations and image disturbance, highlighting its robustness and versatility. Codes and videos are available at https://sites.google.com/view/dreamerbc.
Simulating high-resolution detector responses is a storage-costly and computationally intensive process that has long been challenging in particle physics. Despite the ability of deep generative models to make this process more cost-efficient, ultra-high-resolution detector simulation still proves to be difficult as it contains correlated and fine-grained mutual information within an event. To overcome these limitations, we propose Intra-Event Aware GAN (IEA-GAN), a novel fusion of Self-Supervised Learning and Generative Adversarial Networks. IEA-GAN presents a Relational Reasoning Module that approximates the concept of an ''event'' in detector simulation, allowing for the generation of correlated layer-dependent contextualized images for high-resolution detector responses with a proper relational inductive bias. IEA-GAN also introduces a new intra-event aware loss and a Uniformity loss, resulting in significant enhancements to image fidelity and diversity. We demonstrate IEA-GAN's application in generating sensor-dependent images for the high-granularity Pixel Vertex Detector (PXD), with more than 7.5M information channels and a non-trivial geometry, at the Belle II Experiment. Applications of this work include controllable simulation-based inference and event generation, high-granularity detector simulation such as at the HL-LHC (High Luminosity LHC), and fine-grained density estimation and sampling. To the best of our knowledge, IEA-GAN is the first algorithm for faithful ultra-high-resolution detector simulation with event-based reasoning.
Citizen science is gaining popularity as a valuable tool for labelling large collections of astronomical images by the general public. This is often achieved at the cost of poorer quality classifications made by amateur participants, which are usually verified by employing smaller data sets labelled by professional astronomers. Despite its success, citizen science alone will not be able to handle the classification of current and upcoming surveys. To alleviate this issue, citizen science projects have been coupled with machine learning techniques in pursuit of a more robust automated classification. However, existing approaches have neglected the fact that, apart from the data labelled by amateurs, (limited) expert knowledge of the problem is also available along with vast amounts of unlabelled data that have not yet been exploited within a unified learning framework. This paper presents an innovative learning paradigm for citizen science capable of taking advantage of expert- and amateur-labelled data, and unlabelled data. The proposed methodology first learns from unlabelled data with a convolutional autoencoder and then exploits amateur and expert labels via the pre-training and fine-tuning of a convolutional neural network, respectively. We focus on the classification of galaxy images from the Galaxy Zoo project, from which we test binary, multi-class, and imbalanced classification scenarios. The results demonstrate that our solution is able to improve classification performance compared to a set of baseline approaches, deploying a promising methodology for learning from different confidence levels in data labelling.
The majority of human deaths and injuries are caused by traffic accidents. A million people worldwide die each year due to traffic accident injuries, consistent with the World Health Organization. Drivers who do not receive enough sleep, rest, or who feel weary may fall asleep behind the wheel, endangering both themselves and other road users. The research on road accidents specified that major road accidents occur due to drowsiness while driving. These days, it is observed that tired driving is the main reason to occur drowsiness. Now, drowsiness becomes the main principle for to increase in the number of road accidents. This becomes a major issue in a world which is very important to resolve as soon as possible. The predominant goal of all devices is to improve the performance to detect drowsiness in real time. Many devices were developed to detect drowsiness, which depend on different artificial intelligence algorithms. So, our research is also related to driver drowsiness detection which can identify the drowsiness of a driver by identifying the face and then followed by eye tracking. The extracted eye image is matched with the dataset by the system. With the help of the dataset, the system detected that if eyes were close for a certain range, it could ring an alarm to alert the driver and if the eyes were open after the alert, then it could continue tracking. If the eyes were open then the score that we set decreased and if the eyes were closed then the score increased. This paper focus to resolve the problem of drowsiness detection with an accuracy of 80% and helps to reduce road accidents.
Pre-training vision-language models with contrastive objectives has shown promising results that are both scalable to large uncurated datasets and transferable to many downstream applications. Some following works have targeted to improve data efficiency by adding self-supervision terms, but inter-domain (image-text) contrastive loss and intra-domain (image-image) contrastive loss are defined on individual spaces in those works, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training. UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space. The discrepancies that occur when integrating contrastive loss between different domains are resolved by the three key components of UniCLIP: (1) augmentation-aware feature embedding, (2) MP-NCE loss, and (3) domain dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks. In our experiments, we show that each component that comprises UniCLIP contributes well to the final performance.
The analysis of FFPE tissue sections stained with haematoxylin and eosin (H&E) or immunohistochemistry (IHC) is an essential part of the pathologic assessment of surgically resected breast cancer specimens. IHC staining has been broadly adopted into diagnostic guidelines and routine workflows to manually assess status and scoring of several established biomarkers, including ER, PGR, HER2 and KI67. However, this is a task that can also be facilitated by computational pathology image analysis methods. The research in computational pathology has recently made numerous substantial advances, often based on publicly available whole slide image (WSI) data sets. However, the field is still considerably limited by the sparsity of public data sets. In particular, there are no large, high quality publicly available data sets with WSIs of matching IHC and H&E-stained tissue sections. Here, we publish the currently largest publicly available data set of WSIs of tissue sections from surgical resection specimens from female primary breast cancer patients with matched WSIs of corresponding H&E and IHC-stained tissue, consisting of 4,212 WSIs from 1,153 patients. The primary purpose of the data set was to facilitate the ACROBAT WSI registration challenge, aiming at accurately aligning H&E and IHC images. For research in the area of image registration, automatic quantitative feedback on registration algorithm performance remains available through the ACROBAT challenge website, based on more than 37,000 manually annotated landmark pairs from 13 annotators. Beyond registration, this data set has the potential to enable many different avenues of computational pathology research, including stain-guided learning, virtual staining, unsupervised pre-training, artefact detection and stain-independent models.
Robot vision is greatly affected by occlusions, which poses challenges to autonomous systems. The robot itself may hide targets of interest from the camera, while it moves within the field of view, leading to failures in task execution. For example, if a target of interest is partially occluded by the robot, detecting and grasping it correctly, becomes very challenging. To solve this problem, we propose a computationally lightweight method to determine the areas that the robot occludes. For this purpose, we use the Unified Robot Description Format (URDF) to generate a virtual depth image of the 3D robot model. Using the virtual depth image, we can effectively determine the partially occluded areas to improve the robustness of the information given by the perception system. Due to the real-time capabilities of the method, it can successfully detect occlusions of moving targets by the moving robot. We validate the effectiveness of the method in an experimental setup using a 6-DoF robot arm and an RGB-D camera by detecting and handling occlusions for two tasks: Pose estimation of a moving object for pickup and human tracking for robot handover. The code is available in \url{https://github.com/auth-arl/virtual\_depth\_image}.
Diffusion models show promising generation capability for a variety of data. Despite their high generation quality, the inference for diffusion models is still time-consuming due to the numerous sampling iterations required. To accelerate the inference, we propose ReDi, a simple yet learning-free Retrieval-based Diffusion sampling framework. From a precomputed knowledge base, ReDi retrieves a trajectory similar to the partially generated trajectory at an early stage of generation, skips a large portion of intermediate steps, and continues sampling from a later step in the retrieved trajectory. We theoretically prove that the generation performance of ReDi is guaranteed. Our experiments demonstrate that ReDi improves the model inference efficiency by 2x speedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain image generation such as image stylization.