Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough for training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.
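To make the SBP idea concrete, the following minimal numpy sketch illustrates the semantics only (it is not the OneFlow API): split, broadcast, and partial-value signatures describe the same logical matmul under data parallelism, model parallelism, and a split contraction dimension.

```python
# Conceptual sketch of SBP semantics (not the OneFlow API), using numpy
# arrays to stand in for per-device tensors on two devices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # logical input
W = rng.standard_normal((6, 8))   # logical weight

# Data parallelism: X is split(0) across devices, W is broadcast.
# Each device holds a slice of rows; the output is split(0) too.
X_dev = np.split(X, 2, axis=0)           # split(0)
Y_data = [x @ W for x in X_dev]          # per-device outputs
assert np.allclose(np.concatenate(Y_data, axis=0), X @ W)

# Model parallelism: W is split(1) (columns), X is broadcast.
# The output is split(1); concatenating columns recovers Y.
W_dev = np.split(W, 2, axis=1)           # split(1)
Y_model = [X @ w for w in W_dev]
assert np.allclose(np.concatenate(Y_model, axis=1), X @ W)

# partial-value: splitting the contraction dim (X split(1), W split(0))
# leaves each device with a partial sum; summing (an all-reduce) yields Y.
X_k = np.split(X, 2, axis=1)
W_k = np.split(W, 2, axis=0)
Y_partial = [x @ w for x, w in zip(X_k, W_k)]   # partial-sum values
assert np.allclose(sum(Y_partial), X @ W)
```

Signatures of this kind let a framework infer, per operator, where communication such as the all-reduce implied by a partial-sum output must be inserted.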
This paper proposes an unsupervised cross-modality domain adaptation approach based on pixel alignment and self-training (PAST). Pixel alignment transfers ceT1 scans to the hrT2 modality, helping to reduce the domain shift when training the segmentation model. Self-training adapts the decision boundary of the segmentation network to fit the distribution of hrT2 scans. Experimental results show that PAST significantly outperforms the non-UDA baseline, and it ranked second on the CrossMoDA validation-phase leaderboard with a mean Dice score of 0.8395.
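As a minimal sketch of the self-training half, assuming a standard confidence-thresholded pseudo-labeling recipe (which may differ from the authors' exact procedure), one adaptation step on unlabeled hrT2 scans could look like:

```python
# Minimal self-training sketch (illustrative, not the authors' exact recipe):
# the segmentation net generates pseudo-labels on real hrT2 scans and only
# confident pixels are used as training targets.
import torch
import torch.nn.functional as F

def self_training_step(model, optimizer, hrT2_batch, conf_thresh=0.9):
    """One adaptation step on unlabeled hrT2 scans (names are assumptions)."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(hrT2_batch), dim=1)           # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                       # per-pixel label
        mask = conf > conf_thresh                             # keep confident pixels
    model.train()
    logits = model(hrT2_batch)
    loss = F.cross_entropy(logits, pseudo, reduction="none")  # (B, H, W)
    loss = (loss * mask).sum() / mask.sum().clamp(min=1)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```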
Mitochondria segmentation in electron microscopy images is essential in neuroscience. However, due to image degradation during the imaging process, the large variety of mitochondrial structures, and the presence of noise, artifacts and other sub-cellular structures, mitochondria segmentation is very challenging. In this paper, we propose a novel and effective contrastive learning framework to learn a better feature representation from hard examples and thereby improve segmentation. Specifically, we adopt a point-sampling strategy to pick out representative pixels from hard examples in the training phase. Based on these sampled pixels, we introduce a pixel-wise, label-based contrastive loss that consists of a similarity loss term and a consistency loss term. The similarity term increases the similarity of pixels from the same class and the separability of pixels from different classes in feature space, while the consistency term enhances the sensitivity of the 3D model to changes in image content from frame to frame. We demonstrate the effectiveness of our method on the MitoEM and FIB-SEM datasets, achieving results better than or on par with the state of the art.
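A generic instantiation of the similarity term over the sampled pixels might look as follows (a supervised contrastive form written for illustration; the paper's exact loss may differ):

```python
# Illustrative pixel-wise, label-based similarity loss over sampled pixels
# (a generic supervised contrastive form, not the paper's exact formulation).
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, D) features of sampled hard pixels; labels: (N,)."""
    z = F.normalize(embeddings, dim=1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye  # same class, not self
    logits = (z @ z.t() / temperature).masked_fill(eye, float("-inf"))
    logp = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    logp = logp.masked_fill(eye, 0.0)          # avoid -inf * 0 on the diagonal
    n_pos = pos.sum(dim=1).clamp(min=1)
    loss = -(logp * pos).sum(dim=1) / n_pos    # mean log-prob of positive pairs
    return loss[pos.any(dim=1)].mean()         # skip pixels with no positive

# The consistency term could analogously penalize feature drift between
# adjacent frames, e.g. F.mse_loss(feat_t, feat_t_next).
```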
Deep-learning-based visual trackers entail offline pre-training on large volumes of video data with accurate bounding box annotations, which are labor-intensive to obtain. We present a new framework to facilitate bounding box annotation for video sequences, built on a selection-and-refinement strategy that automatically improves the preliminary annotations generated by tracking algorithms. A temporal assessment network (T-Assess Net) is proposed to capture the temporal coherence of target locations and select reliable tracking results by measuring their quality. Meanwhile, a visual-geometry refinement network (VG-Refine Net) is designed to further enhance the selected tracking results by considering both target appearance and temporal geometry constraints, allowing inaccurate tracking results to be corrected. Together, the two networks provide a principled approach to ensuring the quality of automatic video annotation. Experiments on large-scale tracking benchmarks demonstrate that our method delivers highly accurate bounding box annotations and reduces human labor by 94.0%, yielding an effective means of further boosting tracking performance with augmented training data.
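The selection-and-refinement strategy can be summarized by the following high-level sketch, where the module interfaces (function names, the threshold `tau`) are assumptions rather than the released implementation:

```python
# High-level sketch of the selection-and-refinement annotation pipeline
# (module interfaces are placeholders, not the released implementation).
def annotate_video(frames, tracker, t_assess_net, vg_refine_net, tau=0.5):
    boxes = tracker(frames)          # preliminary annotations from a tracker
    quality = t_assess_net(boxes)    # temporal-coherence quality score per frame
    annotations = []
    for frame, box, q in zip(frames, boxes, quality):
        if q >= tau:                 # reliable result: keep as-is
            annotations.append(box)
        else:                        # unreliable: refine with appearance and
            annotations.append(vg_refine_net(frame, box))  # geometry constraints
    return annotations
```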
Blood vessel segmentation is crucial for many diagnostic and research applications. In recent years, CNN-based models have led to breakthroughs in segmentation; however, such methods usually lose high-frequency information such as object boundaries and subtle structures, which are vital to vessel segmentation. To tackle this issue, we propose a Boundary Enhancement and Feature Denoising (BEFD) module that improves the network's ability to extract boundary information in semantic segmentation and can be integrated into an arbitrary encoder-decoder architecture in an end-to-end way. By introducing a Sobel edge detector, the network acquires an additional edge prior, thus enhancing boundaries in an unsupervised manner for medical image segmentation. In addition, we utilize a denoising block to reduce the noise hidden in the low-level features. Experimental results on a retinal vessel dataset and an angiocarpy dataset demonstrate the superior performance of the new BEFD module.
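A minimal sketch of the Sobel-based edge prior follows; the real BEFD module likely differs in where and how the prior is injected into the encoder-decoder:

```python
# Illustrative boundary-enhancement sketch: fixed Sobel kernels produce an
# edge prior that is appended to the input as an extra channel (a minimal
# form; the paper's BEFD module may inject the prior differently).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdgePrior(nn.Module):
    def __init__(self):
        super().__init__()
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        ky = kx.t()
        # fixed (non-trainable) kernels, shape (2, 1, 3, 3)
        self.register_buffer("kernel", torch.stack([kx, ky]).unsqueeze(1))

    def forward(self, x):                      # x: (B, 1, H, W) grayscale
        g = F.conv2d(x, self.kernel, padding=1)
        edge = torch.sqrt(g[:, :1] ** 2 + g[:, 1:] ** 2 + 1e-6)
        return torch.cat([x, edge], dim=1)     # append edge prior as a channel

x = torch.rand(2, 1, 64, 64)
print(SobelEdgePrior()(x).shape)               # torch.Size([2, 2, 64, 64])
```

Because the Sobel kernels are fixed, the edge prior requires no boundary supervision, which is what makes the enhancement unsupervised.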
In federated learning (FL), fair and accurate measurement of each participant's contribution is of great significance. The level of contribution not only provides a rational metric for distributing financial benefits among federated participants, but also helps to discover malicious participants that try to poison the FL framework. Previous methods for contribution measurement were based on enumerating possible combinations of federated participants; their computation costs increase drastically with the number of participants or feature dimensions, making them inapplicable in practical situations. In this paper, an efficient method is proposed to evaluate the contributions of federated participants. The paper focuses on the horizontal FL framework, where client servers calculate parameter gradients over their local data and upload the gradients to the central server. Before aggregating the client gradients, the central server trains a data value estimator on the gradients using reinforcement learning techniques. Experimental results show that the proposed method consistently outperforms the conventional leave-one-out method in terms of both valuation authenticity and time complexity.
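One plausible reading of the gradient-valuation loop, sketched with a REINFORCE-style update (the estimator architecture, reward design, and function names here are assumptions, not the paper's exact method):

```python
# Compact sketch: a learned estimator assigns selection probabilities to
# client gradients, and a REINFORCE-style update trains it from a reward
# (e.g., validation gain). All design choices here are assumptions.
import torch

def aggregation_step(client_grads, estimator, est_opt, reward_fn):
    """client_grads: list of flattened per-client gradient tensors."""
    feats = torch.stack([g.detach() for g in client_grads])   # (K, D)
    probs = torch.sigmoid(estimator(feats)).squeeze(-1)       # selection probs (K,)
    select = torch.bernoulli(probs).detach()                  # sampled subset
    weights = select / select.sum().clamp(min=1)
    agg = (weights.unsqueeze(1) * feats).sum(dim=0)           # weighted aggregate
    reward = reward_fn(agg)   # scalar, e.g. validation gain from applying agg
    # REINFORCE: reinforce selections that led to high reward
    logp = select * torch.log(probs + 1e-8) + (1 - select) * torch.log(1 - probs + 1e-8)
    loss = -reward * logp.sum()
    est_opt.zero_grad(); loss.backward(); est_opt.step()
    return agg, probs         # probs double as per-client contribution scores
```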
Code retrieval is a key task that aims to match natural and programming languages. In this work, we propose adversarial learning for code retrieval that is regularized by question-description relevance. First, we adapt a simple adversarial learning technique to generate difficult code snippets for a given input question, which helps the learning of code retrieval under bi-modal and data-scarce conditions. Second, we propose leveraging question-description relevance to regularize adversarial learning, such that a generated code snippet contributes more to the code retrieval training loss only if its paired natural language description is predicted to be less relevant to the user's question. Experiments on large-scale code retrieval datasets for two programming languages show that our adversarial learning method improves the performance of state-of-the-art models. Moreover, using an additional duplicate question prediction model to regularize adversarial learning further improves performance, and this is more effective than using the duplicated questions in strong multi-task learning baselines.
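As an illustration, the relevance regularization could be realized by down-weighting the adversarial term of a hinge loss; the function names, the shared encoder, and the exact weighting below are assumptions:

```python
# Sketch of relevance-regularized adversarial training for code retrieval
# (one plausible reading of the regularization, not the paper's exact loss;
# a single encoder for both modalities is a simplification).
import torch
import torch.nn.functional as F

def retrieval_step(question, pos_code, generator, relevance_model, encoder,
                   margin=0.5):
    adv_code = generator(question)                   # generated hard snippet
    q, cp, ca = encoder(question), encoder(pos_code), encoder(adv_code)
    s_pos = F.cosine_similarity(q, cp, dim=-1)
    s_adv = F.cosine_similarity(q, ca, dim=-1)
    # Down-weight the adversarial term when the generated snippet's paired
    # description is predicted to be highly relevant to the question.
    w = 1.0 - relevance_model(question, adv_code)    # relevance in [0, 1]
    loss = F.relu(margin - s_pos + w * s_adv).mean() # weighted hinge loss
    return loss
```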
In this study, we propose a novel RGB-T tracking framework that jointly models appearance and motion cues. First, to obtain a robust appearance model, we develop a novel late fusion method to infer the fusion weight maps of the RGB and thermal (T) modalities. The fusion weights are determined by offline-trained global and local multimodal fusion networks and are then used to linearly combine the response maps of the RGB and T modalities. Second, when the appearance cue is unreliable, we comprehensively take motion cues, i.e., target and camera motion, into account to make the tracker robust. We further propose a tracker switcher to flexibly switch between the appearance and motion trackers. Extensive results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
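The late-fusion step itself reduces to a per-pixel weighted combination of response maps, as in this minimal numpy sketch (the weight maps would come from the offline-trained fusion networks, which are left abstract here):

```python
# Minimal late-fusion sketch: per-pixel weight maps linearly combine the
# RGB and thermal response maps (weight prediction is left abstract).
import numpy as np

def fuse_responses(resp_rgb, resp_t, w_rgb, w_t):
    """Linearly combine modality response maps with normalized weight maps."""
    w_sum = w_rgb + w_t + 1e-8
    return (w_rgb * resp_rgb + w_t * resp_t) / w_sum

resp_rgb = np.random.rand(25, 25)
resp_t = np.random.rand(25, 25)
w_rgb = np.full((25, 25), 0.7)   # e.g., RGB deemed more reliable here
w_t = np.full((25, 25), 0.3)
fused = fuse_responses(resp_rgb, resp_t, w_rgb, w_t)
print(np.unravel_index(np.argmax(fused), fused.shape))  # predicted location
```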
This paper makes one of the first efforts toward automatically generating complex questions from knowledge graphs. In particular, we study how to leverage existing simple-question datasets for this task under two separate scenarios: using either sub-questions of the target complex questions, or distantly related pseudo sub-questions when the former are unavailable. First, a competitive base model named CoG2Q is designed to map complex query graphs to natural language questions. Afterwards, we propose two extension models, namely CoGSub2Q and CoGSub^m2Q, respectively for the above two scenarios. The former encodes and copies from a single sub-question, while the latter further scores and aggregates multiple pseudo sub-questions. Experimental results show that the extension models significantly outperform not only the base CoG2Q, but also its augmented variant that uses simple questions as additional training examples. This demonstrates the importance of instance-level connections between simple questions and their corresponding complex questions, which may be underexploited by straightforward data augmentation of CoG2Q that only builds model-level connections through learned parameters.
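The scoring-and-aggregation idea behind CoGSub^m2Q could be sketched as attention-style pooling over pseudo sub-question encodings (an assumed form written for illustration, not the paper's exact architecture):

```python
# Sketch of scoring and aggregating multiple pseudo sub-question encodings
# (an assumed attention-style pooling, not the paper's exact architecture).
import torch
import torch.nn.functional as F

def aggregate_subquestions(query_graph_enc, subq_encs, scorer):
    """query_graph_enc: (D,); subq_encs: (M, D); scorer: (M, 2D) -> (M, 1)."""
    ctx = query_graph_enc.expand(subq_encs.size(0), -1)       # (M, D)
    scores = scorer(torch.cat([subq_encs, ctx], dim=-1)).squeeze(-1)  # (M,)
    weights = F.softmax(scores, dim=0)     # relevance of each pseudo sub-question
    return (weights.unsqueeze(-1) * subq_encs).sum(dim=0)     # aggregated (D,)
```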