On shopping websites, product images of low quality negatively affect customer experience. Although there are plenty of work in detecting images with different defects, few efforts have been dedicated to correct those defects at scale. A major challenge is that there are thousands of product types and each has specific defects, therefore building defect specific models is unscalable. In this paper, we propose a unified Image-to-Image (I2I) translation model to correct multiple defects across different product types. Our model leverages an attention mechanism to hierarchically incorporate high-level defect groups and specific defect types to guide the network to focus on defect-related image regions. Evaluated on eight public datasets, our model reduces the Frechet Inception Distance (FID) by 24.6% in average compared with MoNCE, the state-of-the-art I2I method. Unlike public data, another practical challenge on shopping websites is that some paired images are of low quality. Therefore we design our model to be semi-paired by combining the L1 loss of paired data with the cycle loss of unpaired data. Tested on a shopping website dataset to correct three image defects, our model reduces (FID) by 63.2% in average compared with WS-I2I, the state-of-the art semi-paired I2I method.
Among the current mainstream change detection networks, transformer is deficient in the ability to capture accurate low-level details, while convolutional neural network (CNN) is wanting in the capacity to understand global information and establish remote spatial relationships. Meanwhile, both of the widely used early fusion and late fusion frameworks are not able to well learn complete change features. Therefore, based on swin transformer V2 (Swin V2) and VGG16, we propose an end-to-end compounded dense network SwinV2DNet to inherit the advantages of both transformer and CNN and overcome the shortcomings of existing networks in feature learning. Firstly, it captures the change relationship features through the densely connected Swin V2 backbone, and provides the low-level pre-changed and post-changed features through a CNN branch. Based on these three change features, we accomplish accurate change detection results. Secondly, combined with transformer and CNN, we propose mixed feature pyramid (MFP) which provides inter-layer interaction information and intra-layer multi-scale information for complete feature learning. MFP is a plug and play module which is experimentally proven to be also effective in other change detection networks. Further more, we impose a self-supervision strategy to guide a new CNN branch, which solves the untrainable problem of the CNN branch and provides the semantic change information for the features of encoder. The state-of-the-art (SOTA) change detection scores and fine-grained change maps were obtained compared with other advanced methods on four commonly used public remote sensing datasets. The code is available at https://github.com/DalongZ/SwinV2DNet.
Automatic labeling of coronary arteries is an essential task in the practical diagnosis process of cardiovascular diseases. For experienced radiologists, the anatomically predetermined connections are important for labeling the artery segments accurately, while this prior knowledge is barely explored in previous studies. In this paper, we present a new framework called TopoLab which incorporates the anatomical connections into the network design explicitly. Specifically, the strategies of intra-segment feature aggregation and inter-segment feature interaction are introduced for hierarchical segment feature extraction. Moreover, we propose the anatomy-aware connection classifier to enable classification for each connected segment pair, which effectively exploits the prior topology among the arteries with different categories. To validate the effectiveness of our method, we contribute high-quality annotations of artery labeling to the public orCaScore dataset. The experimental results on both the orCaScore dataset and an in-house dataset show that our TopoLab has achieved state-of-the-art performance.
Graph Pooling technology plays an important role in graph node classification tasks. Sorting pooling technologies maintain large-value units for pooling graphs of varying sizes. However, by analyzing the statistical characteristic of activated units after pooling, we found that a large number of units dropped by sorting pooling are negative-value units that contain useful information and can contribute considerably to the final decision. To maintain more useful information, a novel pooling technology, called Geometric Pooling (GP), was proposed to contain the unique node features with negative values by measuring the similarity of all node features. We reveal the effectiveness of GP from the entropy reduction view. The experiments were conducted on TUdatasets to show the effectiveness of GP. The results showed that the proposed GP outperforms the SOTA graph pooling technologies by 1%\sim5% with fewer parameters.
In this paper, we introduce AdaSelection, an adaptive sub-sampling method to identify the most informative sub-samples within each minibatch to speed up the training of large-scale deep learning models without sacrificing model performance. Our method is able to flexibly combines an arbitrary number of baseline sub-sampling methods incorporating the method-level importance and intra-method sample-level importance at each iteration. The standard practice of ad-hoc sampling often leads to continuous training with vast amounts of data from production environments. To improve the selection of data instances during forward and backward passes, we propose recording a constant amount of information per instance from these passes. We demonstrate the effectiveness of our method by testing it across various types of inputs and tasks, including the classification tasks on both image and language datasets, as well as regression tasks. Compared with industry-standard baselines, AdaSelection consistently displays superior performance.
Localization of the narrowest position of the vessel and corresponding vessel and remnant vessel delineation in carotid ultrasound (US) are essential for carotid stenosis grading (CSG) in clinical practice. However, the pipeline is time-consuming and tough due to the ambiguous boundaries of plaque and temporal variation. To automatize this procedure, a large number of manual delineations are usually required, which is not only laborious but also not reliable given the annotation difficulty. In this study, we present the first video classification framework for automatic CSG. Our contribution is three-fold. First, to avoid the requirement of laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG. Second, to ease the model training, we adopt an inflation strategy for the network, where pre-trained 2D convolution weights can be adapted into the 3D counterpart in our network for an effective warm start. Third, to enhance the feature discrimination of the video, we propose a novel attention-guided multi-dimension fusion (AMDF) transformer encoder to model and integrate global dependencies within and across spatial and temporal dimensions, where two lightweight cross-dimensional attention mechanisms are designed. Our approach is extensively validated on a large clinically collected carotid US video dataset, demonstrating state-of-the-art performance compared with strong competitors.
The machine reading comprehension (MRC) of user manuals has huge potential in customer service. However,current methods have trouble answering complex questions. Therefore, we introduce the Knowing-how & Knowing-that task that requires the model to answer factoid-style, procedure-style, and inconsistent questions about user manuals. We resolve this task by jointly representing the steps and facts in a graph (TARA), which supports a unified inference of various questions. Towards a systematical benchmarking study, we design a heuristic method to automatically parse user manuals into TARAs and build an annotated dataset to test the model's ability in answering real-world questions. Empirical results demonstrate that representing user manuals as TARAs is a desired solution for the MRC of user manuals. An in-depth investigation of TARA further sheds light on the issues and broader impacts of future representations of user manuals. We hope our work can move the MRC of user manuals to a more complex and realistic stage.
Automatic detection of machine anomaly remains challenging for machine learning. We believe the capability of generative adversarial network (GAN) suits the need of machine audio anomaly detection, yet rarely has this been investigated by previous work. In this paper, we propose AEGAN-AD, a totally unsupervised approach in which the generator (also an autoencoder) is trained to reconstruct input spectrograms. It is pointed out that the denoising nature of reconstruction deprecates its capacity. Thus, the discriminator is redesigned to aid the generator during both training stage and detection stage. The performance of AEGAN-AD on the dataset of DCASE 2022 Challenge TASK 2 demonstrates the state-of-the-art result on five machine types. A novel anomaly localization method is also investigated. Source code available at: www.github.com/jianganbai/AEGAN-AD
In this paper, we study the problem of jointly estimating the optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an ``early-fusion'' or ``late-fusion'' manner. Such one-size-fits-all approaches suffer from a dilemma of failing to fully utilize the characteristic of each modality or to maximize the inter-modality complementarity. To address the problem, we propose a novel end-to-end framework, which consists of 2D and 3D branches with multiple bidirectional fusion connections between them in specific layers. Different from previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other one based on the recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9\% reduction in 3D end-point-error from the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26\% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with much fewer parameters. Besides, our methods have strong generalization performance and the ability to handle non-rigid motion. Code is available at https://github.com/MCG-NJU/CamLiFlow.
Recently, min-max optimization problems have received increasing attention due to their wide range of applications in machine learning (ML). However, most existing min-max solution techniques are either single-machine or distributed algorithms coordinated by a central server. In this paper, we focus on the decentralized min-max optimization for learning with domain constraints, where multiple agents collectively solve a nonconvex-strongly-concave min-max saddle point problem without coordination from any server. Decentralized min-max optimization problems with domain constraints underpins many important ML applications, including multi-agent ML fairness assurance, and policy evaluations in multi-agent reinforcement learning. We propose an algorithm called PRECISION (proximal gradient-tracking and stochastic recursive variance reduction) that enjoys a convergence rate of $O(1/T)$, where $T$ is the maximum number of iterations. To further reduce sample complexity, we propose PRECISION$^+$ with an adaptive batch size technique. We show that the fast $O(1/T)$ convergence of PRECISION and PRECISION$^+$ to an $\epsilon$-stationary point imply $O(\epsilon^{-2})$ communication complexity and $O(m\sqrt{n}\epsilon^{-2})$ sample complexity, where $m$ is the number of agents and $n$ is the size of dataset at each agent. To our knowledge, this is the first work that achieves $O(\epsilon^{-2})$ in both sample and communication complexities in decentralized min-max learning with domain constraints. Our experiments also corroborate the theoretical results.