Image classification has advanced at an unprecedented pace with the rapid development of deep learning. However, the classification of tiny object images is still not well investigated. In this paper, we first briefly review the development of Convolutional Neural Networks and Visual Transformers in deep learning, and introduce the sources and development of conventional noise and adversarial attacks. We then use various Convolutional Neural Network and Visual Transformer models to conduct a series of experiments on an image dataset of tiny objects (sperms and impurities), and compare several evaluation metrics across the experimental results to identify a model with stable performance. Finally, we discuss open problems in the classification of tiny objects and outline prospects for future work on this task.
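As a rough illustration of what "conventional noise" can mean in such robustness experiments, the sketch below applies two common noise types, Gaussian and salt-and-pepper, to a small grayscale image. The function names, the list-of-rows image representation, and the parameter values are assumptions made for illustration; the abstract does not state which noise types or settings were used.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to a grayscale image (rows of 0-255 ints)."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def add_salt_and_pepper(image, amount, rng):
    """Replace roughly a fraction `amount` of pixels with 0 (pepper) or 255 (salt)."""
    noisy = [list(row) for row in image]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = 255 if rng.random() < 0.5 else 0
    return noisy


# Hypothetical 8x8 mid-gray image and illustrative noise levels.
rng = random.Random(0)
clean = [[128] * 8 for _ in range(8)]
gaussian = add_gaussian_noise(clean, sigma=10.0, rng=rng)
salted = add_salt_and_pepper(clean, amount=0.2, rng=rng)
```

A classifier can then be evaluated on `clean`, `gaussian` and `salted` versions of the same test set to see how far its metrics drop under each perturbation.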
GasHisSDB is a New Gastric Histopathology Subsize Image Database with a total of 245196 images. It is divided into three sub-databases of 160×160, 120×120 and 80×80 pixels. GasHisSDB is built for evaluating image classification methods. To show that methods from different periods of the image classification field perform differently on GasHisSDB, we select a variety of classifiers for evaluation: seven classical machine learning classifiers, three CNN classifiers and a novel transformer-based classifier are tested on image classification tasks. GasHisSDB is available at https://github.com/NEUhwm/GasHisSDB.git.
Existing deep learning methods for the diagnosis of gastric cancer commonly use convolutional neural networks (CNNs). Recently, the Visual Transformer (VT) has attracted major attention because of its performance and efficiency, but its applications are mostly in the field of computer vision. In this paper, a multi-scale visual transformer model, referred to as GasHis-Transformer, is proposed for gastric histopathology image classification (GHIC), which enables the automatic classification of microscopic gastric images into abnormal and normal cases. The GasHis-Transformer model consists of two key modules, a global information module (GIM) and a local information module (LIM), to extract pathological features effectively. In our experiments, GasHis-Transformer is applied to a public hematoxylin and eosin (H&E) stained gastric histopathology dataset with 280 abnormal or normal images, achieving precision, recall, F1-score and accuracy on the testing set of 98.0%, 100.0%, 96.0% and 98.0%, respectively. Furthermore, a critical study is conducted to evaluate the robustness of GasHis-Transformer by adding ten different types of noise, including adversarial attacks and traditional image noise. In addition, a clinically meaningful study is executed to test the gastric cancer identification of GasHis-Transformer with 420 abnormal images, achieving 96.2% accuracy. Finally, a comparative study is performed to test generalizability with both H&E and immunohistochemical (IHC) stained images on a lymphoma image dataset, a breast cancer dataset and a cervical cancer dataset, producing comparable F1-scores (85.6%, 82.8% and 65.7%, respectively) and accuracies (83.9%, 89.4% and 65.7%, respectively). In conclusion, GasHis-Transformer demonstrates high classification performance and shows significant potential in histopathology image analysis.
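For reference, the four reported metrics follow the standard binary-classification definitions, which can be sketched as below. The function name and the confusion-matrix counts are hypothetical and are not taken from the paper's experiments; "abnormal" is assumed to be the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only (not the paper's results).
m = classification_metrics(tp=45, fp=5, fn=0, tn=50)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is a quick sanity check when reading reported metric tuples.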
Cervical cancer is a very common and fatal cancer in women, but it can be prevented through early examination and treatment. Cytopathology images are often used to screen for cancer. Because the large volume of such screening makes manual errors likely, computer-aided diagnosis systems based on deep learning have been developed. Deep learning methods usually require input images of a consistent size, but clinical medical images vary in size. Directly resizing an image loses internal information, so it is in principle unreasonable. Nevertheless, many studies resize images directly and still obtain robust results. To find a reasonable explanation, 22 deep learning models are used to process images of different scales, and experiments are conducted on the SIPaKMeD dataset. The conclusion is that deep learning methods are very robust to changes in image size. This conclusion is also validated on the Herlev dataset.
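To make "directly resizing" concrete, the following minimal sketch resizes a grayscale image to a fixed size with nearest-neighbor sampling. The interpolation method and the function name are assumptions; the abstract does not say which resizing scheme the studied models use.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a grayscale image (list of rows) with nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Hypothetical 2x4 image squeezed to 2x2: half of the columns are dropped
# and the aspect ratio changes, which is the information loss at issue.
wide = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
small = resize_nearest(wide, 2, 2)

# Upscaling a 2x2 image to 4x4 only repeats pixels; no detail is added.
big = resize_nearest([[1, 2], [3, 4]], 4, 4)
```

Feeding the same model with inputs resized from different source scales, as in this toy example, is one simple way to probe the size robustness that the abstract reports.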
Siamese trackers have recently been shown to be vulnerable to adversarial attacks. However, existing attack methods craft perturbations for each video independently, which comes at a non-negligible computational cost. In this paper, we show the existence of universal perturbations that enable a targeted attack, e.g., forcing a tracker to follow the ground-truth trajectory with specified offsets, to be video-agnostic and free from inference in a network. Specifically, we attack a tracker by adding a universal imperceptible perturbation to the template image and adding a fake target, i.e., a small universal adversarial patch, into the search images along the predefined trajectory, so that the tracker outputs the location and size of the fake target instead of the real target. Our approach allows a novel video to be perturbed at no additional cost beyond simple addition operations, and it requires neither gradient optimization nor network inference. Experimental results on several datasets demonstrate that our approach can effectively fool Siamese trackers in a targeted attack manner. We show that the proposed perturbations are not only universal across videos, but also generalize well across different trackers. Such perturbations are therefore doubly universal, both with respect to the data and the network architectures. We will make our code publicly available.
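The inference-time part of such an attack can be sketched as follows, assuming the universal perturbation and patch have already been learned offline. All names, shapes and values here are hypothetical, and the additive, clipped application of the patch is an assumption made to match the "addition operations" wording rather than a confirmed detail of the method.

```python
def clip(v):
    """Keep a pixel value inside the valid 0-255 range."""
    return min(255, max(0, v))


def perturb_template(template, delta):
    """Add a precomputed universal perturbation to the template image."""
    return [[clip(t + d) for t, d in zip(trow, drow)]
            for trow, drow in zip(template, delta)]


def add_patch(search, patch, top, left):
    """Add a precomputed universal patch to a search image at (top, left)."""
    out = [list(row) for row in search]
    for dy, prow in enumerate(patch):
        for dx, p in enumerate(prow):
            y, x = top + dy, left + dx
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = clip(out[y][x] + p)
    return out


def fake_trajectory(ground_truth, offset):
    """Shift each ground-truth position by a fixed (dy, dx) offset."""
    return [(y + offset[0], x + offset[1]) for y, x in ground_truth]


# Hypothetical toy data: small grayscale images and a two-frame trajectory.
template = [[100] * 4 for _ in range(4)]
delta = [[2, -2, 2, -2] for _ in range(4)]
patch = [[30, 30], [30, 30]]
frames = [[[50] * 8 for _ in range(8)] for _ in range(2)]

adv_template = perturb_template(template, delta)
targets = fake_trajectory([(1, 1), (2, 2)], offset=(3, 3))
adv_frames = [add_patch(f, patch, y, x) for f, (y, x) in zip(frames, targets)]
```

Nothing in this loop touches a network or computes gradients, which is the property the abstract emphasizes: the per-video cost reduces to element-wise additions.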
Recent one-stage object detectors follow a per-pixel prediction approach that predicts both the object category scores and boundary positions from every single grid location. However, the most suitable positions for inferring different targets, i.e., the object category and boundaries, are generally different. Predicting all these targets from the same grid location thus may lead to sub-optimal results. In this paper, we analyze the suitable inference positions for object category and boundaries, and propose a prediction-target-decoupled detector named PDNet to establish a more flexible detection paradigm. Our PDNet with the prediction decoupling mechanism encodes different targets separately in different locations. A learnable prediction collection module is devised with two sets of dynamic points, i.e., dynamic boundary points and semantic points, to collect and aggregate the predictions from the favorable regions for localization and classification. We adopt a two-step strategy to learn these dynamic point positions, where the prior positions are estimated for different targets first, and the network further predicts residual offsets to the positions with better perceptions of the object properties. Extensive experiments on the MS COCO benchmark demonstrate the effectiveness and efficiency of our method. With a single ResNeXt-64x4d-101 as the backbone, our detector achieves 48.7 AP with single-scale testing, which outperforms the state-of-the-art methods by an appreciable margin under the same experimental settings. Moreover, our detector is highly efficient as a one-stage framework. Our code will be public.
One-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years. However, current one-shot trackers rely solely on single-frame detections to predict candidate bounding boxes, which may be unreliable under severe visual degradation, e.g., motion blur and occlusion. Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet is no longer maintained, as shown in Fig. 1. In this paper, we set out to restore the misclassified bounding boxes, i.e., fake background, by proposing a re-check network. The re-check network propagates previous tracklets to the current frame by exploring the relation between cross-frame temporal cues and current candidates using a modified cross-correlation layer. The propagation results help to reload the "fake background" and eventually repair the broken tracklets. By inserting the re-check network into a strong baseline tracker, CSTrack (a variant of JDE), our model achieves MOTA gains of $70.7 \rightarrow 76.7$ and $70.6 \rightarrow 76.3$ on MOT16 and MOT17, respectively. Code is publicly available at https://github.com/JudasDie/SOTS.
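The basic idea of relating a previous tracklet to the current frame can be illustrated with a plain cross-correlation between a tracklet's ID embedding and the per-location embeddings of the current frame. This is only a generic sketch: the paper's layer is described as a modified cross-correlation, whose details are not given in the abstract, and all names and values below are hypothetical.

```python
def cross_correlate(embedding, feature_map):
    """Dot product of one tracklet embedding with every location of an
    H x W x C feature map, giving an H x W response map."""
    return [
        [sum(e * f for e, f in zip(embedding, cell)) for cell in row]
        for row in feature_map
    ]


def peak_location(response):
    """Return (row, col) of the strongest response."""
    best_y, best_x = 0, 0
    for y, row in enumerate(response):
        for x, v in enumerate(row):
            if v > response[best_y][best_x]:
                best_y, best_x = y, x
    return best_y, best_x


# Hypothetical 2-channel embedding from a previous frame and a 2x3 feature map.
tracklet_embedding = [1.0, 0.0]
current_features = [
    [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]],
    [[0.0, 1.0], [0.3, 0.3], [0.1, 0.5]],
]
response = cross_correlate(tracklet_embedding, current_features)
location = peak_location(response)
```

A high response at a location where the detector reported background would suggest a candidate "fake background" box worth restoring for that tracklet.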