Bo Du

BenchTemp: A General Benchmark for Evaluating Temporal Graph Neural Networks

Aug 31, 2023
Qiang Huang, Jiawei Jiang, Xi Susie Rao, Ce Zhang, Zhichao Han, Zitao Zhang, Xin Wang, Yongjun He, Quanqing Xu, Yang Zhao, Chuang Hu, Shuo Shang, Bo Du

To handle graphs in which features or connectivity evolve over time, a series of temporal graph neural networks (TGNNs) have been proposed. Despite the success of these TGNNs, previous evaluations reveal four critical issues: 1) inconsistent datasets, 2) inconsistent evaluation pipelines, 3) a lack of workload diversity, and 4) a lack of efficiency comparisons. Overall, an empirical study that puts TGNN models on the same ground and compares them comprehensively has been missing. To this end, we propose BenchTemp, a general benchmark for evaluating TGNN models on various workloads. BenchTemp provides a set of benchmark datasets so that different TGNN models can be compared fairly. Further, BenchTemp engineers a standard pipeline that unifies TGNN evaluation. With BenchTemp, we extensively compare representative TGNN models on different tasks (e.g., link prediction and node classification) and settings (transductive and inductive), with respect to both effectiveness and efficiency metrics. BenchTemp is publicly available at https://github.com/qianghuangwhu/benchtemp.
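
A minimal, self-contained sketch of one ingredient such a unified pipeline needs: a fixed chronological train/validation/test split over temporal edges, so every model under evaluation sees identical data. The function name and the 70/15/15 ratio are illustrative assumptions, not BenchTemp's actual API.

```python
import numpy as np

def chronological_split(timestamps: np.ndarray, val_ratio: float = 0.15,
                        test_ratio: float = 0.15):
    """Split temporal edges by time rather than at random: the earliest
    edges form the training set, the most recent ones validation/test."""
    order = np.argsort(timestamps)
    n = len(order)
    n_test, n_val = int(n * test_ratio), int(n * val_ratio)
    train = order[: n - n_val - n_test]
    val = order[n - n_val - n_test : n - n_test]
    test = order[n - n_test :]
    return train, val, test

# Identical edge indices for every model under evaluation.
ts = np.array([3.0, 1.0, 2.0, 5.0, 4.0, 6.0, 7.0])
print(chronological_split(ts))
```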

* 28 pages, 23 figures, 27 tables. Submitted to the Conference on Neural Information Processing Systems 2023 Track on Datasets and Benchmarks 

SAAN: Similarity-aware attention flow network for change detection with VHR remote sensing images

Aug 28, 2023
Haonan Guo, Xin Su, Chen Wu, Bo Du, Liangpei Zhang

Change detection (CD) is a fundamental task for monitoring land-surface dynamics in Earth observation. Existing deep learning-based CD methods typically extract bi-temporal image features with a weight-sharing Siamese encoder and identify change regions with a decoder. These methods, however, remain far from satisfactory: we observe that 1) deep encoder layers focus on irrelevant background regions, and 2) the model's confidence in the change regions is inconsistent across decoder stages. The first problem arises because deep encoder layers cannot effectively learn from imbalanced change categories using output supervision alone, while the second is attributed to the lack of explicit semantic consistency preservation. To address these issues, we design a novel similarity-aware attention flow network (SAAN), which incorporates a similarity-guided attention flow module with deeply supervised similarity optimization to achieve effective change detection. Specifically, we counter the first issue by explicitly guiding deep encoder layers to discover semantic relations between the bi-temporal input images through deeply supervised similarity optimization: the extracted features are optimized to be semantically similar in unchanged regions and dissimilar in changed regions. The second drawback is alleviated by the proposed similarity-guided attention flow module, which combines similarity-guided attention modules with attention flow mechanisms to steer the model toward discriminative channels and regions. We evaluate the effectiveness and generalization ability of the proposed method on a wide range of CD tasks. The experimental results demonstrate that our method achieves excellent performance on several CD tasks, with discriminative features and semantic consistency preserved.
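
A hedged PyTorch sketch of a deeply supervised similarity loss in the spirit described above: push bi-temporal features to be similar where nothing changed and dissimilar where change occurred. The margin and the exact form of the loss are assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def similarity_loss(feat_t1, feat_t2, change_mask, margin=0.5):
    """feat_t1, feat_t2: (B, C, H, W) bi-temporal features; change_mask:
    (B, 1, H, W) binary, 1 where change occurred, resized to the feature
    resolution for deep supervision."""
    cos = F.cosine_similarity(feat_t1, feat_t2, dim=1)  # (B, H, W)
    mask = change_mask.squeeze(1).float()
    # Unchanged pixels: drive bi-temporal similarity toward 1.
    loss_unchanged = ((1.0 - cos) * (1.0 - mask)).sum() / ((1.0 - mask).sum() + 1e-8)
    # Changed pixels: penalize similarity that exceeds the margin.
    loss_changed = (F.relu(cos - margin) * mask).sum() / (mask.sum() + 1e-8)
    return loss_unchanged + loss_changed

loss = similarity_loss(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32),
                       torch.randint(0, 2, (2, 1, 32, 32)))
```

Applied at several encoder depths, a term of this shape provides the deep supervision signal that steers deep layers away from irrelevant background.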

* 15 pages, 13 figures 

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

Aug 15, 2023
Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du, Hai Zhao

In recent years, multi-modal pre-trained Transformers have led to significant advances in visually-rich document understanding. However, existing models mainly focus on textual and visual features while neglecting the layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that models the layout structure graph to inject document layout knowledge into the model. GraphLayoutLM uses a graph reordering algorithm to adjust the text sequence according to the graph structure, and a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model captures the spatial arrangement of text elements, improving document comprehension. We evaluate our model on several benchmarks, including FUNSD, XFUND, and CORD, and achieve state-of-the-art results on these datasets. Our experimental results demonstrate that the proposed method significantly improves over existing approaches and showcases the importance of incorporating layout information into document understanding models. We also conduct an ablation study, which shows that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.
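
A self-contained sketch of layout-based reordering of text nodes. The paper's algorithm works on the layout structure graph; this simple bounding-box sort (top-to-bottom, then left-to-right, with a line tolerance) is an assumption standing in for it.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)

def reading_order(boxes: List[Box], line_tol: int = 10) -> List[int]:
    """Return node indices sorted into an approximate reading order.
    Boxes whose top edges differ by <= line_tol are treated as one line."""
    idx = sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
    order, current_line, line_top = [], [], None
    for i in idx:
        top = boxes[i][1]
        if line_top is None or top - line_top <= line_tol:
            current_line.append(i)
            line_top = top if line_top is None else line_top
        else:
            order.extend(sorted(current_line, key=lambda j: boxes[j][0]))
            current_line, line_top = [i], top
    order.extend(sorted(current_line, key=lambda j: boxes[j][0]))
    return order

# Three fields: two on one visual line, one below.
print(reading_order([(200, 12, 300, 30), (10, 10, 120, 30), (10, 60, 90, 80)]))
# -> [1, 0, 2]
```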

Rethinking the Localization in Weakly Supervised Object Localization

Aug 11, 2023
Rui Xu, Yong Luo, Han Hu, Bo Du, Jialie Shen, Yonggang Wen

Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision: localize objects in images given only image-level supervision. Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task. However, existing solutions under this pipeline suffer from two drawbacks: 1) they are inflexible, localizing only one object per image because of the single-class regression (SCR) adopted for localization; and 2) the generated pseudo bounding boxes may be noisy, and the negative impact of this noise is not well addressed. To remedy these drawbacks, we first replace SCR with a binary-class detector (BCD) that localizes multiple objects and is trained to discriminate foreground from background. We then design a weighted entropy (WE) loss that uses unlabeled data to reduce the negative impact of noisy bounding boxes. Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of our method.
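
A hedged PyTorch sketch of a weighted-entropy-style loss on unlabeled samples. The confidence-based weighting below is an assumption; the paper's exact WE formulation may differ.

```python
import torch

def weighted_entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """logits: (N, 2) foreground/background scores on unlabeled samples.
    Confident predictions get larger weights, so ambiguous samples arising
    from noisy pseudo boxes contribute less to training."""
    probs = torch.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)  # (N,)
    weight = probs.max(dim=1).values.detach()  # prediction confidence
    return (weight * entropy).mean()

print(weighted_entropy_loss(torch.randn(8, 2)))
```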

* Accepted by ACM International Conference on Multimedia 2023 

IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer

Aug 07, 2023
Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y. Al Hammadi, Jizhe Zhou

Advanced image tampering techniques are increasingly challenging the trustworthiness of multimedia, motivating the development of Image Manipulation Localization (IML). But what makes a good IML model? The answer lies in how artifacts are captured: exploiting artifacts requires the model to extract non-semantic discrepancies between manipulated and authentic regions, which necessitates explicit comparisons between the two areas. With its self-attention mechanism, the Transformer is a natural candidate for capturing such artifacts. However, due to limited datasets, there is currently no pure ViT-based approach for IML to serve as a benchmark, and CNNs, which suffer from weak long-range and non-semantic modeling, dominate the task. To bridge this gap, and based on the observations that artifacts are sensitive to image resolution, amplified under multi-scale features, and concentrated at manipulation borders, we answer the question above by building a ViT with high-resolution capacity, multi-scale feature extraction capability, and manipulation edge supervision that can converge on a small amount of data. We term this simple but effective ViT paradigm IML-ViT; it has significant potential to become a new benchmark for IML. Extensive experiments on five benchmark datasets verify that our model outperforms state-of-the-art manipulation localization methods. Code and models are available at https://github.com/SunnyHaze/IML-ViT.
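
A minimal PyTorch sketch of manipulation edge supervision: derive a border map from the ground-truth mask via morphological dilation/erosion and add a BCE term on it. The 3x3 kernel and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def edge_map(mask: torch.Tensor) -> torch.Tensor:
    """mask: (B, 1, H, W) binary manipulation mask -> border pixels."""
    kernel = torch.ones(1, 1, 3, 3, device=mask.device)
    dilated = (F.conv2d(mask.float(), kernel, padding=1) > 0).float()
    eroded = (F.conv2d(mask.float(), kernel, padding=1) == 9).float()
    return dilated - eroded  # 1 on the manipulation border

def iml_loss(pred_mask, pred_edge, gt_mask, edge_weight=1.0):
    loss_seg = F.binary_cross_entropy_with_logits(pred_mask, gt_mask.float())
    loss_edge = F.binary_cross_entropy_with_logits(pred_edge, edge_map(gt_mask))
    return loss_seg + edge_weight * loss_edge

gt = torch.zeros(1, 1, 16, 16); gt[:, :, 4:12, 4:12] = 1
print(iml_loss(torch.randn(1, 1, 16, 16), torch.randn(1, 1, 16, 16), gt))
```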

Scale-aware Test-time Click Adaptation for Pulmonary Nodule and Mass Segmentation

Jul 28, 2023
Zhihao Li, Jiancheng Yang, Yongchao Xu, Li Zhang, Wenhui Dong, Bo Du

Pulmonary nodules and masses are crucial imaging features in lung cancer screening that require careful management in clinical diagnosis. Despite the success of deep learning-based medical image segmentation, achieving robust performance on nodule and mass lesions of various sizes remains challenging. In this paper, we propose a multi-scale neural network with scale-aware test-time adaptation to address this challenge. Specifically, we introduce a Scale-aware Test-time Click Adaptation method that uses effortlessly obtainable lesion clicks as test-time cues to enhance segmentation performance, particularly for large lesions. The proposed method can be seamlessly integrated into existing networks. Extensive experiments on both open-source and in-house datasets consistently demonstrate the effectiveness of the proposed method over several CNN- and Transformer-based segmentation methods. Our code is available at https://github.com/SplinterLi/SaTTCA.
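
A hedged sketch of what click-based test-time adaptation can look like: given one click inside the lesion, take a few gradient steps at inference so the prediction agrees with the click. The step count, learning rate, and loss form are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def click_adapt(model, image, click_zyx, steps=5, lr=1e-4):
    """image: (1, C, D, H, W) CT volume; click_zyx: (z, y, x) voxel the
    clinician marked as lesion. Parameters are updated at inference time."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(image)  # assumed output: (1, 1, D, H, W) logits
        click_logit = logits[(0, 0) + tuple(click_zyx)]
        # The clicked voxel must be foreground: softplus(-x) = BCE for y=1.
        loss = F.softplus(-click_logit)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.sigmoid(model(image))
```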

* 11 pages, 3 figures, MICCAI 2023 

IML-ViT: Image Manipulation Localization by Vision Transformer

Jul 27, 2023
Xiaochen Ma, Bo Du, Xianggen Liu, Ahmed Y. Al Hammadi, Jizhe Zhou

Advanced image tampering techniques are increasingly challenging the trustworthiness of multimedia, motivating the development of Image Manipulation Localization (IML). But what makes a good IML model? The answer lies in how artifacts are captured: exploiting artifacts requires the model to extract non-semantic discrepancies between manipulated and authentic regions, which demands explicit comparison between the two areas. With its self-attention mechanism, the Transformer is naturally the best candidate. Moreover, artifacts are sensitive to image resolution, amplified under multi-scale features, and concentrated at manipulation borders. Therefore, we answer the question above by building a ViT with high-resolution capacity, multi-scale feature extraction capability, and manipulation edge supervision. We term this simple but effective ViT paradigm IML-ViT, which has great potential to become a new benchmark for IML. Extensive experiments on five benchmark datasets verify that our model outperforms state-of-the-art manipulation localization methods. Code and models are available at https://github.com/SunnyHaze/IML-ViT.
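
A self-contained sketch of the high-resolution input handling this abstract implies: rather than resizing (which destroys low-level tampering traces), pad each image onto a fixed large canvas. The 1024x1024 canvas size is an assumption.

```python
import torch
import torch.nn.functional as F

def pad_to_canvas(img: torch.Tensor, size: int = 1024) -> torch.Tensor:
    """img: (C, H, W) with H, W <= size. Zero-pad to (C, size, size) so
    pixel-level artifacts are preserved at native resolution."""
    _, h, w = img.shape
    return F.pad(img, (0, size - w, 0, size - h))  # (left, right, top, bottom)

print(pad_to_canvas(torch.randn(3, 600, 800)).shape)  # torch.Size([3, 1024, 1024])
```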

PNT-Edge: Towards Robust Edge Detection with Noisy Labels by Learning Pixel-level Noise Transitions

Jul 26, 2023
Wenjie Xuan, Shanshan Zhao, Yu Yao, Juhua Liu, Tongliang Liu, Yixin Chen, Bo Du, Dacheng Tao

Relying on large-scale training data with pixel-level labels, previous edge detection methods have achieved high performance. However, it is hard to label edges accurately by hand, especially for large datasets, so the datasets inevitably contain noisy labels. This label-noise issue has been studied extensively for classification but remains under-explored for edge detection. To address it, this paper proposes to learn Pixel-level Noise Transitions that model the label-corruption process. To this end, we develop a novel Pixel-wise Shift Learning (PSL) module that estimates the transition from clean to noisy labels as a displacement field. Exploiting the estimated noise transitions, our model, named PNT-Edge, is able to fit its predictions to clean labels. In addition, a local edge-density regularization term exploits local structure information for better transition learning, encouraging large shifts for edges with complex local structures. Experiments on SBD and Cityscapes demonstrate the effectiveness of our method in relieving the impact of label noise. Code will be made available on GitHub.
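
A hedged PyTorch sketch of the displacement-field idea: treat the clean-to-noisy corruption as a per-pixel shift and warp the prediction with it before comparing against noisy labels. The bilinear warping via grid_sample is an assumption about the implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_field(pred: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """pred: (B, 1, H, W) edge probabilities; flow: (B, 2, H, W) predicted
    per-pixel (x, y) displacements in pixels. Resamples pred along flow."""
    b, _, h, w = pred.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2) + flow.permute(0, 2, 3, 1)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = 2.0 * grid / torch.tensor([w - 1, h - 1]) - 1.0
    return F.grid_sample(pred, grid, align_corners=True)

pred = torch.rand(1, 1, 8, 8)
same = warp_with_field(pred, torch.zeros(1, 2, 8, 8))
print(torch.allclose(pred, same, atol=1e-5))  # True: zero shift is identity
```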

Building-road Collaborative Extraction from Remotely Sensed Images via Cross-Interaction

Jul 23, 2023
Haonan Guo, Xin Su, Chen Wu, Bo Du, Liangpei Zhang

Buildings are the basic carriers of social production and human life; roads are the links that interconnect social networks. Building and road information has important applications in regional coordinated development, disaster prevention, autonomous driving, and other frontier fields, so mapping buildings and roads from very high-resolution (VHR) remote sensing images has become a hot research topic. However, existing methods often ignore the strong spatial correlation between roads and buildings and extract them in isolation. To fully exploit the complementary advantages of buildings and roads, we propose a building-road collaborative extraction method based on multi-task learning and cross-scale feature interaction that improves the accuracy of both tasks in a complementary way. A multi-task interaction module exchanges information across tasks while preserving the information unique to each task, which tackles the seesaw phenomenon in multi-task learning. Considering the differences in appearance and structure between buildings and roads, a cross-scale interaction module automatically learns the optimal receptive field for each task. Compared with many existing methods that train each task individually, the proposed method exploits the complementary advantages of buildings and roads through the proposed inter-task and inter-scale feature interactions and automatically selects the optimal receptive field for each task. Experiments on a wide range of urban and rural scenarios show that the proposed algorithm achieves building-road extraction with outstanding performance and efficiency.
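
A hedged PyTorch sketch of a multi-task interaction module: exchange information between the building and road branches through learned gates while keeping a task-specific residual path. The gating design is an assumption standing in for the paper's module.

```python
import torch
import torch.nn as nn

class TaskInteraction(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate_b = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_r = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_building, feat_road):
        # Each branch keeps its own features (residual path) and receives a
        # gated view of the other task's features, so shared information flows
        # across tasks without overwriting task-specific cues.
        out_b = feat_building + self.gate_b(feat_road) * feat_road
        out_r = feat_road + self.gate_r(feat_building) * feat_building
        return out_b, out_r

m = TaskInteraction(64)
b, r = m(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(b.shape, r.shape)
```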

* 34 pages, 9 figures, submitted to ISPRS Journal of Photogrammetry and Remote Sensing 

Expediting Building Footprint Segmentation from High-resolution Remote Sensing Images via progressive lenient supervision

Jul 23, 2023
Haonan Guo, Bo Du, Chen Wu, Xin Su, Liangpei Zhang

The efficacy of building footprint segmentation from remotely sensed images has been hindered by limited model transfer effectiveness. Many existing building segmentation methods build on the encoder-decoder architecture of U-Net, in which the encoder is fine-tuned from newly developed backbone networks pre-trained on ImageNet. However, the heavy computational burden of existing decoder designs hampers the successful transfer of these modern encoder networks to remote sensing tasks. Even the widely adopted deep supervision strategy fails to mitigate these challenges, because its loss is invalid in hybrid regions where foreground and background pixels are intermixed. In this paper, we conduct a comprehensive evaluation of existing decoder designs for building footprint segmentation and propose an efficient framework, denoted BFSeg, to enhance learning efficiency and effectiveness. Specifically, we propose a densely connected, coarse-to-fine feature fusion decoder that enables easy and fast feature fusion across scales. Moreover, considering the invalidity of hybrid regions in the down-sampled ground truth during deep supervision, we present a lenient deep supervision and distillation strategy that enables the network to learn proper knowledge from deep supervision. Building upon these advancements, we develop a new family of building segmentation networks that consistently surpass prior works in performance and efficiency across a wide range of newly developed encoder networks. The code will be released at https://github.com/HaonanGuo/BFSeg-Efficient-Building-Footprint-Segmentation-Framework.
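
A minimal PyTorch sketch of "lenient" deep supervision: when the ground truth is down-sampled for an auxiliary head, cells whose local window mixes foreground and background become ambiguous, so their loss is masked out. The pooling-fraction thresholds here are assumptions about the criterion.

```python
import torch
import torch.nn.functional as F

def lenient_deep_supervision_loss(aux_logits, gt_mask, stride: int):
    """aux_logits: (B, 1, H/s, W/s) auxiliary head output;
    gt_mask: (B, 1, H, W) binary ground truth."""
    frac = F.avg_pool2d(gt_mask.float(), kernel_size=stride)  # fg fraction per cell
    target = (frac > 0.5).float()
    # Hybrid cells (neither clearly fg nor clearly bg) are excluded.
    valid = ((frac < 0.25) | (frac > 0.75)).float()
    loss = F.binary_cross_entropy_with_logits(aux_logits, target, reduction="none")
    return (loss * valid).sum() / (valid.sum() + 1e-8)

gt = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(lenient_deep_supervision_loss(torch.randn(1, 1, 16, 16), gt, stride=4))
```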

* 13 pages, 8 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems 