The extraction of lakes from remote sensing images is a complex challenge due to the varied lake shapes and data noise. Current methods rely on multispectral image datasets, making it challenging to learn lake features accurately from pixel arrangements. This, in turn, affects model learning and the creation of accurate segmentation masks. This paper introduces a unified prompt-based dataset construction approach that provides approximate lake locations using point, box, and mask prompts. We also propose a two-stage prompt enhancement framework, LEPrompter, which involves prompt-based and prompt-free stages during training. The prompt-based stage employs a prompt encoder to extract prior information, integrating prompt tokens and image embeddings through self- and cross-attention in the prompt decoder. Prompts are deactivated once the model is trained to ensure independence during inference, enabling automated lake extraction. Evaluations on Surface Water and Qinghai-Tibet Plateau Lake datasets show consistent performance improvements compared to the previous state-of-the-art method. LEPrompter achieves mIoU scores of 91.48% and 97.43% on the respective datasets without introducing additional parameters or GFLOPs. Supplementary materials provide the source code, pre-trained models, and detailed user studies.
Lake extraction from remote sensing imagery is challenging due to the complex shapes of lakes and the presence of noise. Existing methods suffer from blurred segmentation boundaries and poor foreground modeling. In this paper, we propose a hybrid CNN-Transformer architecture, called LEFormer, for accurate lake extraction. LEFormer contains four main modules: CNN encoder, Transformer encoder, cross-encoder fusion, and lightweight decoder. The CNN encoder recovers local spatial information and improves fine-scale details. Simultaneously, the Transformer encoder captures long-range dependencies between sequences of any length, allowing them to obtain global features and context information better. Finally, a lightweight decoder is employed for mask prediction. We evaluate the performance and efficiency of LEFormer on two datasets, the Surface Water (SW) and the Qinghai-Tibet Plateau Lake (QTPL). Experimental results show that LEFormer consistently achieves state-of-the-art (SOTA) performance and efficiency on these two datasets, outperforming existing methods. Specifically, LEFormer achieves 90.86% and 97.42% mIoU on the SW and QTPL datasets with a parameter count of 3.61M, respectively, while being 20x minor than the previous SOTA method.
Accurate detection and localization of X-corner on both planar and non-planar patterns is a core step in robotics and machine vision. However, previous works could not make a good balance between accuracy and robustness, which are both crucial criteria to evaluate the detectors performance. To address this problem, in this paper we present a novel detection algorithm which can maintain high sub-pixel precision on inputs under multiple interference, such as lens distortion, extreme poses and noise. The whole algorithm, adopting a coarse-to-fine strategy, contains a X-corner detection network and three post-processing techniques to distinguish the correct corner candidates, as well as a mixed sub-pixel refinement technique and an improved region growth strategy to recover the checkerboard pattern partially visible or occluded automatically. Evaluations on real and synthetic images indicate that the presented algorithm has the higher detection rate, sub-pixel accuracy and robustness than other commonly used methods. Finally, experiments of camera calibration and pose estimation verify it can also get smaller re-projection error in quantitative comparisons to the state-of-the-art.
The terahertz (THz) band (0.1-10 THz) is widely considered to be a candidate band for the sixth-generation mobile communication technology (6G). However, due to its short wavelength (less than 1 mm), scattering becomes a particularly significant propagation mechanism. In previous studies, we proposed a scattering model to characterize the scattering in THz bands, which can only reconstruct the scattering in the incidence plane. In this paper, a three-dimensional (3D) stochastic model is proposed to characterize the THz scattering on rough surfaces. Then, we reconstruct the scattering on rough surfaces with different shapes and under different incidence angles utilizing the proposed model. Good agreements can be achieved between the proposed model and full-wave simulation results. This stochastic 3D scattering model can be integrated into the standard channel modeling framework to realize more realistic THz channel data for the evaluation of 6G.
Same-style products retrieval plays an important role in e-commerce platforms, aiming to identify the same products which may have different text descriptions or images. It can be used for similar products retrieval from different suppliers or duplicate products detection of one supplier. Common methods use the image as the detected object, but they only consider the visual features and overlook the attribute information contained in the textual descriptions, and perform weakly for products in image less important industries like machinery, hardware tools and electronic component, even if an additional text matching module is added. In this paper, we propose a unified vision-language modeling method for e-commerce same-style products retrieval, which is designed to represent one product with its textual descriptions and visual contents. It contains one sampling skill to collect positive pairs from user click log with category and relevance constrained, and a novel contrastive loss unit to model the image, text, and image+text representations into one joint embedding space. It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search. Offline evaluations on annotated data demonstrate its superior retrieval performance, and online testings show it can attract more clicks and conversions. Moreover, this model has already been deployed online for similar products retrieval in alibaba.com, the largest B2B e-commerce platform in the world.
Aiming to improve the checkerboard corner detection robustness against the images with poor quality, such as lens distortion, extreme poses, and noise, we propose a novel detection algorithm which can maintain high accuracy on inputs under multiply scenarios without any prior knowledge of the checkerboard pattern. This whole algorithm includes a checkerboard corner detection network and some post-processing techniques. The network model is a fully convolutional network with improvements of loss function and learning rate, which can deal with the images of arbitrary size and produce correspondingly-sized output with a corner score on each pixel by efficient inference and learning. Besides, in order to remove the false positives, we employ three post-processing techniques including threshold related to maximum response, non-maximum suppression, and clustering. Evaluations on two different datasets show its superior robustness, accuracy and wide applicability in quantitative comparisons with the state-of-the-art methods, like MATE, ChESS, ROCHADE and OCamCalib.
An obstacle to scientific document understanding is the extensive use of acronyms which are shortened forms of long technical phrases. Acronym disambiguation aims to find the correct meaning of an ambiguous acronym in a given text. Recent efforts attempted to incorporate word embeddings and deep learning architectures, and achieved significant effects in this task. In general domains, kinds of fine-grained pretrained language models have sprung up, thanks to the largescale corpora which can usually be obtained through crowdsourcing. However, these models based on domain agnostic knowledge might achieve insufficient performance when directly applied to the scientific domain. Moreover, obtaining large-scale high-quality annotated data and representing high-level semantics in the scientific domain is challenging and expensive. In this paper, we consider both the domain agnostic and specific knowledge, and propose a Hierarchical Dual-path BERT method coined hdBERT to capture the general fine-grained and high-level specific representations for acronym disambiguation. First, the context-based pretrained models, RoBERTa and SciBERT, are elaborately involved in encoding these two kinds of knowledge respectively. Second, multiple layer perceptron is devised to integrate the dualpath representations simultaneously and outputs the prediction. With a widely adopted SciAD dataset contained 62,441 sentences, we investigate the effectiveness of hdBERT. The experimental results exhibit that the proposed approach outperforms state-of-the-art methods among various evaluation metrics. Specifically, its macro F1 achieves 93.73%.
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to random masking strategy of recent VL models, we design alignment guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color for self-supervised VL pre-training at patches of different scale. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework, it attains new state-of-the-art results by large margins on four downstream tasks, including text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% abs imv.), category recognition (ACC: 3.28% abs imv.), and fashion captioning (Bleu4: 1.2 abs imv.). We validate the efficiency of Kaleido-BERT on a wide range of e-commerical websites, demonstrating its broader potential in real-world applications.