In the field of document understanding, significant advances have been made in the fine-tuning of Multimodal Large Language Models (MLLMs) with instruction-following data. Nevertheless, the potential of text-grounding capability within text-rich scenarios remains underexplored. In this paper, we present a text-grounding document understanding model, termed TGDoc, which addresses this deficiency by enhancing MLLMs with the ability to discern the spatial positioning of text within images. Empirical evidence suggests that text-grounding improves the model's interpretation of textual content, thereby elevating its proficiency in comprehending text-rich images. Specifically, we compile a dataset containing 99K PowerPoint presentations sourced from the internet. We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model. Moreover, we curate a collection of text-rich images and prompt the text-only GPT-4 to generate 12K high-quality conversations, featuring textual locations within text-rich scenarios. By integrating text location data into the instructions, TGDoc is adept at discerning text locations during the visual question process. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
Single-image shadow removal is a significant task that is still unresolved. Most existing deep learning-based approaches attempt to remove the shadow directly, which can not deal with the shadow well. To handle this issue, we consider removing the shadow in a coarse-to-fine fashion and propose a simple but effective Progressive Recurrent Network (PRNet). The network aims to remove the shadow progressively, enabing us to flexibly adjust the number of iterations to strike a balance between performance and time. Our network comprises two parts: shadow feature extraction and progressive shadow removal. Specifically, the first part is a shallow ResNet which constructs the representations of the input shadow image on its original size, preventing the loss of high-frequency details caused by the downsampling operation. The second part has two critical components: the re-integration module and the update module. The proposed re-integration module can fully use the outputs of the previous iteration, providing input for the update module for further shadow removal. In this way, the proposed PRNet makes the whole process more concise and only uses 29% network parameters than the best published method. Extensive experiments on the three benchmarks, ISTD, ISTD+, and SRD, demonstrate that our method can effectively remove shadows and achieve superior performance.
Segment anything model (SAM) has achieved great success in the field of natural image segmentation. Nevertheless, SAM tends to classify shadows as background, resulting in poor segmentation performance for shadow detection task. In this paper, we propose an simple but effective approach for fine tuning SAM to detect shadows. Additionally, we also combine it with long short-term attention mechanism to extend its capabilities to video shadow detection. Specifically, we first fine tune SAM by utilizing shadow data combined with sparse prompts and apply the fine-tuned model to detect a specific frame (e.g., first frame) in the video with a little user assistance. Subsequently, using the detected frame as a reference, we employ a long short-term network to learn spatial correlations between distant frames and temporal consistency between contiguous frames, thereby achieving shadow information propagation across frames. Extensive experimental results demonstrate that our method outperforms the state-of-the-art techniques, with improvements of 17.2% and 3.3% in terms of MAE and IoU, respectively, validating the effectiveness of our method.
Document images captured by mobile devices are usually degraded by uncontrollable illumination, which hampers the clarity of document content. Recently, a series of research efforts have been devoted to correcting the uneven document illumination. However, existing methods rarely consider the use of ambient light information, and usually rely on paired samples including degraded and the corrected ground-truth images which are not always accessible. To this end, we propose UDoc-GAN, the first framework to address the problem of document illumination correction under the unpaired setting. Specifically, we first predict the ambient light features of the document. Then, according to the characteristics of different level of ambient lights, we re-formulate the cycle consistency constraint to learn the underlying relationship between normal and abnormal illumination domains. To prove the effectiveness of our approach, we conduct extensive experiments on DocProj dataset under the unpaired setting. Compared with the state-of-the-art approaches, our method demonstrates promising performance in terms of character error rate (CER) and edit distance (ED), together with better qualitative results for textual detail preservation. The source code is now publicly available at https://github.com/harrytea/UDoc-GAN.