Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lianwen Jin

DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Aug 27, 2024

Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin

Figure 1 for DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Figure 2 for DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Figure 3 for DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Figure 4 for DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Abstract:Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content. With the rapid evolution of large language models (LLMs), they have been widely leveraged for TDU due to their remarkable versatility and generalization. In this paper, we introduce DocLayLLM, an efficient and effective multi-modal extension of LLMs specifically designed for TDU. By integrating visual patch tokens and 2D positional tokens into LLMs and encoding the document content using the LLMs themselves, we fully take advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also deeply considered the role of the chain-of-thought (CoT) and innovatively proposed the techniques of CoT Pre-training and CoT Annealing. Our DocLayLLM can achieve remarkable performances with lightweight training settings, showcasing its efficiency and effectiveness. Experimental results demonstrate that our DocLayLLM surpasses existing OCR-dependent methods and also outperforms OCR-free competitors.

Via

Access Paper or Ask Questions

Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Aug 09, 2024

Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai

Figure 1 for Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Figure 2 for Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Figure 3 for Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Figure 4 for Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Abstract:Recently, there has been significant interest in enhancing the capability of multimodal large language models (MLLMs) to process high-resolution images. Most existing methods focus on adopting a cropping strategy to improve the ability of multimodal large language models to understand image details. However, this cropping operation inevitably causes the segmentation of objects and connected areas, which impairs the MLLM's ability to recognize small or irregularly shaped objects or text. This issue is particularly evident in lightweight MLLMs. Addressing this issue, we propose Mini-Monkey, a lightweight MLLM that incorporates a plug-and-play method called multi-scale adaptive crop strategy (MSAC). Mini-Monkey adaptively generates multi-scale representations, allowing it to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which effectively compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It not only demonstrates leading performance on a variety of general multimodal understanding tasks but also shows consistent improvements in document understanding capabilities. On the OCRBench, Mini-Monkey achieves a score of 802, outperforming 8B-parameter state-of-the-art model InternVL2-8B. Besides, our model and training strategy are very efficient, which can be trained with only eight RTX 3090. The code is available at https://github.com/Yuliang-Liu/Monkey.

Via

Access Paper or Ask Questions

LEGO: Self-Supervised Representation Learning for Scene Text Images

Aug 04, 2024

Yujin Ren, Jiaxin Zhang, Lianwen Jin

Figure 1 for LEGO: Self-Supervised Representation Learning for Scene Text Images

Figure 2 for LEGO: Self-Supervised Representation Learning for Scene Text Images

Figure 3 for LEGO: Self-Supervised Representation Learning for Scene Text Images

Figure 4 for LEGO: Self-Supervised Representation Learning for Scene Text Images

Abstract:In recent years, significant progress has been made in scene text recognition by data-driven methods. However, due to the scarcity of annotated real-world data, the training of these methods predominantly relies on synthetic data. The distribution gap between synthetic and real data constrains the further performance improvement of these methods in real-world applications. To tackle this problem, a highly promising approach is to utilize massive amounts of unlabeled real data for self-supervised training, which has been widely proven effective in many NLP and CV tasks. Nevertheless, generic self-supervised methods are unsuitable for scene text images due to their sequential nature. To address this issue, we propose a Local Explicit and Global Order-aware self-supervised representation learning method (LEGO) that accounts for the characteristics of scene text images. Inspired by the human cognitive process of learning words, which involves spelling, reading, and writing, we propose three novel pre-text tasks for LEGO to model sequential, semantic, and structural features, respectively. The entire pre-training process is optimized by using a consistent Text Knowledge Codebook. Extensive experiments validate that LEGO outperforms previous scene text self-supervised methods. The recognizer incorporated with our pre-trained model achieves superior or comparable performance compared to state-of-the-art scene text recognition methods on six benchmarks. Furthermore, we demonstrate that LEGO can achieve superior performance in other text-related tasks.

Via

Access Paper or Ask Questions

Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping

Aug 04, 2024

Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai

Figure 1 for Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping

Figure 2 for Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping

Figure 3 for Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping

Figure 4 for Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping

Via

Access Paper or Ask Questions

Generalized Tampered Scene Text Detection in the era of Generative AI

Jul 31, 2024

Chenfan Qu, Yiwu Zhong, Fengjun Guo, Lianwen Jin

Figure 1 for Generalized Tampered Scene Text Detection in the era of Generative AI

Figure 2 for Generalized Tampered Scene Text Detection in the era of Generative AI

Figure 3 for Generalized Tampered Scene Text Detection in the era of Generative AI

Figure 4 for Generalized Tampered Scene Text Detection in the era of Generative AI

Abstract:The rapid advancements of generative AI have fueled the potential of generative text image editing while simultaneously escalating the threat of misinformation spreading. However, existing forensics methods struggle to detect unseen forgery types that they have not been trained on, leaving the development of a model capable of generalized detection of tampered scene text as an unresolved issue. To tackle this, we propose a novel task: open-set tampered scene text detection, which evaluates forensics models on their ability to identify both seen and previously unseen forgery types. We have curated a comprehensive, high-quality dataset, featuring the texts tampered by eight text editing models, to thoroughly assess the open-set generalization capabilities. Further, we introduce a novel and effective pre-training paradigm that subtly alters the texture of selected texts within an image and trains the model to identify these regions. This approach not only mitigates the scarcity of high-quality training data but also enhances models' fine-grained perception and open-set generalization abilities. Additionally, we present DAF, a novel framework that improves open-set generalization by distinguishing between the features of authentic and tampered text, rather than focusing solely on the tampered text's features. Our extensive experiments validate the remarkable efficacy of our methods. For example, our zero-shot performance can even beat the previous state-of-the-art full-shot model by a large margin. Our dataset and code will be open-source.

Via

Access Paper or Ask Questions

TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models

Jul 04, 2024

Jiahuan Cao, Dezhi Peng, Peirong Zhang, Yongxin Shi, Yang Liu, Kai Ding, Lianwen Jin

Abstract:Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern people without specialized knowledge. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowledge-intensive tasks. In response to this dilemma, we propose \textbf{TongGu} (mean understanding ancient and modern), the first CCU-specific LLM, underpinned by three core contributions. First, we construct a two-stage instruction-tuning dataset ACCN-INS derived from rich classical Chinese corpora, aiming to unlock the full CCU potential of LLMs. Second, we propose Redundancy-Aware Tuning (RAT) to prevent catastrophic forgetting, enabling TongGu to acquire new capabilities while preserving its foundational knowledge. Third, we present a CCU Retrieval-Augmented Generation (CCU-RAG) technique to reduce hallucinations based on knowledge-grounding. Extensive experiments across 24 diverse CCU tasks validate TongGu's superior ability, underscoring the effectiveness of RAT and CCU-RAG. The model and dataset will be public available.

Via

Access Paper or Ask Questions

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Jun 27, 2024

Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin

Figure 1 for DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Figure 2 for DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Figure 3 for DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Figure 4 for DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Abstract:Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. DocKylin utilizes an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, DocKylin incorporates a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to create a compressed, adaptive visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks. Notably, both the proposed APS and DTS are parameter-free, facilitating easy integration into existing MLLMs, and our experiments indicate their potential for broader applications.

Via

Access Paper or Ask Questions

Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Jun 05, 2024

Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

Figure 1 for Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Figure 2 for Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Figure 3 for Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Figure 4 for Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Abstract:Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. However, due to the great antiquity of the era, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction. We deconstruct OBI into foundational strokes and radicals, then employ a Transformer model to reconstruct them into their modern (conterpart)\textcolor{blue}{counterparts}, offering a groundbreaking solution to ancient script analysis. To further this endeavor, a new Ancient Chinese Character Puzzles (ACCP) dataset was developed, comprising an extensive collection of character images from seven key historical stages, annotated with detailed radical sequences. The experiments have showcased considerable promising insights, underscoring the potential and effectiveness of our approach in deciphering the intricacies of ancient Chinese scripts. Through this novel dataset and methodology, we aim to bridge the gap between traditional philology and modern document analysis techniques, offering new insights into the rich history of Chinese linguistic heritage.

* ICDAR 2024

Via

Access Paper or Ask Questions

Deciphering Oracle Bone Language with Diffusion Models

Jun 02, 2024

Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

Figure 1 for Deciphering Oracle Bone Language with Diffusion Models

Figure 2 for Deciphering Oracle Bone Language with Diffusion Models

Figure 3 for Deciphering Oracle Bone Language with Diffusion Models

Figure 4 for Deciphering Oracle Bone Language with Diffusion Models

Abstract:Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at https://github.com/guanhaisu/OBSD.

* ACL2024 main conference long paper

Via

Access Paper or Ask Questions

C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models

May 28, 2024

Jiahuan Cao, Yongxin Shi, Dezhi Peng, Yang Liu, Lianwen Jin

Abstract:Classical Chinese Understanding (CCU) holds significant value in preserving and exploration of the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks, including classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also gain some findings. Specifically, existing LLMs are struggle with CCU tasks and still inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at \url{https://github.com/SCUT-DLVCLab/C3bench}.

* 4 figures and 5 tables

Via

Access Paper or Ask Questions