Bin Wang

and Other Contributors

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Oct 16, 2024

Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He

Abstract: Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.

* GitHub Repo: https://github.com/opendatalab/DocLayout-YOLO
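
The abstract frames document synthesis as two-dimensional bin packing. The sketch below is a generic greedy best-fit packer with guillotine splitting, not the paper's Mesh-candidate BestFit algorithm, whose details the abstract does not give. The names `Rect`, `best_fit`, `split_slot`, and `pack_page` are hypothetical, and elements are assumed to be plain `(width, height)` pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float


def best_fit(slot, elements):
    """Index of the element that fits in `slot` and fills most of it, or None."""
    best_idx, best_fill = None, 0.0
    for i, (w, h) in enumerate(elements):
        if w <= slot.w and h <= slot.h:
            fill = (w * h) / (slot.w * slot.h)
            if fill > best_fill:
                best_idx, best_fill = i, fill
    return best_idx


def split_slot(slot, w, h):
    """Guillotine split: free space right of and below an element placed top-left."""
    free = []
    if slot.w - w > 0:
        free.append(Rect(slot.x + w, slot.y, slot.w - w, h))
    if slot.h - h > 0:
        free.append(Rect(slot.x, slot.y + h, slot.w, slot.h - h))
    return free


def pack_page(page_w, page_h, elements):
    """Greedily place (w, h) elements on one page; returns the placed rectangles."""
    remaining = list(elements)
    slots = [Rect(0, 0, page_w, page_h)]
    placed = []
    while slots and remaining:
        slot = slots.pop(0)
        idx = best_fit(slot, remaining)
        if idx is None:
            continue  # nothing fits this free slot; leave it empty
        w, h = remaining.pop(idx)
        placed.append(Rect(slot.x, slot.y, w, h))
        slots.extend(split_slot(slot, w, h))
    return placed
```

Calling `pack_page(10, 10, [(10, 4), (5, 6), (5, 6), (3, 3)])` places the three larger elements and leaves the 3x3 element unplaced, since no free slot remains once the page is full.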

3-D Magnetotelluric Deep Learning Inversion Guided by Pseudo-Physical Information

Oct 12, 2024

Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel

Abstract: Magnetotelluric (MT) deep learning (DL) inversion methods that are jointly data-driven and physics-driven have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response, which introduces physical information about electromagnetic field propagation, can significantly enhance the inversion accuracy. To efficiently achieve data-physical dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. Results from theoretical experiments indicate that, despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.
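
A minimal sketch of the two-term loss the abstract describes, under stated assumptions: `forward` stands in for the pre-trained, frozen forward modeling network, both terms are taken to be mean squared errors, and `weight` is a hypothetical balancing factor. The abstract specifies none of these choices.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dual_driven_loss(pred_model, true_model, observed, forward, weight=0.5):
    """Data term plus pseudo-physical term.

    pred_model: resistivity model predicted by the inversion network
    true_model: reference resistivity model (data-driven supervision)
    observed:   observation data the prediction should reproduce
    forward:    frozen surrogate mapping a resistivity model to a response
                (assumed stand-in for the pre-trained forward network)
    """
    data_term = mse(pred_model, true_model)
    physics_term = mse(forward(pred_model), observed)
    return data_term + weight * physics_term
```

Only the inversion network would be updated when minimizing this loss; the surrogate stays fixed, which matches the abstract's description of the forward networks as fixed operators.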

CALoR: Towards Comprehensive Model Inversion Defense

Oct 08, 2024

Hongyao Yu, Yixiang Qiu, Hao Fang, Bin Chen, Sijin Yu, Bin Wang, Shu-Tao Xia, Ke Xu

Abstract: Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in the released machine learning models. Recent advances in the MIA field have significantly enhanced the attack performance under multiple scenarios, posing serious privacy risks to Deep Neural Networks (DNNs). However, defense strategies against MIAs have lagged behind and cannot resist the latest attacks, and existing defenses fail to achieve a better trade-off between model utility and model robustness. In this paper, we provide an in-depth analysis from the perspective of intrinsic vulnerabilities of MIAs, comprehensively uncovering the weaknesses inherent in the basic pipeline, which previous defenses have only partially investigated. Building upon these new insights, we propose a robust defense mechanism integrating Confidence Adaptation and Low-Rank compression (CALoR). Our method includes a novel robustness-enhanced classification loss specially designed for model inversion defenses and reveals the extraordinary effectiveness of compressing the classification header. With CALoR, we can mislead the optimization objective, reduce the leaked information, and impede the backpropagation of MIAs, thus mitigating the risk of privacy leakage. Extensive experimental results demonstrate that our method achieves state-of-the-art (SOTA) defense performance against MIAs and generalizes better than existing defenses across various scenarios.

* 26 pages
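
To illustrate what compressing the classification header can mean, the sketch below replaces a full linear head with a rank-`r` bottleneck and compares parameter counts. This factorized form is an assumption for illustration; the abstract does not say how CALoR performs the compression. `matvec`, `low_rank_head`, and `param_counts` are hypothetical names.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def low_rank_head(features, down, up):
    """Logits through a bottleneck: up @ (down @ features).

    down: r x d projection into the low-rank space
    up:   k x r projection from the low-rank space to k classes
    """
    return matvec(up, matvec(down, features))


def param_counts(feature_dim, num_classes, rank):
    """Parameters of a full d x k head versus its rank-r factorization."""
    full = feature_dim * num_classes
    factored = rank * (feature_dim + num_classes)
    return full, factored
```

For a 512-dimensional feature vector and 1000 classes, `param_counts(512, 1000, 16)` gives 512000 parameters for the full head against 24192 for the rank-16 version, so the head passes far less information per output.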

MinerU: An Open-Source Solution for Precise Document Content Extraction

Sep 27, 2024

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang (+8 more)

Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.

* MinerU Technical Report

PMSS: Pretrained Matrices Skeleton Selection for LLM Fine-tuning

Sep 25, 2024

Qibin Wang, Xiaolin Hu, Weikai Xu, Wei Liu, Jian Luan, Bin Wang

Abstract: Low-rank adaptation (LoRA) and its variants have recently gained much interest due to their ability to avoid excessive inference costs. However, LoRA still encounters the following challenges: (1) the limitation of the low-rank assumption; and (2) its initialization method may be suboptimal. To this end, we propose PMSS (Pre-trained Matrices Skeleton Selection), which enables high-rank updates at low cost while leveraging the semantic and linguistic information inherent in pre-trained weights. It achieves this by selecting skeletons from the pre-trained weight matrix and learning only a small matrix instead. Experiments demonstrate that PMSS outperforms LoRA and other fine-tuning methods across tasks with far fewer trainable parameters. We demonstrate its effectiveness, especially in handling complex tasks such as the DROP benchmark (+3.4%/+5.9% on LLaMA2-7B/13B) and math reasoning (+12.89%/+5.61%/+3.11% on LLaMA2-7B, Mistral-7B and Gemma-7B of GSM8K). The code and model will be released soon.
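
One possible reading of "selecting skeletons from the pre-trained weight matrix and learning only a small matrix" is a CUR-style update: selected columns `C` and rows `R` of the frozen weight stay fixed, and only a small `k x k` matrix `U` is trained. The sketch below follows that reading. Both the CUR form and the norm-based selection rule are assumptions; the abstract confirms neither, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def top_indices(norms, k):
    """Indices of the k largest values, returned in ascending index order."""
    return sorted(sorted(range(len(norms)), key=lambda i: -norms[i])[:k])


def select_skeleton(weight, k):
    """Pick k columns (C) and k rows (R) with the largest squared norms."""
    row_norms = [sum(v * v for v in row) for row in weight]
    col_norms = [sum(row[j] ** 2 for row in weight) for j in range(len(weight[0]))]
    rows, cols = top_indices(row_norms, k), top_indices(col_norms, k)
    C = [[row[j] for j in cols] for row in weight]  # m x k, frozen
    R = [weight[i] for i in rows]                   # k x n, frozen
    return C, R


def updated_weight(weight, C, U, R):
    """Effective weight W + C @ U @ R, where U is the only trainable matrix."""
    delta = matmul(matmul(C, U), R)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]
```

With `U` initialized to zeros the effective weight equals the pre-trained one, and the trainable parameter count is `k * k` regardless of the size of `weight`.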

An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

Sep 18, 2024

Peng Liu, Jiawei Zhu, Cong Xu, Ming Zhao, Bin Wang

Abstract: As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is in charge of combining multiple scores predicted by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, which decides the ultimate recommendation results. In recent years, to maximize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) has been widely used for MTF in large-scale RSs. However, limited by their modeling pattern, all current RL-MTF methods can only use user features as the state to generate actions for each user and cannot make use of item features and other valuable features, which leads to suboptimal results. Addressing this problem is a challenge that requires breaking through the current modeling pattern of RL-MTF. To solve this problem, we propose a novel method called Enhanced-State RL for MTF in RSs. Unlike the existing methods mentioned above, our method first defines user features, item features, and other valuable features collectively as the enhanced state; it then proposes a novel actor and critic learning process that uses the enhanced state to take a much better action for each user-item pair. To the best of our knowledge, this novel modeling pattern is being proposed for the first time in the field of RL-MTF. We conduct extensive offline and online experiments in a large-scale RS. The results demonstrate that our model outperforms other models significantly. Enhanced-State RL has been fully deployed in our RS for more than half a year, improving user valid consumption by 3.84% and user duration time by 0.58% compared to the baseline.

* arXiv admin note: substantial text overlap with arXiv:2404.17589
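
The sketch below shows the shape of the idea in the abstract: the state is built per user-item pair from user, item, and other features, an actor maps that state to fusion weights, and the weights combine the per-task scores into one final score. The weighted-product fusion formula and the `actor` callable are assumptions for illustration; the abstract names neither.

```python
import math


def enhanced_state(user_feats, item_feats, context_feats):
    """Concatenate user, item, and other features into one state vector."""
    return list(user_feats) + list(item_feats) + list(context_feats)


def fuse(task_scores, weights):
    """Assumed fusion rule: product of each task score raised to its weight."""
    return math.prod(s ** w for s, w in zip(task_scores, weights))


def score_pair(user_feats, item_feats, context_feats, task_scores, actor):
    """Final score for one user-item pair.

    actor: callable mapping the enhanced state to one weight per task score
           (a stand-in for the trained actor network).
    """
    state = enhanced_state(user_feats, item_feats, context_feats)
    return fuse(task_scores, actor(state))
```

Because the state includes item features, two items shown to the same user can receive different fusion weights, which a user-only state cannot express.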

MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Sep 10, 2024

Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw

Abstract: The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
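
A minimal sketch of input-dependent encoder selection as described above, assuming top-k routing over softmax gate scores and concatenation of base and mixed weak features. The abstract does not state the routing rule or how features are combined, so both are assumptions, and the encoders and router here are plain callables rather than real networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mowe_features(audio, base_encoder, weak_encoders, router, top_k=1):
    """Base features concatenated with a gated mix of the top_k weak encoders.

    router: callable returning one raw score per weak encoder for this input.
    Returns (features, indices of the activated weak encoders); top_k >= 1.
    """
    gate = softmax(router(audio))
    active = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:top_k]
    norm = sum(gate[i] for i in active)
    mixed = None
    for i in active:
        out = weak_encoders[i](audio)  # only activated encoders are run
        if mixed is None:
            mixed = [0.0] * len(out)
        for j, value in enumerate(out):
            mixed[j] += gate[i] / norm * value
    return base_encoder(audio) + mixed, active
```

Only the selected weak encoders run for a given input, which is how a pool of encoders can add capacity without every encoder contributing compute to every input.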

UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Sep 08, 2024

Peng Xie, Minbo Ma, Bin Wang, Junbo Zhang, Tianrui Li

Abstract: Accurate prediction of metro Origin-Destination (OD) flow is essential for the development of intelligent transportation systems and effective urban traffic management. Existing approaches typically either predict passenger outflow of departure stations or inflow of destination stations. However, we argue that travelers generally have clearly defined departure and arrival stations, making these OD pairs inherently interconnected. Consequently, considering OD pairs as a unified entity more accurately reflects actual metro travel patterns and allows for analyzing potential spatio-temporal correlations between different OD pairs. To address these challenges, we propose a novel and effective urban metro OD flow prediction method (UMOD), comprising three core modules: a data embedding module, a temporal relation module, and a spatial relation module. The data embedding module projects raw OD pair inputs into hidden space representations, which are subsequently processed by the temporal and spatial relation modules to capture both inter-pair and intra-pair spatio-temporal dependencies. Experimental results on two real-world urban metro OD flow datasets demonstrate that adopting the OD pairs perspective is critical for accurate metro OD flow prediction. Our method outperforms existing approaches, delivering superior predictive performance.

CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

Sep 05, 2024

Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He

Abstract: Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and are highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.

* Project Website: https://github.com/opendatalab/UniMERNet/tree/main/cdm
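
The sketch below illustrates spatially aware character matching on already-detected characters. It assumes each rendered formula has been reduced to `(label, x, y)` triples with normalized coordinates, uses greedy one-to-one matching within a position tolerance, and reports an F1 score. The rendering and detection steps are omitted, and the greedy rule, the tolerance `tol`, and the F1 aggregation are all assumptions rather than the published CDM procedure.

```python
def match_characters(pred, truth, tol=0.1):
    """Count predicted characters matched one-to-one to ground-truth characters.

    A match requires the same label and a position offset of at most `tol`
    on both axes; the nearest unused candidate wins.
    """
    unused = list(truth)
    matched = 0
    for label, x, y in pred:
        best, best_dist = None, tol
        for cand in unused:
            if cand[0] != label:
                continue
            dist = max(abs(cand[1] - x), abs(cand[2] - y))
            if dist <= best_dist:
                best, best_dist = cand, dist
        if best is not None:
            unused.remove(best)
            matched += 1
    return matched


def matching_f1(pred, truth, tol=0.1):
    """F1 over matched characters; 0.0 when either side is empty or nothing matches."""
    if not pred or not truth:
        return 0.0
    hits = match_characters(pred, truth, tol)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Two LaTeX strings that differ in source text but render to the same glyphs at the same positions would score 1.0 here, which is the property the abstract contrasts with text-based metrics such as BLEU and Edit Distance.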

ToolACE: Winning the Points of LLM Function Calling

Sep 02, 2024

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu (+17 more)

Abstract: Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.

* 21 pages, 22 figures
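
A hedged sketch of what a dual-layer check on one generated function call could look like: a rule layer validates the call against an API definition, and a model layer is represented by a placeholder callable. The schema layout (`name`, `parameters`, `required`), the specific rules, and the `model_judge` interface are hypothetical; the abstract only states that rule-based and model-based checks are combined.

```python
def rule_check(call, api):
    """Rule layer: return a list of schema violations for one function call."""
    errors = []
    if call.get("name") != api["name"]:
        errors.append("function name mismatch")
    args = call.get("arguments", {})
    for name in api["required"]:
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        expected = api["parameters"].get(name)
        if expected is None:
            errors.append(f"unexpected argument: {name}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for {name}")
    return errors


def verify(call, api, model_judge):
    """Run the rule layer first, then the model layer only if the rules pass."""
    errors = rule_check(call, api)
    if errors:
        return False, errors
    if not model_judge(call):
        return False, ["rejected by model-based check"]
    return True, []


# Hypothetical API definition used only for illustration.
WEATHER_API = {
    "name": "get_weather",
    "parameters": {"city": str, "days": int},
    "required": ["city"],
}
```

Running the cheap rule layer first means the model-based judge is only consulted for samples that are already structurally valid.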
