A real-world information extraction (IE) system for semi-structured document images often involves a long pipeline of multiple modules, whose complexity dramatically increases its development and maintenance cost. One can instead consider an end-to-end model that directly maps the input to the target output and simplify the entire process. However, such generation approach is known to lead to unstable performance if not designed carefully. Here we present our recent effort on transitioning from our existing pipeline-based IE system to an end-to-end system focusing on practical challenges that are associated with replacing and deploying the system in real, large-scale production. By carefully formulating document IE as a sequence generation task, we show that a single end-to-end IE system can be built and still achieve competent performance.
The restricted Boltzmann machine (RBM) is a representative generative model based on the concept of statistical mechanics. In spite of the strong merit of interpretability, unavailability of backpropagation makes it less competitive than other generative models. Here we derive differentiable loss functions for both binary and multinary RBMs. Then we demonstrate their learnability and performance by generating colored face images.
Deep learning approaches to semantic parsing require a large amount of labeled data, but annotating complex logical forms is costly. Here, we propose Syntactic Question Abstraction and Retrieval (SQAR), a method to build a neural semantic parser that translates a natural language (NL) query to a SQL logical form (LF) with less than 1,000 annotated examples. SQAR first retrieves a logical pattern from the train data by computing the similarity between NL queries and then grounds a lexical information on the retrieved pattern in order to generate the final LF. We validate SQAR by training models using various small subsets of WikiSQL train data achieving up to 4.9% higher LF accuracy compared to the previous state-of-the-art models on WikiSQL test set. We also show that by using query-similarity to retrieve logical pattern, SQAR can leverage a paraphrasing dataset achieving up to 5.9% higher LF accuracy compared to the case where SQAR is trained by using only WikiSQL data. In contrast to a simple pattern classification approach, SQAR can generate unseen logical patterns upon the addition of new examples without re-training the model. We also discuss an ideal way to create cost efficient and robust train datasets when the data distribution can be approximated under a data-hungry setting.
Information Extraction (IE) for document images is often approached as a BIO tagging problem, where the model sequentially goes through and classifies each recognized input token into one of the information categories. However, such problem setup has two inherent limitations that (1) it can only extract a flat list of information and (2) it assumes that the input data is serialized, often by a simple rule-based script. Nevertheless, real-world documents often contain hierarchical information in the form of two-dimensional language data in which the serialization can be highly non-trivial. To tackle these issues, we propose SPADE$\spadesuit$ (SPatial DEpendency parser), an end-to-end spatial dependency parser that is serializer-free and capable of modeling an arbitrary number of information layers, making it suitable for parsing structure-rich documents such as receipts and multimodal documents such as name cards. We show that SPADE$\spadesuit$ outperforms the previous BIO tagging-based approach on name card parsing task and achieves comparable performance on receipt parsing task. Especially, when the receipt images have non-flat manifold representing physical distortion of receipt paper in real-world, SPADE$\spadesuit$ outperforms the tagging-based method by a large margin of 25.8% highlighting the strong performance of SPADE$\spadesuit$ over spatially complex document.
WikiSQL is the task of mapping a natural language question to a SQL query given a table from a Wikipedia article. We first show that learning highly context- and table-aware word representations is arguably the most important consideration for achieving a high accuracy in the task. We explore three variants of BERT-based architecture and our best model outperforms the previous state of the art by 8.2% and 2.5% in logical form and execution accuracy, respectively. We provide a detailed analysis of the models to guide how word contextualization can be utilized in a such semantic parsing task. We then argue that this score is near the upper bound in WikiSQL, where we observe that the most of the evaluation errors are due to wrong annotations. We also measure human accuracy on a portion of the dataset and show that our model exceeds the human performance, at least by 1.4% execution accuracy.