Mingjie Zhan

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Aug 15, 2023
Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We find that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt that encourages GPT-4 Code Interpreter to use code to self-verify its answers. When the verification state registers as "False", the model automatically amends its solution, analogous to a student correcting errors during a mathematics exam. Furthermore, we recognize that the verification states indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
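As a rough illustration of how verification states could be folded into majority voting, the sketch below weights each sampled answer by its self-verification outcome. The state names and weights are illustrative assumptions, not the paper's exact scheme.

```python
from collections import defaultdict

# Hypothetical weights per verification state; the paper weights votes by the
# confidence implied by code-based self-verification, but these exact numbers
# are illustrative assumptions.
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}

def weighted_majority_vote(samples):
    """Pick a final answer from (answer, verification_state) pairs.

    Each sample is one GPT-4 Code Interpreter solution whose answer has been
    self-verified with code; the verification state scales its vote.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += STATE_WEIGHTS.get(state, 0.0)
    # Return the answer with the highest accumulated, confidence-weighted score.
    return max(scores, key=scores.get)

if __name__ == "__main__":
    samples = [("84", "True"), ("84", "True"), ("42", "False"), ("84", "Uncertain")]
    print(weighted_majority_vote(samples))  # -> "84"
```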

VCSUM: A Versatile Chinese Meeting Summarization Dataset

May 15, 2023
Han Wu, Mingjie Zhan, Haochen Tan, Zhaohui Hou, Ding Liang, Linqi Song

Compared to news and chat summarization, the development of meeting summarization has been greatly slowed by limited data. To this end, we introduce a versatile Chinese meeting summarization dataset, dubbed VCSum, consisting of 239 real-life meetings with a total duration of over 230 hours. We call the dataset versatile because it provides annotations of topic segmentation, headlines, segment summaries, overall meeting summaries, and salient sentences for each meeting transcript. As such, the dataset can support various summarization tasks and methods, including segmentation-based summarization, multi-granularity summarization, and retrieval-then-generate summarization. Our analysis confirms the effectiveness and robustness of VCSum. We also provide a set of benchmark models for different downstream summarization tasks on VCSum to facilitate further research. The dataset and code will be released at https://github.com/hahahawu/VCSum.
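To give a concrete picture of how the multi-level annotations could be organized, here is a hypothetical record layout in Python; the field names are assumptions, and the authoritative schema is the one in the released dataset on GitHub.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout mirroring the annotation types named in the abstract
# (topic segments, headlines, segment summaries, overall summary, salient
# sentences); the real schema is defined by the released data.
@dataclass
class MeetingSegment:
    headline: str           # short headline for this topic segment
    summary: str            # summary of the segment
    utterances: List[str]   # transcript sentences in the segment
    salient_ids: List[int]  # indices of salient sentences within the segment

@dataclass
class MeetingRecord:
    meeting_id: str
    overall_summary: str = ""
    segments: List[MeetingSegment] = field(default_factory=list)

    def extractive_labels(self) -> List[int]:
        """Flatten salient-sentence indices, e.g. for retrieval-then-generate."""
        labels, offset = [], 0
        for seg in self.segments:
            labels.extend(offset + i for i in seg.salient_ids)
            offset += len(seg.utterances)
        return labels
```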

* Findings of ACL 2023 (long paper). GitHub: https://github.com/hahahawu/VCSum 

Self-Supervised Sentence Compression for Meeting Summarization

May 13, 2023
Haochen Tan, Han Wu, Wei Shao, Xinyun Zhang, Mingjie Zhan, Zhaohui Hou, Ding Liang, Linqi Song

Conventional summarization models often fail to capture critical information in meeting transcripts, as meeting corpora usually involve multiple parties, lengthy conversations, and a large amount of redundant and trivial content. To tackle this problem, we present SVB, an effective and efficient framework for meeting summarization that compresses redundancy while preserving important content via three processes: sliding-window dialogue restoration and Scoring, channel-wise importance score Voting, and relative positional Bucketing. Specifically, under the self-supervised paradigm, the sliding-window scoring rates the importance of each token from multiple views. These ratings are then aggregated by channel-wise voting, and tokens with high ratings are regarded as salient information and labeled as anchors. Finally, to tailor the lengthy input to an acceptable length for the language model, the relative positional bucketing algorithm retains the anchors while compressing other, less relevant content at different granularities. Without large-scale pre-training or expert-grade annotation tools, our proposed method outperforms previous state-of-the-art approaches. Extensive evaluations and analyses are conducted to demonstrate the effectiveness of our method.
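The channel-wise voting and anchor labeling step might look roughly like the sketch below; the rank-based aggregation and the keep ratio are illustrative assumptions rather than the paper's exact voting rule.

```python
import numpy as np

def vote_anchors(channel_scores: np.ndarray, keep_ratio: float = 0.3) -> np.ndarray:
    """Aggregate per-channel token importance scores and label anchor tokens.

    channel_scores has shape (num_channels, num_tokens): each row is one "view"
    produced by the sliding-window scoring step. Averaging per-channel ranks and
    keeping the top keep_ratio of tokens are assumptions made for illustration.
    """
    # Rank tokens within each channel so channels with different score scales
    # contribute equally to the vote.
    ranks = channel_scores.argsort(axis=1).argsort(axis=1)
    votes = ranks.mean(axis=0)
    # Keep the highest-voted tokens as anchors (salient information).
    k = max(1, int(keep_ratio * channel_scores.shape[1]))
    anchors = np.zeros(channel_scores.shape[1], dtype=bool)
    anchors[np.argsort(votes)[-k:]] = True
    return anchors

if __name__ == "__main__":
    scores = np.random.rand(4, 20)  # 4 scoring views over 20 tokens
    print(vote_anchors(scores))
```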

Towards Versatile and Efficient Visual Knowledge Injection into Pre-trained Language Models with Cross-Modal Adapters

May 12, 2023
Xinyun Zhang, Haochen Tan, Han Wu, Mingjie Zhan, Ding Liang, Bei Yu

Humans learn language via multi-modal knowledge. However, due to their text-only pre-training scheme, most existing pre-trained language models (PLMs) cannot exploit multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or image encoder of vision-language models (VLMs) to encode the visual information and update all the original parameters of the PLMs for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned by pre-trained VLMs and efficiently inject it into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential of VLMs, X-adapters consist of two sub-modules, V-expert and T-expert, which fuse the VLMs' image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
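A minimal sketch of such an adapter, assuming simple bottleneck experts with a residual connection (the paper's exact architecture and gating may differ):

```python
import torch
import torch.nn as nn

class XAdapter(nn.Module):
    """Sketch of a plug-and-play adapter with a V-expert and a T-expert.

    The V-expert fuses image representations from a frozen VLM and the T-expert
    fuses its text representations; layer sizes and the fusion form are
    illustrative assumptions rather than the paper's exact design.
    """

    def __init__(self, plm_dim: int, vlm_dim: int, bottleneck: int = 64):
        super().__init__()
        self.v_expert = nn.Sequential(nn.Linear(plm_dim + vlm_dim, bottleneck),
                                      nn.ReLU(), nn.Linear(bottleneck, plm_dim))
        self.t_expert = nn.Sequential(nn.Linear(plm_dim + vlm_dim, bottleneck),
                                      nn.ReLU(), nn.Linear(bottleneck, plm_dim))

    def forward(self, hidden, vlm_image=None, vlm_text=None):
        # Activate only the experts relevant to the downstream task.
        out = hidden
        if vlm_image is not None:
            out = out + self.v_expert(torch.cat([hidden, vlm_image], dim=-1))
        if vlm_text is not None:
            out = out + self.t_expert(torch.cat([hidden, vlm_text], dim=-1))
        return out  # residual connection keeps the PLM's original signal

# Only the adapter parameters would be optimized; the PLM and VLM stay frozen,
# e.g. optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```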

Learning Locality and Isotropy in Dialogue Modeling

May 29, 2022
Han Wu, Haochen Tan, Mingjie Zhan, Gangming Zhao, Shaoqing Lu, Ding Liang, Linqi Song

Existing dialogue modeling methods have achieved promising performance on various dialogue tasks with the aid of Transformers and large-scale pre-trained language models. However, recent studies have revealed that the context representations produced by these methods suffer from anisotropy. In this paper, we find that the generated representations are also not conversational: they lose conversation-structure information during the context modeling stage. To this end, we identify two desirable properties in dialogue modeling, locality and isotropy, and present a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces. Experimental results show that our approach significantly outperforms current state-of-the-art models on three dialogue tasks under both automatic and human evaluation metrics. More in-depth analyses further confirm the effectiveness of our proposed approach.
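As a loose illustration of the two properties, the sketch below penalizes tokens within one utterance for drifting apart (locality) and tokens across utterances for collapsing together (isotropy); the actual SimDRC objective is the one defined in the paper.

```python
import torch
import torch.nn.functional as F

def locality_isotropy_penalty(reps: torch.Tensor, utterance_ids: torch.Tensor):
    """Illustrative regularizers for the two properties named in the abstract.

    reps: (num_tokens, dim) context representations.
    utterance_ids: (num_tokens,) marks which utterance each token belongs to.
    """
    reps = F.normalize(reps, dim=-1)
    sim = reps @ reps.t()  # pairwise cosine similarity
    same = utterance_ids.unsqueeze(0) == utterance_ids.unsqueeze(1)
    eye = torch.eye(len(reps), dtype=torch.bool)
    # Locality: tokens within one utterance should stay close to each other.
    locality = 1.0 - sim[same & ~eye].mean()
    # Isotropy: tokens from different utterances should not all collapse together.
    isotropy = sim[~same].mean()
    return locality + isotropy
```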

* 18 pages, 4 figures 

GroupLink: An End-to-end Multitask Method for Word Grouping and Relation Extraction in Form Understanding

May 10, 2021
Zilong Wang, Mingjie Zhan, Houxing Ren, Zhaohui Hou, Yuwei Wu, Xingyan Zhang, Ding Liang

Forms are a common type of real-life document and carry rich information through their textual content and organizational structure. To realize automatic processing of forms, word grouping and relation extraction are two fundamental and crucial steps after preliminary optical character recognition (OCR). Word grouping aggregates words that belong to the same semantic entity, and relation extraction predicts the links between semantic entities. Existing works treat these as two separate tasks, yet the tasks are correlated and can reinforce each other: the grouping process refines the integrated representation of each entity, and the linking process gives feedback on grouping performance. For this purpose, we acquire multimodal features from both textual data and layout information and build an end-to-end model, trained with multitask learning, that combines word grouping and relation extraction to enhance performance on each task. We validate our proposed method on a real-world, fully annotated, noisily scanned benchmark, FUNSD, and extensive experiments demonstrate the effectiveness of our method.
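A possible shape of the multitask head, assuming bilinear pairwise scorers and a simple summed loss (both are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class GroupLinkHead(nn.Module):
    """Sketch of a multitask head on top of shared multimodal word features.

    One bilinear scorer decides whether two words belong to the same semantic
    entity (grouping); another scores links between pooled entity features
    (relation extraction). Training both objectives updates the shared encoder,
    letting the two tasks reinforce each other.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.group_scorer = nn.Bilinear(dim, dim, 1)
        self.link_scorer = nn.Bilinear(dim, dim, 1)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, word_a, word_b, ent_a, ent_b, group_labels, link_labels):
        group_logits = self.group_scorer(word_a, word_b).squeeze(-1)
        link_logits = self.link_scorer(ent_a, ent_b).squeeze(-1)
        # Multitask objective: sum of the grouping and linking losses.
        return self.bce(group_logits, group_labels) + self.bce(link_logits, link_labels)
```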

DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding

Oct 15, 2020
Zilong Wang, Mingjie Zhan, Xuebo Liu, Ding Liang

Form understanding depends on both textual content and organizational structure. Although modern OCR performs well, general form understanding remains challenging because forms are used widely and come in a variety of formats. The table detection and handcrafted features used in previous works cannot be applied to all forms because of their format requirements. We therefore concentrate on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features. We consider the form structure as a tree-like or graph-like hierarchy of text fragments, where the parent-child relation corresponds to the key-value pairs in forms. We utilize state-of-the-art models and design targeted extraction modules to extract multimodal features from semantic content, layout information, and visual images. A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation. We also adopt an asymmetric algorithm and negative sampling in our model. We validate our method on two benchmarks, MedForm and FUNSD, and extensive experiments demonstrate the effectiveness of our method.
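The hybrid fusion of concatenation and feature shifting might be sketched as follows; the dimensions and the exact shifting formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    """Illustrative fusion of semantic, layout, and visual fragment features.

    It concatenates the three modalities and also lets layout and visual
    features additively shift the text features, loosely following the
    "concatenation and feature shifting" idea; the paper's exact fusion differs.
    """

    def __init__(self, text_dim: int, layout_dim: int, visual_dim: int, out_dim: int):
        super().__init__()
        self.shift = nn.Linear(layout_dim + visual_dim, text_dim)  # shift offsets
        self.project = nn.Linear(text_dim + layout_dim + visual_dim, out_dim)

    def forward(self, text, layout, visual):
        shifted_text = text + self.shift(torch.cat([layout, visual], dim=-1))
        fused = torch.cat([shifted_text, layout, visual], dim=-1)
        return self.project(fused)  # joint representation of one text fragment
```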

* Accepted to EMNLP 2020 Findings 