Alert button
Picture for Han Shi

Han Shi

Alert button

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Sep 22, 2023
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

* Technical Report, Work in Progress. Project Page: https://meta-math.github.io/ 
Viaarxiv icon

Forward-Backward Reasoning in Large Language Models for Verification

Aug 23, 2023
Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, James T. Kwok

Figure 1 for Forward-Backward Reasoning in Large Language Models for Verification
Figure 2 for Forward-Backward Reasoning in Large Language Models for Verification
Figure 3 for Forward-Backward Reasoning in Large Language Models for Verification
Figure 4 for Forward-Backward Reasoning in Large Language Models for Verification

Chain-of-Though (CoT) prompting has shown promising performance in various reasoning tasks. Recently, Self-Consistency \citep{wang2023selfconsistency} proposes to sample a diverse set of reasoning chains which may lead to different answers while the answer that receives the most votes is selected. In this paper, we propose a novel method to use backward reasoning in verifying candidate answers. We mask a token in the question by ${\bf x}$ and ask the LLM to predict the masked token when a candidate answer is provided by \textit{a simple template}, i.e., "\textit{\textbf{If we know the answer of the above question is \{a candidate answer\}, what is the value of unknown variable ${\bf x}$?}}" Intuitively, the LLM is expected to predict the masked token successfully if the provided candidate answer is correct. We further propose FOBAR to combine forward and backward reasoning for estimating the probability of candidate answers. We conduct extensive experiments on six data sets and three LLMs. Experimental results demonstrate that FOBAR achieves state-of-the-art performance on various reasoning benchmarks.

* Preprint 
Viaarxiv icon

GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

Aug 22, 2023
Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han, James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang

Figure 1 for GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Figure 2 for GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Figure 3 for GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Figure 4 for GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specially, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. And the shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the issue of the local minimum dilemma. Compared with the existing methods, GrowCLIP improves 2.3% average top-1 accuracy on zero-shot image classification of 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP can improve 1.2% for top-1 image-to-text recall on Flickr30K dataset.

* Accepted by ICCV2023 
Viaarxiv icon

Backward Reasoning in Large Language Models for Verification

Aug 15, 2023
Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, James T. Kwok

Figure 1 for Backward Reasoning in Large Language Models for Verification
Figure 2 for Backward Reasoning in Large Language Models for Verification
Figure 3 for Backward Reasoning in Large Language Models for Verification
Figure 4 for Backward Reasoning in Large Language Models for Verification

Chain-of-Though (CoT) prompting has shown promising performance in various reasoning tasks. Recently, Self-Consistency \citep{wang2023selfconsistency} proposes to sample a diverse set of reasoning chains which may lead to different answers while the answer that receives the most votes is selected. In this paper, we propose a novel method to use backward reasoning in verifying candidate answers. We mask a token in the question by ${\bf x}$ and ask the LLM to predict the masked token when a candidate answer is provided by \textit{a simple template}, i.e., ``\textit{\textbf{If we know the answer of the above question is \{a candidate answer\}, what is the value of unknown variable ${\bf x}$?}}'' Intuitively, the LLM is expected to predict the masked token successfully if the provided candidate answer is correct. We further propose FOBAR to combine forward and backward reasoning for estimating the probability of candidate answers. We conduct extensive experiments on six data sets and three LLMs. Experimental results demonstrate that FOBAR achieves state-of-the-art performance on various reasoning benchmarks.

* Preprint 
Viaarxiv icon

DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks

Jul 05, 2023
Jingwei Zhang, Han Shi, Jincheng Yu, Enze Xie, Zhenguo Li

Figure 1 for DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks

Generative models can be categorized into two types: explicit generative models that define explicit density forms and allow exact likelihood inference, such as score-based diffusion models (SDMs) and normalizing flows; implicit generative models that directly learn a transformation from the prior to the data distribution, such as generative adversarial nets (GANs). While these two types of models have shown great success, they suffer from respective limitations that hinder them from achieving fast sampling and high sample quality simultaneously. In this paper, we propose a unified theoretic framework for SDMs and GANs. We shown that: i) the learning dynamics of both SDMs and GANs can be described as a novel SDE named Discriminator Denoising Diffusion Flow (DiffFlow) where the drift can be determined by some weighted combinations of scores of the real data and the generated data; ii) By adjusting the relative weights between different score terms, we can obtain a smooth transition between SDMs and GANs while the marginal distribution of the SDE remains invariant to the change of the weights; iii) we prove the asymptotic optimality and maximal likelihood training scheme of the DiffFlow dynamics; iv) under our unified theoretic framework, we introduce several instantiations of the DiffFLow that provide new algorithms beyond GANs and SDMs with exact likelihood inference and have potential to achieve flexible trade-off between high sample quality and fast sampling speed.

* Tech Report 
Viaarxiv icon

Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Jun 04, 2023
Dezhan Tu, Yeye He, Weiwei Cui, Song Ge, Haidong Zhang, Han Shi, Dongmei Zhang, Surajit Chaudhuri

Figure 1 for Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines
Figure 2 for Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines
Figure 3 for Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines
Figure 4 for Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are \emph{recurring} (e.g., daily or hourly) in production settings to keep data updated so that ML models can be re-trained regularly, and BI dashboards refreshed frequently. However, data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time. As modern enterprises operate thousands of recurring pipelines, today data engineers have to spend substantial efforts to \emph{manually} monitor and resolve DQ issues, as part of their DataOps and MLOps practices. Given the high human cost of managing large-scale pipeline operations, it is imperative that we can \emph{automate} as much as possible. In this work, we propose Auto-Validate-by-History (AVH) that can automatically detect DQ issues in recurring pipelines, leveraging rich statistics from historical executions. We formalize this as an optimization problem, and develop constant-factor approximation algorithms with provable precision guarantees. Extensive evaluations using 2000 production data pipelines at Microsoft demonstrate the effectiveness and efficiency of AVH.

* full version of a paper accepted to KDD 2023 
Viaarxiv icon

DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning

May 04, 2023
Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li

Figure 1 for DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
Figure 2 for DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
Figure 3 for DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
Figure 4 for DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning

Diffusion models have proven to be highly effective in generating high-quality images. However, adapting large pre-trained diffusion models to new domains remains an open challenge, which is critical for real-world applications. This paper proposes DiffFit, a parameter-efficient strategy to fine-tune large pre-trained diffusion models that enable fast adaptation to new domains. DiffFit is embarrassingly simple that only fine-tunes the bias term and newly-added scaling factors in specific layers, yet resulting in significant training speed-up and reduced model storage costs. Compared with full fine-tuning, DiffFit achieves 2$\times$ training speed-up and only needs to store approximately 0.12\% of the total model parameters. Intuitive theoretical analysis has been provided to justify the efficacy of scaling factors on fast adaptation. On 8 downstream datasets, DiffFit achieves superior or competitive performances compared to the full fine-tuning while being more efficient. Remarkably, we show that DiffFit can adapt a pre-trained low-resolution generative model to a high-resolution one by adding minimal cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of 3.02 on ImageNet 512$\times$512 benchmark by fine-tuning only 25 epochs from a public pre-trained ImageNet 256$\times$256 checkpoint while being 30$\times$ more training efficient than the closest competitor.

* Tech Report 
Viaarxiv icon