Guoqing Zheng

Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance

Aug 08, 2023
Xuchao Zhang, Menglin Xia, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, Victor Ruhle

Retrieval-augmented models show promise in enhancing traditional language models by improving their contextual understanding, integrating private data, and reducing hallucination. However, the processing time required by retrieval-augmented large language models poses a challenge for tasks that require real-time responses, such as composition assistance. To overcome this limitation, we propose the Hybrid Retrieval-Augmented Generation (HybridRAG) framework, which combines client and cloud models. HybridRAG incorporates retrieval-augmented memory generated asynchronously by a Large Language Model (LLM) in the cloud. By integrating this retrieval-augmented memory, the client model gains the ability to generate highly effective responses, benefiting from the LLM's capabilities. Furthermore, through asynchronous memory integration, the client model can deliver real-time responses to user requests without waiting for memory synchronization from the cloud. Our experiments on Wikitext and Pile subsets show that HybridRAG achieves lower latency than a cloud-based retrieval-augmented LLM while outperforming client-only models in utility.
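
The core of the framework, as described, is that the client answers immediately while cloud-generated memory arrives asynchronously. Below is a minimal sketch of that split in Python; `cloud_build_memory` and `client_generate` are hypothetical callables, since the paper's actual interfaces are not given in the abstract.

```python
# Minimal sketch of the asynchronous client/cloud split described above.
# All names here are illustrative, not the paper's implementation.
import threading

class HybridRAGClient:
    def __init__(self, cloud_build_memory, client_generate):
        self._build_memory = cloud_build_memory   # slow cloud LLM + retrieval call
        self._generate = client_generate          # fast on-device model
        self._memory = None                       # latest retrieval-augmented memory
        self._lock = threading.Lock()

    def refresh_memory_async(self, context):
        """Kick off a cloud memory update without blocking the client."""
        def worker():
            memory = self._build_memory(context)  # may take seconds
            with self._lock:
                self._memory = memory
        threading.Thread(target=worker, daemon=True).start()

    def respond(self, prompt):
        """Answer immediately with whatever memory is currently available."""
        with self._lock:
            memory = self._memory
        return self._generate(prompt, memory)     # memory may be None at first
```

The design point this illustrates is that `respond` never blocks on the cloud: stale (or absent) memory is acceptable in exchange for real-time latency.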

Fed-ZERO: Efficient Zero-shot Personalization with Federated Mixture of Experts

Jun 14, 2023
Chen Dun, Mirian Hipolito Garcia, Guoqing Zheng, Ahmed Hassan Awadallah, Robert Sim, Anastasios Kyrillidis, Dimitrios Dimitriadis

One of the goals in Federated Learning (FL) is to create personalized models that can adapt to the context of each participating client while utilizing knowledge from a shared global model. Yet personalization often requires a fine-tuning step on clients' labeled data to achieve good performance, which may not be feasible when incoming clients are new and/or have privacy concerns. How to achieve zero-shot personalization in these scenarios remains an open question. We propose a novel solution that uses a Mixture-of-Experts (MoE) framework within an FL setup. Our method leverages the diversity of the clients to train specialized experts on different subsets of classes, and a gating function to route the input to the most relevant expert(s). The gating function harnesses the knowledge of a pretrained model, used as a common expert, to enhance its routing decisions on the fly. As a highlight, our approach improves accuracy by up to 18% in state-of-the-art FL settings while maintaining competitive zero-shot performance. In practice, our method can handle non-homogeneous data distributions, scale more efficiently, and improve state-of-the-art performance on common FL benchmarks.
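
A minimal sketch of top-k expert routing of the kind the abstract describes, in NumPy. How exactly the pretrained common expert informs the gate is not specified above, so adding its similarity scores to the gate logits here is an assumption.

```python
# Hedged sketch of top-k routing in a mixture of experts, with a
# pretrained common expert biasing the gate. Names are illustrative.
import numpy as np

def route(x, gate_weights, common_expert_feats, k=2):
    """Pick the k most relevant experts for input x.

    x: (d,) input features
    gate_weights: (n_experts, d) learned gating matrix
    common_expert_feats: (n_experts, d) signal from the pretrained model
    """
    logits = gate_weights @ x + common_expert_feats @ x  # fuse both sources
    topk = np.argsort(logits)[-k:]                       # chosen expert indices
    w = np.exp(logits[topk] - logits[topk].max())        # stable softmax
    return topk, w / w.sum()                             # experts and mix weights
```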

* 14 Pages 

Learning with Few Labeled Nodes via Augmented Graph Self-Training

Aug 26, 2022
Kaize Ding, Elnaz Nouri, Guoqing Zheng, Huan Liu, Ryen White

It is well known that the success of graph neural networks (GNNs) relies heavily on abundant human-annotated data, which is laborious to obtain and not always available in practice. How to develop highly effective GNNs when only a few labeled nodes are available remains understudied. Though self-training has been shown to be powerful for semi-supervised learning, its application to graph-structured data may fail because (1) larger receptive fields are not leveraged to capture long-range node interactions, which exacerbates the difficulty of propagating feature-label patterns from labeled to unlabeled nodes; and (2) limited labeled data makes it challenging to learn well-separated decision boundaries for different node classes without explicitly capturing the underlying semantic structure. To address the challenges of capturing informative structural and semantic knowledge, we propose a new graph data augmentation framework, AGST (Augmented Graph Self-Training), which builds two new augmentation modules (structural and semantic) on top of a decoupled GST backbone. In this work, we investigate whether this novel framework can learn an effective graph predictive model with extremely limited labeled nodes. We conduct comprehensive evaluations on semi-supervised node classification under different scenarios of limited labeled-node data. The experimental results demonstrate the unique contributions of the novel data augmentation framework to node classification with few labeled data.
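
For reference, the self-training backbone that AGST builds on can be sketched as a standard pseudo-labeling loop; the structural and semantic augmentation modules that are the paper's contribution are omitted here. `model` stands for any scikit-learn-style classifier over node features.

```python
# Plain graph self-training loop (pseudo-labeling), a sketch of the
# backbone only. AGST's augmentation modules are not reproduced here.
import numpy as np

def graph_self_train(model, X, y, labeled_idx, rounds=5, threshold=0.9):
    """X: (n_nodes, d) node features; y: (n_nodes,) labels, modified in place."""
    labeled = set(labeled_idx)
    for _ in range(rounds):
        idx = list(labeled)
        model.fit(X[idx], y[idx])                # train on current labeled set
        probs = model.predict_proba(X)
        conf = probs.max(axis=1)
        for i in np.where(conf >= threshold)[0]: # pseudo-label confident nodes
            if i not in labeled:
                y[i] = probs[i].argmax()
                labeled.add(int(i))
    return model, y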

* Under Review 

ADMoE: Anomaly Detection with Mixture-of-Experts from Noisy Labels

Aug 24, 2022
Yue Zhao, Guoqing Zheng, Subhabrata Mukherjee, Robert McCann, Ahmed Awadallah

Existing works on anomaly detection (AD) rely on clean labels from human annotators, which are expensive to acquire in practice. In this work, we propose a method to leverage weak/noisy labels (e.g., risk scores generated by machine rules for detecting malware) that are cheaper to obtain for anomaly detection. Specifically, we propose ADMoE, the first framework that enables anomaly detection algorithms to learn from noisy labels. In a nutshell, ADMoE leverages a mixture-of-experts (MoE) architecture to encourage specialized and scalable learning from multiple noisy sources. It captures the similarities among noisy labels by sharing most model parameters, while encouraging specialization by building "expert" sub-networks. To further extract signal from the noisy labels, ADMoE uses them as input features to facilitate expert learning. Extensive results on eight datasets (including a proprietary enterprise-security dataset) demonstrate the effectiveness of ADMoE, which brings up to a 34% performance improvement over not using it. It also outperforms a total of 13 leading baselines with equivalent network parameters and FLOPS. Notably, ADMoE is model-agnostic, enabling any neural-network-based detection method to handle noisy labels; we showcase its results on both multi-layer perceptrons (MLP) and the leading AD method DeepSAD.
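
A hedged sketch of the mechanisms named above: a shared backbone across noisy sources, one small "expert" head per source, and the noisy labels themselves concatenated to the input features. Layer sizes and the PyTorch layout are illustrative, not the paper's architecture.

```python
# Illustrative MoE-from-noisy-labels layout under stated assumptions.
import torch
import torch.nn as nn

class ADMoESketch(nn.Module):
    def __init__(self, n_features, n_sources, hidden=64):
        super().__init__()
        # noisy labels from each source are appended to the raw features
        self.backbone = nn.Sequential(
            nn.Linear(n_features + n_sources, hidden), nn.ReLU())
        # parameters above are shared; one lightweight expert per source
        self.experts = nn.ModuleList(
            nn.Linear(hidden, 1) for _ in range(n_sources))

    def forward(self, x, noisy_labels):
        h = self.backbone(torch.cat([x, noisy_labels], dim=-1))
        # each expert predicts the anomaly score for its own noisy source
        return torch.cat([expert(h) for expert in self.experts], dim=-1)
```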

Pathologies of Pre-trained Language Models in Few-shot Fine-tuning

Apr 17, 2022
Hanjie Chen, Guoqing Zheng, Ahmed Hassan Awadallah, Yangfeng Ji

Although adapting pre-trained language models with few examples has shown promising performance on text classification, there is a lack of understanding of where the performance gain comes from. In this work, we propose to answer this question by interpreting the adaptation behavior using post-hoc explanations of model predictions. By modeling feature statistics of explanations, we discover that (1) without fine-tuning, pre-trained models (e.g., BERT and RoBERTa) show strong prediction bias across labels; and (2) although few-shot fine-tuning can mitigate the prediction bias and yield promising prediction performance, our analysis shows that models gain this improvement by capturing non-task-related features (e.g., stop words) or shallow data patterns (e.g., lexical overlap). These observations warn that pursuing model performance with fewer examples may incur pathological prediction behavior, which calls for further sanity checks on model predictions and careful design of model evaluation in few-shot fine-tuning.
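
One diagnostic from the abstract, prediction bias across labels, can be made concrete as the distance between a model's predicted-label distribution and the uniform distribution. A small sketch, with `predict` standing for any text-to-label-id function:

```python
# Measure how skewed a classifier's predictions are across labels.
# `predict` is any callable mapping a text to a label id (illustrative).
from collections import Counter

def prediction_bias(predict, texts, n_labels):
    counts = Counter(predict(t) for t in texts)
    freqs = [counts.get(i, 0) / len(texts) for i in range(n_labels)]
    uniform = 1.0 / n_labels
    # total variation distance from the uniform distribution:
    # 0 = perfectly balanced, approaches 1 = collapses onto one label
    return 0.5 * sum(abs(f - uniform) for f in freqs)
```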

* ACL 2022 Workshop on Insights from Negative Results in NLP 

Knowledge Infused Decoding

Apr 06, 2022
Ruibo Liu, Guoqing Zheng, Shashank Gupta, Radhika Gaonkar, Chongyang Gao, Soroush Vosoughi, Milad Shokouhi, Ahmed Hassan Awadallah

Pre-trained language models (LMs) have been shown to memorize a substantial amount of knowledge from the pre-training corpora; however, they are still limited in recalling factually correct knowledge given a certain context. Hence, they tend to suffer from counterfactual or hallucinatory generation when used in knowledge-intensive natural language generation (NLG) tasks. Recent remedies to this problem focus on modifying either the pre-training or task fine-tuning objectives to incorporate knowledge, which normally require additional costly training or architecture modification of LMs for practical applications. We present Knowledge Infused Decoding (KID) -- a novel decoding algorithm for generative LMs, which dynamically infuses external knowledge into each step of the LM decoding. Specifically, we maintain a local knowledge memory based on the current context, interacting with a dynamically created external knowledge trie, and continuously update the local memory as a knowledge-aware constraint to guide decoding via reinforcement learning. On six diverse knowledge-intensive NLG tasks, task-agnostic LMs (e.g., GPT-2 and BART) armed with KID outperform many task-optimized state-of-the-art models, and show particularly strong performance in few-shot scenarios over seven related knowledge-infusion techniques. Human evaluation confirms KID's ability to generate more relevant and factual language for the input context when compared with multiple baselines. Finally, KID also alleviates exposure bias and provides stable generation quality when generating longer sequences. Code for KID is available at https://github.com/microsoft/KID.
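
The trie-guided part of the decoding scheme can be sketched as follows: at each step, next tokens that extend a knowledge-trie path receive a logit bonus. The paper's local knowledge memory and reinforcement-learning update are omitted; the flat `bonus` and the trie layout here are assumptions.

```python
# Sketch of trie-constrained decoding; not KID's full algorithm.
def build_trie(entries):
    """entries: iterable of token-id sequences from the knowledge source."""
    trie = {}
    for seq in entries:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def trie_continuations(trie, generated):
    """Tokens that extend the longest trie-matching suffix of `generated`."""
    for start in range(len(generated)):
        node = trie
        for tok in generated[start:]:
            node = node.get(tok)
            if node is None:
                break
        else:
            return set(node)       # children continue a knowledge entry
    return set(trie)               # no suffix matches: any entry may begin

def knowledge_biased_logits(logits, trie, generated, bonus=2.0):
    """Add a fixed bonus to next-token scores that stay on a trie path."""
    logits = dict(logits)          # token id -> score
    for tok in trie_continuations(trie, generated):
        logits[tok] = logits.get(tok, 0.0) + bonus
    return logits
```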

* In ICLR 2022 

CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

Nov 04, 2021
Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini, Hao Cheng, Greg Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao

Most recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, and SQuAD. In fact, many NLU models have now matched or exceeded "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training; as such, the models are provided far more data than humans require to achieve strong performance. That has motivated a line of work focused on improving the few-shot learning performance of NLU models. However, there is a lack of standardized evaluation benchmarks for few-shot NLU, resulting in different experimental settings across papers. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge performance gap in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few-shot setting. Finally, we discuss several principles and choices in designing the experimental settings for evaluating true few-shot learning performance, and suggest a unified standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at https://github.com/microsoft/CLUES.
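
A common ingredient of standardized few-shot evaluation of the kind argued for above is sampling K labeled examples per class under several seeds and reporting the variance, rather than scoring a single lucky split. The sketch below illustrates that idea only; it is not CLUES's actual protocol.

```python
# Seeded K-shot split sampling, an illustrative evaluation ingredient.
import random
from collections import defaultdict

def few_shot_splits(examples, k=10, seeds=(0, 1, 2, 3, 4)):
    """examples: list of (text, label). Yields one K-shot train set per seed."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    for seed in seeds:
        rng = random.Random(seed)
        split = []
        for items in by_label.values():
            split.extend(rng.sample(items, min(k, len(items))))
        yield seed, split   # evaluate per seed, then report mean and std
```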

* NeurIPS 2021 Datasets and Benchmarks Track 

A Conditional Generative Matching Model for Multi-lingual Reply Suggestion

Sep 15, 2021
Budhaditya Deb, Guoqing Zheng, Milad Shokouhi, Ahmed Hassan Awadallah

We study the problem of a multilingual automated reply suggestion (RS) model that serves many languages simultaneously. Multilingual models are often challenged by limited model capacity and severe data distribution skew across languages. While prior works largely focus on monolingual models, we propose Conditional Generative Matching models (CGM), optimized within a Variational Autoencoder framework, to address the challenges arising from multilingual RS. CGM does so with expressive message-conditional priors, mixture densities to enhance multilingual data representation, latent alignment for language discrimination, and effective variational optimization techniques for training multilingual RS. These enhancements yield performance that exceeds competitive baselines in relevance (ROUGE score) by more than 10% on average, and by 16% for low-resource languages. CGM also shows remarkable improvements in diversity (80%), illustrating its expressiveness in representing multilingual data.
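
One piece of the model can be written down directly: a conditional VAE objective in which the prior over the latent is conditioned on the message rather than fixed at N(0, I). The mixture-density and latent-alignment terms from the abstract are omitted; shapes and names here are illustrative.

```python
# Conditional VAE loss sketch: KL between the reply-aware posterior
# q(z | msg, reply) and a message-conditional prior p(z | msg), both
# diagonal Gaussians, plus the reconstruction term. Not CGM's full loss.
import torch

def cvae_loss(recon_logprob, mu_q, logvar_q, mu_p, logvar_p):
    """-ELBO = -E[log p(reply | z, msg)] + KL(q || p)."""
    kl = 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)
    return (-recon_logprob + kl).mean()
```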

MetaXT: Meta Cross-Task Transfer between Disparate Label Spaces

Sep 09, 2021
Srinagesh Sharma, Guoqing Zheng, Ahmed Hassan Awadallah

Despite the universal representational power of pre-trained language models, adapting them to a specific NLP task still requires a considerably large amount of labeled data. Effective task fine-tuning meets challenges when only a few labeled examples are available for the task. In this paper, we aim to address the problem of few-shot task learning by exploiting and transferring from a different task which admits a related but disparate label space. Specifically, we devise a label transfer network (LTN) to transform the labels from the source task into the target task of interest for training. Both the LTN and the model for task prediction are learned via a bi-level optimization framework, which we term MetaXT. MetaXT offers a principled solution for best adapting a pre-trained language model to the target task by transferring knowledge from the source task. Empirical evaluations on cross-task transfer settings for four NLP tasks, covering two different types of label-space disparity, demonstrate the effectiveness of MetaXT, especially when labeled data in the target task is limited.
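
The bi-level structure can be sketched as follows: the inner step adapts the task model on source examples whose labels the LTN has mapped into the target space, and the outer step differentiates the target loss through that inner step to train the LTN. This is a one-inner-step sketch under stated assumptions, not the paper's exact meta-gradient; all names are illustrative, and soft-label cross-entropy assumes PyTorch >= 1.10.

```python
# One bi-level step with functional parameters so the meta-gradient
# can flow from the target loss back into the LTN. Illustrative only.
import torch
import torch.nn.functional as F

def metaxt_step(model_params, forward, ltn, src_x, src_y_onehot,
                tgt_x, tgt_y, inner_lr=0.1):
    # inner step: fit the model to LTN-transformed source labels
    soft_targets = ltn(src_y_onehot)               # source -> target label space
    inner_loss = F.cross_entropy(forward(model_params, src_x), soft_targets)
    grads = torch.autograd.grad(inner_loss, model_params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(model_params, grads)]

    # outer step: the target loss, taken through the adapted weights,
    # carries gradient back into the LTN's parameters
    outer_loss = F.cross_entropy(forward(adapted, tgt_x), tgt_y)
    outer_loss.backward()                          # populates LTN gradients
    return outer_loss.item()
```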

WALNUT: A Benchmark on Weakly Supervised Learning for Natural Language Understanding

Aug 28, 2021
Guoqing Zheng, Giannis Karamanolakis, Kai Shu, Ahmed Hassan Awadallah

Building quality machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been shown to provide valuable supervision when a large amount of labeled data is unavailable or expensive to obtain. Existing works studying weak supervision for NLU either focus mostly on a specific task or simulate weak supervision signals from ground-truth labels; to date, a benchmark with real-world weak supervision signals for a collection of NLU tasks is still unavailable. In this paper, we propose such a benchmark, named WALNUT, to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks of different types, including both document-level and token-level prediction tasks, and for each task provides weak labels generated by multiple real-world weak sources. We conduct baseline evaluations on the benchmark to systematically test the value of weak supervision for NLU tasks, with various weak supervision methods and model architectures. We demonstrate the benefits of weak supervision for low-resource NLU tasks and expect WALNUT to stimulate further research on methodologies to best leverage weak supervision. The benchmark and code for baselines will be publicly available at aka.ms/walnut_benchmark.
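
The simplest baseline for this setting, combining several weak label sources per example by majority vote while ignoring abstentions, can be sketched in a few lines. WALNUT's baselines also include learned aggregation methods, so this is illustrative only; the `-1` abstain convention is an assumption.

```python
# Majority-vote aggregation of weak labels, an illustrative baseline.
from collections import Counter

def majority_vote(weak_labels):
    """weak_labels: per-source labels for one example; -1 means abstain."""
    votes = Counter(label for label in weak_labels if label != -1)
    return votes.most_common(1)[0][0] if votes else -1  # -1 if all abstain
```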
