Qingyang Wu

DiactTOD: Learning Generalizable Latent Dialogue Acts for Controllable Task-Oriented Dialogue Systems

Aug 01, 2023
Qingyang Wu, James Gung, Raphael Shu, Yi Zhang

Dialogue act annotations are important to improve response generation quality in task-oriented dialogue systems. However, it can be challenging to use dialogue acts to control response generation in a generalizable way because different datasets and tasks may have incompatible annotations. While alternative methods that utilize latent action spaces or reinforcement learning do not require explicit annotations, they may lack interpretability or face difficulties defining task-specific rewards. In this work, we present a novel end-to-end latent dialogue act model (DiactTOD) that represents dialogue acts in a latent space. DiactTOD, when pre-trained on a large corpus, is able to predict and control dialogue acts to generate controllable responses using these latent representations in a zero-shot fashion. Our approach demonstrates state-of-the-art performance across a wide range of experimental settings on the MultiWOZ dataset, including zero-shot, few-shot, and full data fine-tuning with both end-to-end and policy optimization configurations.
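
To make the idea of a latent dialogue act concrete, here is a minimal PyTorch sketch of a latent-act bottleneck: the dialogue context is encoded, mapped to one of K discrete latent act codes, and the chosen code's embedding conditions response generation. The module names, sizes, and the simple argmax selection are illustrative assumptions, not DiactTOD's actual architecture.

```python
# Illustrative sketch only: a latent dialogue act bottleneck, not DiactTOD itself.
import torch
import torch.nn as nn

class LatentActBottleneck(nn.Module):
    def __init__(self, hidden_size: int = 512, num_latent_acts: int = 32):
        super().__init__()
        self.act_scorer = nn.Linear(hidden_size, num_latent_acts)      # context -> act logits
        self.act_embeddings = nn.Embedding(num_latent_acts, hidden_size)

    def forward(self, context_repr: torch.Tensor, forced_act: torch.Tensor | None = None):
        """context_repr: (batch, hidden). forced_act lets a caller override the
        predicted act, which is how act-level control of the response works."""
        logits = self.act_scorer(context_repr)
        act_id = forced_act if forced_act is not None else logits.argmax(dim=-1)
        act_repr = self.act_embeddings(act_id)           # conditioning vector for the decoder
        return logits, act_id, context_repr + act_repr   # decoder consumes the fused state

# Usage: let the model predict an act, or force one to steer the response style.
bottleneck = LatentActBottleneck()
ctx = torch.randn(2, 512)
_, predicted, fused = bottleneck(ctx)                                   # model-chosen acts
_, _, fused_forced = bottleneck(ctx, forced_act=torch.tensor([3, 3]))   # controlled acts
```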

* SIGDial 2023 

Using Textual Interface to Align External Knowledge for End-to-End Task-Oriented Dialogue Systems

May 23, 2023
Qingyang Wu, Deema Alnuhait, Derek Chen, Zhou Yu

Traditional end-to-end task-oriented dialogue systems have been built with a modularized design. However, such a design often causes misalignment between the agent response and external knowledge due to inadequate representation of information. Furthermore, its evaluation metrics emphasize assessing the agent's pre-lexicalization response, neglecting the quality of the completed response. In this work, we propose a novel paradigm that uses a textual interface to align external knowledge and eliminate redundant processes. We demonstrate our paradigm in practice through MultiWOZ-Remake, including an interactive textual interface built for the MultiWOZ database and a correspondingly re-processed dataset. We train an end-to-end dialogue system to evaluate this new dataset. Experimental results show that our approach generates more natural final responses and achieves a higher task success rate than previous models.
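
A rough sketch of the general "textual interface" idea, under assumed formatting: database results are rendered as plain text and concatenated onto the dialogue context, so the end-to-end model reads knowledge in the same modality it generates. The schema and rendering below are hypothetical, not the MultiWOZ-Remake interface.

```python
# Illustrative only: render database rows as text for an end-to-end dialogue model.
def render_db_results(domain: str, rows: list[dict], max_rows: int = 3) -> str:
    if not rows:
        return f"[db] no {domain} entries match the constraints"
    lines = [f"[db] {len(rows)} {domain} entries match; showing {min(len(rows), max_rows)}:"]
    for row in rows[:max_rows]:
        lines.append("  " + ", ".join(f"{k}={v}" for k, v in row.items()))
    return "\n".join(lines)

dialogue_context = "User: I need a cheap hotel in the north with free wifi."
rows = [{"name": "Acorn Guest House", "area": "north", "price": "cheap", "wifi": "yes"}]
model_input = dialogue_context + "\n" + render_db_results("hotel", rows)
print(model_input)   # the model generates the final, lexicalized reply from this input
```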

Visual Instruction Tuning

Apr 17, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and code base publicly available.
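
A simplified sketch of the connection the abstract describes: visual features from a frozen vision encoder are projected into the language model's embedding space and prepended to the text tokens. The dimensions and the single linear projection are assumptions for illustration, not the released LLaVA code.

```python
# Illustrative only: a trainable connector from vision features to LLM embeddings.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # the trainable connector

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from e.g. a CLIP ViT
        return self.proj(patch_features)             # (batch, num_patches, llm_dim)

projector = VisionToLLMProjector()
image_tokens = projector(torch.randn(1, 256, 1024))        # pseudo "visual tokens"
text_embeds = torch.randn(1, 32, 4096)                      # embedded instruction tokens
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)  # fed to the LLM as one sequence
```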

* project page: https://llava-vl.github.io/ 

ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

Apr 09, 2023
Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for the learning of complex tasks and for transferring learned policies from simulated environments to the real world. Furthermore, state discretization limits a robot's ability to follow human instructions based on the grounding of actions and states. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD comprises 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulation continue to experience significant challenges in novel goal-state generalization, scene generalization, and object generalization. These findings highlight the need to develop new algorithms that address this gap and underscore the potential for further research in this area. See our project page at: https://arnold-benchmark.github.io
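
The key distinction from discrete-goal benchmarks is that success becomes a matter of whether a continuous state lands within a target range rather than flipping a single flag. A tiny illustrative checker, with a made-up task and tolerance rather than ARNOLD's actual evaluation rule:

```python
# Illustrative only: success under a continuous goal state means being within tolerance.
def continuous_goal_met(state_value: float, target: float, tolerance: float) -> bool:
    """E.g. 'fill the cup to about 70% full' -> state_value=0.68, target=0.70, tolerance=0.05."""
    return abs(state_value - target) <= tolerance

print(continuous_goal_met(0.68, target=0.70, tolerance=0.05))  # True: within tolerance
print(continuous_goal_met(0.40, target=0.70, tolerance=0.05))  # False: task not completed
```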

* The first two authors contributed equally; 20 pages; 17 figures; project available at: https://arnold-benchmark.github.io/ 

FaceChat: An Emotion-Aware Face-to-face Dialogue Framework

Mar 08, 2023
Deema Alnuhait, Qingyang Wu, Zhou Yu

While current dialogue systems like ChatGPT have made significant advancements in text-based interactions, they often overlook the potential of other modalities in enhancing the overall user experience. We present FaceChat, a web-based dialogue framework that enables emotionally sensitive, face-to-face conversations. By seamlessly integrating cutting-edge technologies in natural language processing, computer vision, and speech processing, FaceChat delivers a highly immersive and engaging user experience. The FaceChat framework has a wide range of potential applications, including counseling, emotional support, and personalized customer service. The system is designed to be simple and flexible, serving as a platform for future researchers to advance the field of multimodal dialogue systems. The code is publicly available at https://github.com/qywu/FaceChat.
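
A bare-bones sketch of the kind of pipeline the abstract describes: a detected facial emotion is folded into the prompt so the generated reply can be emotionally sensitive. The emotion labels and prompt template here are hypothetical assumptions, not FaceChat's actual implementation.

```python
# Illustrative only: condition response generation on a detected facial emotion.
def build_emotion_aware_prompt(user_utterance: str, detected_emotion: str) -> str:
    return (
        f"The user appears {detected_emotion}.\n"
        f"User: {user_utterance}\n"
        "Respond empathetically, acknowledging how the user seems to feel.\nAssistant:"
    )

# detected_emotion would come from a facial expression recognizer on the webcam frame
prompt = build_emotion_aware_prompt("I didn't get the job I interviewed for.", "sad")
print(prompt)  # this prompt is then sent to the response-generation model
```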

GLIGEN: Open-Set Grounded Text-to-Image Generation

Jan 17, 2023
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee

Large-scale text-to-image diffusion models have made remarkable advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
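
A condensed PyTorch sketch of the gated-injection idea in the abstract: the pre-trained layers stay frozen, a new layer attends over grounding tokens (e.g. encoded boxes and phrases), and its output is added through a learnable gate initialized at zero so training starts from the original model's behavior. Layer shapes and details are illustrative assumptions rather than GLIGEN's exact blocks.

```python
# Illustrative only: a gated residual layer that injects grounding information.
import torch
import torch.nn as nn

class GatedGroundingLayer(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no effect at initialization

    def forward(self, visual_tokens: torch.Tensor, grounding_tokens: torch.Tensor):
        fused, _ = self.attn(visual_tokens, grounding_tokens, grounding_tokens)
        return visual_tokens + torch.tanh(self.gate) * fused   # gated residual injection

layer = GatedGroundingLayer()
visual = torch.randn(1, 64, 320)       # frozen diffusion-UNet features (illustrative)
grounding = torch.randn(1, 4, 320)     # encoded box + phrase conditions (illustrative)
out = layer(visual, grounding)         # only the new layer's parameters are trained
```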

KRLS: Improving End-to-End Response Generation in Task Oriented Dialog with Reinforced Keywords Learning

Dec 20, 2022
Xiao Yu, Qingyang Wu, Kun Qian, Zhou Yu

In task-oriented dialogs, an informative and successful system response needs to include key information such as the phone number of a hotel. Therefore, we hypothesize that a model can achieve better overall performance by focusing on correctly generating key quantities. In this paper, we propose a new training algorithm, Keywords Reinforcement Learning with Next-word Sampling (KRLS), that utilizes reinforcement learning while avoiding time-consuming auto-regressive generation, together with a fine-grained per-token reward function that helps the model learn keyword generation more robustly. Empirical results show that the KRLS algorithm achieves state-of-the-art performance on the inform, success, and combined scores on the MultiWoZ benchmark dataset.
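
A toy illustration of the per-token reward idea: tokens that carry key information (e.g. a phone number or entity name) receive a larger reward than filler tokens, so the training signal concentrates on getting keywords right. The reward values and keyword matching below are placeholders, not the KRLS reward function.

```python
# Illustrative only: assign a larger reward to keyword tokens than to filler tokens.
def per_token_rewards(tokens: list[str], keywords: set[str],
                      key_reward: float = 1.0, base_reward: float = 0.1) -> list[float]:
    return [key_reward if tok in keywords else base_reward for tok in tokens]

generated = ["the", "hotel", "phone", "number", "is", "01223351241"]
keywords = {"01223351241", "hotel"}
print(per_token_rewards(generated, keywords))
# [0.1, 1.0, 0.1, 0.1, 0.1, 1.0] -> weights for a REINFORCE-style per-token loss
```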

* Added more explanation of the algorithm itself; result tables remain the same 

Keywords Reinforcement LM: Improving End-to-End Response Generation in Task Oriented Dialog

Dec 13, 2022
Xiao Yu, Qingyang Wu, Kun Qian, Zhou Yu

In task-oriented dialogs such as MultiWoZ (Budzianowski et al., 2018), an informative and successful system response needs to include key information such as the phone number of a hotel. Therefore, we hypothesize that by asking the model to focus on correctly generating key quantities, it can achieve better overall performance. In this paper, we propose a new training algorithm, Keywords Reinforcement Language Modeling (KRLM), that uses a fine-grained per-token reward function and a new per-token reinforcement learning procedure to help the model learn keyword generation more robustly during inference. Empirical results show that our proposed KRLM training algorithm achieves state-of-the-art performance on the inform rate, success rate, and combined score on the MultiWoZ benchmark dataset.

* Added algorithm analysis and supplemental information such as error examples and more training/validation curves 

Reinforced Language Modeling for End-to-End Task Oriented Dialog

Nov 30, 2022
Xiao Yu, Qingyang Wu, Kun Qian, Zhou Yu

In task-oriented dialogs such as MultiWoZ (Budzianowski et al., 2018), an informative and/or successful system response needs to include key information such as the phone number of a hotel. Therefore, we hypothesize that by helping the model focus more on learning key quantities in the dialog, it can generate more informative and helpful responses. In this paper, we propose a new training algorithm, Reinforced Language Modeling (RLM), that uses a fine-grained reward function and reinforcement learning to help the model focus on generating key quantities correctly at test time. Empirical results show that our proposed RLM achieves state-of-the-art performance on the inform rate, success rate, and combined score on MultiWoZ.

AU-Aware Vision Transformers for Biased Facial Expression Recognition

Nov 12, 2022
Shuyi Mao, Xinpeng Li, Qingyang Wu, Xiaojiang Peng

Studies have shown that domain bias and label bias exist across different Facial Expression Recognition (FER) datasets, making it hard to improve the performance on a specific dataset by adding other datasets. For the FER bias issue, recent research mainly focuses on the cross-domain setting with advanced domain adaptation algorithms. This paper addresses a different problem: how to boost FER performance by leveraging cross-domain datasets. Unlike coarse and biased expression labels, facial Action Units (AUs) are fine-grained and objective, as suggested by psychological studies. Motivated by this, we resort to the AU information of different FER datasets for performance boosting and make the following contributions. First, we experimentally show that naive joint training of multiple FER datasets harms the FER performance of individual datasets. We further introduce expression-specific mean images and AU cosine distances to measure FER dataset bias. This novel measurement is consistent with the degradation observed under joint training. Second, we propose a simple yet conceptually new framework, AU-aware Vision Transformer (AU-ViT), which improves the performance on individual datasets by jointly training on auxiliary datasets with AU or pseudo-AU labels. We also find that AU-ViT is robust to real-world occlusions. Moreover, for the first time, we show that a carefully initialized ViT achieves performance comparable to advanced deep convolutional networks. Our AU-ViT achieves state-of-the-art performance on three popular datasets, namely 91.10% on RAF-DB, 65.59% on AffectNet, and 90.15% on FERPlus. The code and models will be released soon.
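
A compact sketch of the joint-training setup the abstract suggests: a shared backbone feeds both an expression head (for the target FER dataset) and an AU head (for auxiliary datasets with AU or pseudo-AU labels). The stand-in backbone, head sizes, and loss weighting are assumptions for illustration, not the AU-ViT architecture itself.

```python
# Illustrative only: joint expression + AU supervision over a shared backbone.
import torch
import torch.nn as nn

class AUAwareClassifier(nn.Module):
    def __init__(self, feat_dim: int = 768, num_expressions: int = 7, num_aus: int = 12):
        super().__init__()
        # Stand-in for a ViT backbone; a real setup would use a transformer encoder.
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, feat_dim))
        self.expr_head = nn.Linear(feat_dim, num_expressions)   # expression logits
        self.au_head = nn.Linear(feat_dim, num_aus)              # multi-label AU logits

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.expr_head(feats), self.au_head(feats)

model = AUAwareClassifier()
images = torch.randn(4, 3, 224, 224)
expr_logits, au_logits = model(images)
expr_loss = nn.CrossEntropyLoss()(expr_logits, torch.randint(0, 7, (4,)))
au_loss = nn.BCEWithLogitsLoss()(au_logits, torch.randint(0, 2, (4, 12)).float())
total_loss = expr_loss + 0.5 * au_loss   # auxiliary AU supervision regularizes shared features
```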
