Xu Zhang

TaskWeaver: A Code-First Agent Framework

Nov 29, 2023
Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Large Language Models (LLMs) have shown impressive abilities in natural language understanding and generation, leading to their use in applications such as chatbots and virtual assistants. However, existing LLM frameworks face limitations in handling domain-specific data analytics tasks with rich data structures. Moreover, they struggle with flexibility to meet diverse user requirements. To address these issues, TaskWeaver is proposed as a code-first framework for building LLM-powered autonomous agents. It converts user requests into executable code and treats user-defined plugins as callable functions. TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic. It also incorporates domain-specific knowledge through examples and ensures the secure execution of generated code. TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios. The code is open-sourced at https://github.com/microsoft/TaskWeaver/.
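
To make the code-first idea concrete, here is a minimal sketch of the pattern the abstract describes: user-defined plugins exposed as plain Python callables, and LLM-generated code executed against them in a restricted namespace. All names here (register_plugin, run_generated_code, anomaly_detection) are hypothetical illustrations, not the actual TaskWeaver API.

```python
# Hypothetical sketch of a code-first agent runtime; not the TaskWeaver API.
PLUGINS = {}

def register_plugin(fn):
    """Expose a user-defined function to generated code as a callable plugin."""
    PLUGINS[fn.__name__] = fn
    return fn

@register_plugin
def anomaly_detection(series, threshold=3.0):
    """Toy plugin: flag points more than `threshold` std-devs from the mean."""
    mean = sum(series) / len(series)
    std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5
    return [x for x in series if abs(x - mean) > threshold * std]

def run_generated_code(code: str):
    """Execute LLM-generated code in a namespace that only sees the plugins."""
    namespace = {"__builtins__": {}, **PLUGINS}  # crude sandbox for the sketch
    exec(code, namespace)
    return namespace.get("result")

# The LLM would translate a user request into code like this:
generated = "result = anomaly_detection([1, 2, 1, 2, 50, 1], threshold=2.0)"
print(run_generated_code(generated))  # -> [50]
```

A real implementation would of course need stronger sandboxing than an emptied `__builtins__`, which is the point of the framework's secure-execution support.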

Language Grounded QFormer for Efficient Vision Language Understanding

Nov 13, 2023
Moulik Choraria, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney

Large-scale pretraining and instruction tuning have been successful for training general-purpose language models with broad competencies. However, extending to general-purpose vision-language models is challenging due to the distributional diversity in visual inputs. A recent line of work explores vision-language instruction tuning, taking inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities. However, these approaches rely heavily on large-scale multi-modal pretraining for representation learning before eventual finetuning, which incurs a huge computational overhead, scales poorly, and limits accessibility. To that end, we propose a more efficient method for QFormer-based vision-language alignment and demonstrate the effectiveness of our strategy compared to existing baselines in improving the efficiency of vision-language pretraining.

* Preprint Under Review 
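
As a rough illustration of the QFormer-style bridge discussed above (a generic sketch with assumed dimensions, not the paper's or BLIP-2's implementation): a small set of learnable query tokens cross-attends to frozen image features and emits a fixed-length sequence a language model can consume.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Illustrative QFormer-style bridge between a frozen vision encoder
    and a language model; all sizes are arbitrary for the sketch."""

    def __init__(self, num_queries=32, dim=256, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, image_feats):           # image_feats: (B, num_patches, dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        return attended + self.ffn(attended)  # (B, num_queries, dim)

frozen_feats = torch.randn(2, 196, 256)       # stand-in for frozen ViT features
print(TinyQFormer()(frozen_feats).shape)      # torch.Size([2, 32, 256])
```

Only the queries and the bridge are trained; the vision encoder and the language model stay frozen, which is what keeps this family of methods comparatively cheap.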

An Empirical Study of Instruction-tuning Large Language Models in Chinese

Oct 20, 2023
Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, Weiping Wang

The success of ChatGPT validates the potential of large language models (LLMs) in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community's interest in instruction-tuning, which is deemed to accelerate ChatGPT's replication process. However, research on instruction-tuning LLMs in Chinese, the world's most spoken language, is still in its early stages. Therefore, this paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook that provides valuable findings for effectively customizing LLMs that can better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, and instruction data types, the three most important elements for instruction-tuning. We also conduct experiments to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to the open Chinese version of ChatGPT. This paper releases a powerful Chinese LLM that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.

* EMNLP 2023 
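
One of the parameter-efficient methods such studies typically compare is LoRA; the sketch below shows the generic mechanism (rank, scaling, and sizes are illustrative assumptions, not the paper's exact configuration): the frozen base weight is augmented with a trainable low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter sketch: y = Wx + (alpha/r) * (x @ A @ B),
    with W frozen and only A, B trained."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # start at no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable parameters vs. 262656 frozen in the base layer
```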

Multiagent Reinforcement Learning with an Attention Mechanism for Improving Energy Efficiency in LoRa Networks

Sep 16, 2023
Xu Zhang, Ziqi Lin, Shimin Gong, Bo Gu, Dusit Niyato

Long Range (LoRa) wireless technology, characterized by low power consumption and a long communication range, is regarded as one of the enabling technologies for the Industrial Internet of Things (IIoT). However, as the network scale increases, the energy efficiency (EE) of LoRa networks decreases sharply due to severe packet collisions. To address this issue, it is essential to appropriately assign transmission parameters such as the spreading factor and transmission power for each end device (ED). However, due to the sporadic traffic and low duty cycle of LoRa networks, evaluating the system EE performance under different parameter settings is time-consuming. Therefore, we first formulate an analytical model to calculate the system EE. On this basis, we propose a transmission parameter allocation algorithm based on multiagent reinforcement learning (MALoRa) with the aim of maximizing the system EE of LoRa networks. Notably, MALoRa employs an attention mechanism to guide each ED to better learn how much "attention" should be given to the parameter assignments for relevant EDs when seeking to improve the system EE. Simulation results demonstrate that MALoRa significantly improves the system EE compared with baseline algorithms with an acceptable degradation in packet delivery rate (PDR).

* 6 pages, 3 figures. This paper has been accepted for publication in the IEEE Global Communications Conference (GLOBECOM) 2023 
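
The attention idea can be sketched as follows (a generic attention-weighted critic with assumed shapes and sizes, not the MALoRa implementation): each ED scores the parameter embeddings of the other EDs and weights them before estimating its own value.

```python
import torch
import torch.nn as nn

class AttentionCritic(nn.Module):
    """Sketch: an ED learns how much "attention" to pay to other EDs'
    observations (e.g., spreading factor and transmit power settings)."""

    def __init__(self, obs_dim=8, dim=32):
        super().__init__()
        self.embed = nn.Linear(obs_dim, dim)
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value_head = nn.Linear(2 * dim, 1)

    def forward(self, own_obs, other_obs):     # (B, obs_dim), (B, N, obs_dim)
        h_own, h_others = self.embed(own_obs), self.embed(other_obs)
        scores = torch.einsum("bd,bnd->bn", self.query(h_own), self.key(h_others))
        weights = torch.softmax(scores / h_own.size(-1) ** 0.5, dim=-1)
        context = torch.einsum("bn,bnd->bd", weights, h_others)
        return self.value_head(torch.cat([h_own, context], dim=-1)), weights

value, attn = AttentionCritic()(torch.randn(4, 8), torch.randn(4, 9, 8))
print(value.shape, attn.shape)                 # (4, 1) and (4, 9)
```

The learned weights also make credit assignment interpretable: an ED can see which neighbors' parameter choices matter most for its own energy efficiency.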

FashionNTM: Multi-turn Fashion Image Retrieval via Cascaded Memory

Aug 20, 2023
Anwesan Pal, Sahil Wadhwa, Ayush Jaiswal, Xu Zhang, Yue Wu, Rakesh Chada, Pradeep Natarajan, Henrik I. Christensen

Multi-turn textual feedback-based fashion image retrieval focuses on a real-world setting, where users can iteratively provide information to refine retrieval results until they find an item that fits all their requirements. In this work, we present a novel memory-based method, called FashionNTM, for such a multi-turn system. Our framework incorporates a new Cascaded Memory Neural Turing Machine (CM-NTM) approach for implicit state management, thereby learning to integrate information across all past turns to retrieve new images for a given turn. Unlike a vanilla Neural Turing Machine (NTM), our CM-NTM operates on multiple inputs, which interact with their respective memories via individual read and write heads, to learn complex relationships. Extensive evaluation results show that our proposed method outperforms the previous state-of-the-art algorithm by 50.5% on Multi-turn FashionIQ, currently the only existing multi-turn fashion dataset, in addition to achieving a relative improvement of 12.6% on Multi-turn Shoes, an extension of the single-turn Shoes dataset that we created in this work. Further analysis of the model in a real-world interactive setting demonstrates two important capabilities of our model: memory retention across turns, and agnosticism to turn order for non-contradictory feedback. Finally, user study results show that images retrieved by FashionNTM were favored by 83.1% over other multi-turn models. Project page: https://sites.google.com/eng.ucsd.edu/fashionntm

* Paper accepted at ICCV-2023 
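
The core read/write mechanics can be illustrated with a minimal content-addressed memory (a toy sketch of the NTM idea, not the FashionNTM code; slot count and width are arbitrary): reads are attention-weighted sums over slots, and writes blend new content into the addressed slots.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMemory(nn.Module):
    """Toy NTM-style memory with cosine-similarity content addressing."""

    def __init__(self, slots=16, width=64):
        super().__init__()
        self.register_buffer("M", torch.zeros(slots, width))

    def address(self, key):                    # soft attention over memory slots
        sim = F.cosine_similarity(self.M, key.unsqueeze(0), dim=-1)
        return torch.softmax(sim, dim=0)       # (slots,)

    def read(self, key):
        return self.address(key) @ self.M      # (width,)

    def write(self, key, content):
        w = self.address(key).unsqueeze(-1)    # (slots, 1)
        self.M = (1 - w) * self.M + w * content

mem = TinyMemory()
turn1 = torch.randn(64)                        # e.g. encoded feedback from turn 1
mem.write(turn1, turn1)                        # persist it across turns
print(mem.read(turn1).shape)                   # torch.Size([64])
```

The "cascaded" part of CM-NTM then amounts to giving each input stream its own such memory with its own read and write heads.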

TVPR: Text-to-Video Person Retrieval and a New Benchmark

Jul 14, 2023
Fan Ni, Xu Zhang, Jianhui Wu, Guan-Nan Dong, Aichun Zhu, Hui Liu, Yue Zhang

Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, due to the lack of dynamic information provided by isolated frames, performance is hampered when the person is obscured in isolated frames or when variable motion details are given in the textual description. In this paper, we propose a new task called Text-to-Video Person Retrieval (TVPR), which aims to effectively overcome the limitations of isolated frames. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, such as a person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. To this end, a Text-to-Video Person Retrieval Network (TVPRN) is proposed. Specifically, TVPRN acquires video representations by fusing visual and motion representations of person videos, which can deal with temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ the pre-trained BERT to obtain caption representations and the relationship between caption and video representations to reveal the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.
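
The retrieval recipe in the abstract reduces to a familiar shape, sketched below (the fusion choice and all dimensions are assumptions for illustration, not the TVPRN architecture): fuse visual and motion features into one video embedding, project the caption embedding into the same space, and rank by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoTextRetriever(nn.Module):
    """Sketch: concat-fuse visual + motion features, project text
    (e.g. a BERT [CLS] vector), score with cosine similarity."""

    def __init__(self, vis_dim=512, mot_dim=128, txt_dim=768, dim=256):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + mot_dim, dim)
        self.text_proj = nn.Linear(txt_dim, dim)

    def forward(self, vis, mot, txt):          # (N,512), (N,128), (Q,768)
        v = F.normalize(self.fuse(torch.cat([vis, mot], dim=-1)), dim=-1)
        t = F.normalize(self.text_proj(txt), dim=-1)
        return t @ v.T                         # (Q, N) similarity matrix

model = ToyVideoTextRetriever()
scores = model(torch.randn(10, 512), torch.randn(10, 128), torch.randn(1, 768))
print(scores.argmax(dim=-1))                   # index of the best-matching video
```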

Improved NL2SQL based on Multi-layer Expert Network

Jun 30, 2023
Chenduo Hao, Xu Zhang

The Natural Language to SQL (NL2SQL) technique is used to convert natural language queries into executable SQL statements. Typically, slot-filling is employed as a classification method for multi-task cases to achieve this goal. However, slot-filling can result in inaccurate SQL statement generation due to negative transfer issues arising from different classification tasks. To overcome this limitation, this study introduces a new approach called Multi-Layer Expert Generate SQL (MLEG-SQL), which utilizes a dedicated multi-task hierarchical network. The lower layer of the network extracts semantic features of natural language statements, while the upper layer builds a specialized expert system for handling specific classification tasks. This hierarchical approach mitigates performance degradation resulting from conflicts between different tasks. The proposed method was evaluated on the WikiSQL dataset and was found to be effective in generating accurate SQL statements.

* 6 pages, 1 figure. 2023 3rd International Conference on Advanced Algorithms and Neural Networks 
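
The two-level structure described above can be sketched as a shared encoder feeding separate expert heads (dimensions and the task list are illustrative assumptions, not the paper's configuration), so that gradients for one sub-task do not overwrite weights specialized for another:

```python
import torch
import torch.nn as nn

class MultiLayerExpertNet(nn.Module):
    """Sketch: shared lower layer for semantics, one expert head per
    SQL classification sub-task (task names/class counts are made up)."""

    def __init__(self, in_dim=768, hidden=256, task_classes=None):
        super().__init__()
        task_classes = task_classes or {"select_col": 12, "agg": 6, "where_op": 4}
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.experts = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_classes.items()}
        )

    def forward(self, question_enc):           # (B, in_dim) encoded question
        h = self.shared(question_enc)
        return {task: head(h) for task, head in self.experts.items()}

logits = MultiLayerExpertNet()(torch.randn(2, 768))
print({k: v.shape for k, v in logits.items()})
```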

Feature Representation Learning for NL2SQL Generation Based on Coupling and Decoupling

Jun 30, 2023
Chenduo Hao, Xu Zhang, Chuanbao Gao, Deyu Zhou

The NL2SQL task involves parsing natural language statements into SQL queries. While most state-of-the-art methods treat NL2SQL as a slot-filling task and use feature representation learning techniques, they overlook explicit correlation features between the SELECT and WHERE clauses and implicit correlation features between sub-tasks within a single clause. To address this issue, we propose the Clause Feature Correlation Decoupling and Coupling (CFCDC) model, which uses a feature representation decoupling method to separate the SELECT and WHERE clauses at the parameter level. Next, we introduce a multi-task learning architecture to decouple implicit correlation feature representation between different SQL tasks in a specific clause. Moreover, we present an improved feature representation coupling module to integrate the decoupled tasks in the SELECT and WHERE clauses and predict the final SQL query. Our proposed CFCDC model demonstrates excellent performance on the WikiSQL dataset, with significant improvements in logic precision and execution accuracy. The source code for the model will be publicly available on GitHub.

* 12 pages, 4 figures. International Conference on Artificial Neural Networks 
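
One way to read the decouple-then-couple recipe is sketched below (the gating form and sizes are assumptions, not the CFCDC code): SELECT and WHERE get parameter-level-separate encoders, and a coupling gate reintroduces the explicit correlation between the two clauses.

```python
import torch
import torch.nn as nn

class DecoupleCoupleSketch(nn.Module):
    """Sketch: separate SELECT/WHERE branches (decoupling) plus a
    learned gate that mixes them back together (coupling)."""

    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.select_enc = nn.Linear(in_dim, dim)   # decoupled SELECT branch
        self.where_enc = nn.Linear(in_dim, dim)    # decoupled WHERE branch
        self.gate = nn.Linear(2 * dim, dim)        # coupling module

    def forward(self, x):                          # (B, in_dim) question encoding
        s = torch.relu(self.select_enc(x))
        w = torch.relu(self.where_enc(x))
        g = torch.sigmoid(self.gate(torch.cat([s, w], dim=-1)))
        return g * s + (1 - g) * w                 # coupled clause feature

print(DecoupleCoupleSketch()(torch.randn(2, 768)).shape)  # torch.Size([2, 256])
```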

Meta Generative Flow Networks with Personalization for Task-Specific Adaptation

Jun 16, 2023
Xinyuan Ji, Xu Zhang, Wei Xi, Haozhi Wang, Olga Gadyatskaya, Yinchuan Li

Multi-task reinforcement learning and meta-reinforcement learning have been developed to quickly adapt to new tasks, but they tend to focus on tasks with higher rewards and more frequent occurrences, leading to poor performance on tasks with sparse rewards. To address this issue, GFlowNets can be integrated into meta-learning algorithms (GFlowMeta) by leveraging the advantages of GFlowNets on tasks with sparse rewards. However, GFlowMeta suffers from performance degradation when encountering heterogeneous transitions from distinct tasks. To overcome this challenge, this paper proposes a personalized approach named pGFlowMeta, which combines task-specific personalized policies with a meta policy. Each personalized policy balances the loss on its personalized task and the difference from the meta policy, while the meta policy aims to minimize the average loss of all tasks. The theoretical analysis shows that the algorithm converges at a sublinear rate. Extensive experiments demonstrate that the proposed algorithm outperforms state-of-the-art reinforcement learning algorithms in discrete environments.

* journal 
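
The balancing objective for each personalized policy can be written in a generic proximal form (the squared-parameter-distance penalty and the weight lambda are assumptions about the exact formulation, for illustration only):

```python
import torch

def personalized_loss(task_loss, theta_i, theta_meta, lam=0.1):
    """Sketch: trade off the personalized task loss against staying
    close to the meta policy's parameters."""
    prox = sum(torch.sum((p - q) ** 2) for p, q in zip(theta_i, theta_meta))
    return task_loss + lam * prox

theta_i = [torch.randn(4, 4, requires_grad=True)]   # personalized policy params
theta_meta = [torch.randn(4, 4)]                    # meta policy params
loss = personalized_loss(torch.tensor(1.5), theta_i, theta_meta)
loss.backward()                   # gradients flow to the personalized policy
print(theta_i[0].grad.shape)      # torch.Size([4, 4])
```

The meta policy, in turn, is updated to minimize the average of these losses across tasks, which is what the sublinear convergence result in the abstract refers to.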