Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tao Zhang

The Rise of UAV Fleet Technologies for Emergency Wireless Communications in Harsh Environments

Jul 24, 2024

Zhuohui Yao, Wenchi Cheng, Wei Zhang, Tao Zhang, Hailin Zhang

Figure 1 for The Rise of UAV Fleet Technologies for Emergency Wireless Communications in Harsh Environments

Figure 2 for The Rise of UAV Fleet Technologies for Emergency Wireless Communications in Harsh Environments

Figure 3 for The Rise of UAV Fleet Technologies for Emergency Wireless Communications in Harsh Environments

Figure 4 for The Rise of UAV Fleet Technologies for Emergency Wireless Communications in Harsh Environments

Abstract:For unforeseen emergencies, such as natural disasters and pandemic events, it is highly demanded to cope with the explosive growth of mobile data traffic in extremely critical environments. An Unmanned aerial vehicle (UAV) fleet is an effective way to facilitate the Emergency wireless COmmunication NETwork (EcoNet). In this article, a MUlti-tier Heterogeneous UAV Network (MuHun), which is with different UAV fleets in different altitudes, is proposed to flexibly serve various emergencies. We refresh the key performance indicators of full coverage, network capacity, low latency, and energy efficiency in harsh environments. Then, we present the special challenges regarding shadowing-dominated complex channel model, energy supply limited short-endurance, various communication mechanisms coexistence, and communication island for underground users in UAV-based EcoNet, followed by the MuHun-based EcoNet architecture and its advantages. Furthermore, some potential solutions such as the new hybrid-channel adapted resource allocation, reconfigurable intelligent surface assisted UAV communications, competitive heterogenous-networks, and magnetic induction based air-to-ground/underground communications are discussed to effectively achieve full coverage, high capacity, high energy efficiency, and diverse qualities of services for EcoNets in harsh environments.

Via

Access Paper or Ask Questions

Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment

Jul 15, 2024

Jinhao Jiang, Junyi Li, Wayne Xin Zhao, Yang Song, Tao Zhang, Ji-Rong Wen

Abstract:Adapting general large language models (LLMs) to specialized domains presents great challenges due to varied data distributions. This adaptation typically requires continual pre-training on massive domain-specific corpora to facilitate knowledge memorization, followed by training to apply this knowledge following human instructions and preferences. However, this method may result in inefficient knowledge memorization due to a lack of awareness of knowledge utilization and imposes substantial demands on LLMs to simultaneously learn knowledge utilization and format alignment with limited training samples. To facilitate the domain adaptation of LLM, we revise this process and propose a new domain adaptation framework including domain knowledge learning and general format alignment, called Mix-CPT. Specifically, we first conduct a knowledge mixture continual pre-training that concurrently focuses on knowledge memorization and utilization, allowing for mutual reinforcement. To avoid catastrophic forgetting during the continual pre-training process, we further incorporate a logit swap self-distillation constraint. Subsequently, leveraging the knowledge and capabilities acquired during continual pre-training, we efficiently perform instruction tuning and alignment with a few general training samples to achieve format alignment. Extensive experiments demonstrate that our proposed Mix-CPT framework can simultaneously improve the task-solving capabilities of LLMs on the target and general domains compared to the traditional adaptation methods.

* LLM, CPT, knowledge learning, format alignment; work in progress

Via

Access Paper or Ask Questions

Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

Jul 12, 2024

Jun Zhu, Zihao Du, Haotian Xu, Fengbo Lan, Zilong Zheng, Bo Ma, Shengjie Wang, Tao Zhang

Figure 1 for Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

Figure 2 for Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

Figure 3 for Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

Figure 4 for Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

Abstract:Task-aware navigation continues to be a challenging area of research, especially in scenarios involving open vocabulary. Previous studies primarily focus on finding suitable locations for task completion, often overlooking the importance of the robot's pose. However, the robot's orientation is crucial for successfully completing tasks because of how objects are arranged (e.g., to open a refrigerator door). Humans intuitively navigate to objects with the right orientation using semantics and common sense. For instance, when opening a refrigerator, we naturally stand in front of it rather than to the side. Recent advances suggest that Vision-Language Models (VLMs) can provide robots with similar common sense. Therefore, we develop a VLM-driven method called Navigation-to-Gaze (Navi2Gaze) for efficient navigation and object gazing based on task descriptions. This method uses the VLM to score and select the best pose from numerous candidates automatically. In evaluations on multiple photorealistic simulation benchmarks, Navi2Gaze significantly outperforms existing approaches and precisely determines the optimal orientation relative to target objects.

Via

Access Paper or Ask Questions

PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Jul 11, 2024

Miao Zheng, Hao Liang, Fan Yang, Haoze Sun, Tianpeng Li, Lingchu Xiong, Yan Zhang, Youzhen Wu, Kun Li, Yanjun Shen(+9 more)

Figure 1 for PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Figure 2 for PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Figure 3 for PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Figure 4 for PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Abstract:In recent years, the rise of Large Language Models (LLMs) has spurred a growing demand for plug-and-play AI systems. Among the various AI techniques, prompt engineering stands out as particularly significant. However, users often face challenges in writing prompts due to the steep learning curve and significant time investment, and existing automatic prompt engineering (APE) models can be difficult to use. To address this issue, we propose PAS, an LLM-based plug-and-play APE system. PAS utilizes LLMs trained on high-quality, automatically generated prompt complementary datasets, resulting in exceptional performance. In comprehensive benchmarks, PAS achieves state-of-the-art (SoTA) results compared to previous APE models, with an average improvement of 6.09 points. Moreover, PAS is highly efficient, achieving SoTA performance with only 9000 data points. Additionally, PAS can autonomously generate prompt augmentation data without requiring additional human labor. Its flexibility also allows it to be compatible with all existing LLMs and applicable to a wide range of tasks. PAS excels in human evaluations, underscoring its suitability as a plug-in for users. This combination of high performance, efficiency, and flexibility makes PAS a valuable system for enhancing the usability and effectiveness of LLMs through improved prompt engineering.

Via

Access Paper or Ask Questions

A Generative Approach to Control Complex Physical Systems

Jul 09, 2024

Long Wei, Peiyan Hu, Ruiqi Feng, Haodong Feng, Yixuan Du, Tao Zhang, Rui Wang, Yue Wang, Zhi-Ming Ma, Tailin Wu

Figure 1 for A Generative Approach to Control Complex Physical Systems

Figure 2 for A Generative Approach to Control Complex Physical Systems

Figure 3 for A Generative Approach to Control Complex Physical Systems

Figure 4 for A Generative Approach to Control Complex Physical Systems

Abstract:Controlling the evolution of complex physical systems is a fundamental task across science and engineering. Classical techniques suffer from limited applicability or huge computational costs. On the other hand, recent deep learning and reinforcement learning-based approaches often struggle to optimize long-term control sequences under the constraints of system dynamics. In this work, we introduce Diffusion Physical systems Control (DiffPhyCon), a new class of method to address the physical systems control problem. DiffPhyCon excels by simultaneously minimizing both the learned generative energy function and the predefined control objectives across the entire trajectory and control sequence. Thus, it can explore globally and identify near-optimal control sequences. Moreover, we enhance DiffPhyCon with prior reweighting, enabling the discovery of control sequences that significantly deviate from the training distribution. We test our method in 1D Burgers' equation and 2D jellyfish movement control in a fluid environment. Our method outperforms widely applied classical approaches and state-of-the-art deep learning and reinforcement learning methods. Notably, DiffPhyCon unveils an intriguing fast-close-slow-open pattern observed in the jellyfish, aligning with established findings in the field of fluid dynamics.

Via

Access Paper or Ask Questions

MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Jul 03, 2024

Zihao Wang, Haoxuan Liu, Jiaxing Yu, Tao Zhang, Yan Liu, Kejun Zhang

Figure 1 for MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Figure 2 for MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Figure 3 for MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Figure 4 for MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Abstract:Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets pre-set with expert annotations or auto-generated ones with inherent biases, CaiMD caters more sufficiently to our purpose of aligning AI-generated music with widespread user-desired results. Moreover, we propose an innovative single-stage framework called MuDiT/MuSiT for enabling effective human-machine alignment in song creation. This framework not only achieves cross-modal comprehension between colloquial language and auditory music perceptions but also ensures generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components like melody, harmony, rhythm, vocals, and instrumentation. The approach ensures harmonious sonic cohesiveness amongst all generated musical components, facilitating better resonance with human auditory expectations.

* 19 pages, 5 figures

Via

Access Paper or Ask Questions

Mobile Robot Oriented Large-Scale Indoor Dataset for Dynamic Scene Understanding

Jun 28, 2024

Yifan Tang, Cong Tai, Fangxing Chen, Wanting Zhang, Tao Zhang, Xueping Liu, Yongjin Liu, Long Zeng

Abstract:Most existing robotic datasets capture static scene data and thus are limited in evaluating robots' dynamic performance. To address this, we present a mobile robot oriented large-scale indoor dataset, denoted as THUD (Tsinghua University Dynamic) robotic dataset, for training and evaluating their dynamic scene understanding algorithms. Specifically, the THUD dataset construction is first detailed, including organization, acquisition, and annotation methods. It comprises both real-world and synthetic data, collected with a real robot platform and a physical simulation platform, respectively. Our current dataset includes 13 larges-scale dynamic scenarios, 90K image frames, 20M 2D/3D bounding boxes of static and dynamic objects, camera poses, and IMU. The dataset is still continuously expanding. Then, the performance of mainstream indoor scene understanding tasks, e.g. 3D object detection, semantic segmentation, and robot relocalization, is evaluated on our THUD dataset. These experiments reveal serious challenges for some robot scene understanding tasks in dynamic scenes. By sharing this dataset, we aim to foster and iterate new mobile robot algorithms quickly for robot actual working dynamic environment, i.e. complex crowded dynamic scenes.

Via

Access Paper or Ask Questions

Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

Jun 27, 2024

Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, Chen Change Loy

Abstract:Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operation, which achieves the best for both accuracy and efficiency. In addition, we design an efficient decoder to utilize the multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with the same-scale transformer model, RWKV-SAM achieves more than 2x speedup and can achieve better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.

* 16 pages; 8 figures

Via

Access Paper or Ask Questions

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Jun 27, 2024

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan

Figure 1 for OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Figure 2 for OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Figure 3 for OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Figure 4 for OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Abstract:Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.

Via

Access Paper or Ask Questions

GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Jun 20, 2024

Tao Zhang, Ziqian Zeng, Yuxiang Xiao, Huiping Zhuang, Cen Chen, James Foulds, Shimei Pan

Figure 1 for GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Figure 2 for GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Figure 3 for GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Figure 4 for GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Abstract:Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.

Via

Access Paper or Ask Questions