Yuhao Zhang

DragVideo: Interactive Drag-style Video Editing

Dec 03, 2023
Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, Chi-Keung Tang

Editing visual content in videos remains a formidable challenge, with two main issues: 1) providing direct and easy user control, and 2) producing natural editing results without unsightly distortion and artifacts after changing shape, expression, and layout. Inspired by DragGAN, a recent image-based drag-style editing technique, we address the above issues by proposing DragVideo, where a similar drag-style user interaction is adopted to edit video content while maintaining temporal consistency. Empowered by recent diffusion models as in DragDiffusion, DragVideo contains the novel Drag-on-Video U-Net (DoVe) editing method, which optimizes diffused video latents generated by a video U-Net to achieve the desired control. Specifically, we use Sample-specific LoRA fine-tuning and Mutual Self-Attention control to ensure faithful reconstruction of the video by the DoVe method. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion editing and skeleton editing, underscoring DragVideo's versatility and generality. Our code, including the DragVideo web user interface, will be released.
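
For intuition about the drag-style optimization described above, here is a minimal, hedged sketch in the DragGAN/DragDiffusion spirit; the toy feature extractor, point coordinates, and step size are assumptions for illustration and do not reproduce the DoVe method.

```python
# Minimal drag-style latent optimization sketch: features sampled at handle points
# are pulled one step toward the target points, and the latent is updated by
# gradient descent. The conv layer below is a toy stand-in for a video U-Net.
import torch
import torch.nn.functional as F

def bilinear_sample(feat, pts):
    """Sample feature vectors at (x, y) pixel locations via grid_sample."""
    _, _, H, W = feat.shape
    grid = torch.stack([pts[:, 0] / (W - 1) * 2 - 1,      # normalize x to [-1, 1]
                        pts[:, 1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, -1, 1, 2)
    return F.grid_sample(feat, grid, align_corners=True).squeeze(-1)[0].T  # (P, C)

def drag_step(latent, feature_net, handles, targets, lr=0.01):
    """One motion-supervision update pulling handle-point features toward the targets."""
    latent = latent.clone().requires_grad_(True)
    feat = feature_net(latent)                              # (1, C, H, W) feature map
    step = F.normalize(targets - handles, dim=-1)           # unit drag direction per point
    cur = bilinear_sample(feat, handles)
    nxt = bilinear_sample(feat, handles + step)
    loss = F.l1_loss(nxt, cur.detach())                     # move content along the drag direction
    loss.backward()
    with torch.no_grad():
        latent -= lr * latent.grad
    return latent.detach()

# toy usage: a random "video-frame latent" and a conv layer standing in for the video U-Net
latent = torch.randn(1, 4, 32, 32)
feature_net = torch.nn.Conv2d(4, 8, 3, padding=1)
latent = drag_step(latent, feature_net,
                   handles=torch.tensor([[10.0, 12.0]]), targets=torch.tensor([[18.0, 12.0]]))
```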

Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models

Nov 30, 2023
Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo

Pattern recognition through the fusion of RGB frames and event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from key issues such as semantic gaps and small-scale backbone networks. In this study, we introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision-language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (the CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we first convert them into language descriptions through prompt engineering, and then obtain the semantic features using the pre-trained large-scale language model (the CLIP text encoder). Subsequently, we integrate the RGB/event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code will be made available at https://github.com/Event-AHU/SAFE_LargeVLM.
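
As a rough illustration of the fusion scheme described above, the sketch below combines frame, event, and text token sequences with self-attention, cross-attention, and a feed-forward head; the dimensions, single-layer depth, and the random tensors standing in for CLIP encoder outputs are assumptions, not the released SAFE code.

```python
# Hedged fusion-head sketch: frame/event tokens are amplified with self-attention,
# text tokens attend to them via cross-attention, and pooled features feed a classifier.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, dim=512, heads=8, num_classes=300):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, rgb_tok, evt_tok, txt_tok):
        vis = torch.cat([rgb_tok, evt_tok], dim=1)          # concatenate frame and event tokens
        vis, _ = self.self_attn(vis, vis, vis)               # amplify RGB/event tokens
        txt, _ = self.cross_attn(txt_tok, vis, vis)          # text attends to RGB/event tokens
        fused = torch.cat([vis, txt], dim=1)
        fused = fused + self.ffn(fused)
        return self.cls(fused.mean(dim=1))                   # pooled logits over semantic labels

# toy usage with random token sequences standing in for CLIP features (batch=2, dim=512)
head = FusionHead()
logits = head(torch.randn(2, 16, 512), torch.randn(2, 16, 512), torch.randn(2, 8, 512))
```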

* In Peer Review 

TrainerAgent: Customizable and Efficient Model Training through LLM-Powered Multi-Agent System

Nov 23, 2023
Haoyuan Li, Hao Jiang, Tianke Zhang, Zhelun Yu, Aoxiong Yin, Hao Cheng, Siming Fu, Yuhao Zhang, Wanggui He

Training AI models has always been challenging, especially when there is a need for custom models to provide personalized services. Algorithm engineers often face a lengthy process to iteratively develop models tailored to specific business requirements, making it even more difficult for non-experts. The quest for high-quality and efficient model development, along with the emergence of Large Language Model (LLM) agents, has become a key focus in the industry. Leveraging the powerful analytical, planning, and decision-making capabilities of LLMs, we propose TrainerAgent, a multi-agent framework comprising Task, Data, Model, and Server agents. These agents analyze user-defined tasks, input data, and requirements (e.g., accuracy, speed), optimize them comprehensively from both data and model perspectives to obtain satisfactory models, and finally deploy these models as online services. Experimental evaluations on classical discriminative and generative tasks in computer vision and natural language processing demonstrate that our system consistently produces models that meet the desired criteria. Furthermore, the system can critically identify and reject unattainable tasks, such as fantastical scenarios or unethical requests, ensuring robustness and safety. This research presents a significant advancement in achieving desired models with greater efficiency and quality compared to traditional model development, facilitated by the integration of LLM-powered analysis, decision-making, and execution capabilities, as well as the collaboration among the four agents. We anticipate that our work will contribute to research on TrainerAgent in both academia and industry, potentially establishing it as a new paradigm for model development in AI.
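
A skeleton of the Task -> Data -> Model -> Server hand-off described above might look as follows; the class names, method names, and placeholder logic are assumptions made for illustration and are not the authors' TrainerAgent API.

```python
# Illustrative orchestration skeleton: a spec is passed through four agents that
# analyze the task, prepare data, train a model, and deploy it as a service.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Spec:
    task: str
    requirements: dict            # e.g. {"accuracy": 0.9, "latency_ms": 50}
    data: list = field(default_factory=list)
    model: Optional[str] = None
    endpoint: Optional[str] = None

class TaskAgent:
    def analyze(self, spec: Spec) -> Spec:
        # an LLM call would parse the task and reject unattainable or unethical requests
        if "unethical" in spec.task:
            raise ValueError("rejected: task violates safety constraints")
        return spec

class DataAgent:
    def prepare(self, spec: Spec) -> Spec:
        spec.data = ["train.jsonl", "val.jsonl"]   # placeholder dataset curation
        return spec

class ModelAgent:
    def train(self, spec: Spec) -> Spec:
        spec.model = "finetuned-checkpoint"        # placeholder train/evaluate loop
        return spec

class ServerAgent:
    def deploy(self, spec: Spec) -> Spec:
        spec.endpoint = "http://localhost:8000/predict"
        return spec

def run_pipeline(task: str, requirements: dict) -> Spec:
    spec = Spec(task=task, requirements=requirements)
    spec = TaskAgent().analyze(spec)
    spec = DataAgent().prepare(spec)
    spec = ModelAgent().train(spec)
    return ServerAgent().deploy(spec)

print(run_pipeline("classify product reviews", {"accuracy": 0.9}))
```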

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

Nov 07, 2023
Yuhao Zhang, Chen Xu, Bei Li, Hao Chen, Tong Xiao, Chunliang Zhang, Jingbo Zhu

Significant improvements in end-to-end speech translation (ST) have been achieved through the application of multi-task learning. However, the extent to which auxiliary tasks are consistent with the ST task, and how much this approach truly helps, have not been thoroughly studied. In this paper, we investigate the consistency between different tasks, considering different training stages and modules. We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations. Furthermore, we propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the differences in length and representation. We conduct experiments on the MuST-C dataset. The results demonstrate that our method attains state-of-the-art results. Moreover, when additional data is used, we achieve a new SOTA result on the MuST-C English-to-Spanish task with only 20.8% of the training time required by the current SOTA method.
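
To make the idea of bridging the length and representation differences concrete, here is a hedged sketch of a combined multi-task objective in which speech states are pooled to the text length before an alignment term is added; the pooling choice and loss weights are assumptions, not the exact IMTL formulation.

```python
# Sketch of a multi-task ST objective with a modality-gap term: speech encoder
# states are pooled down to the text length, then pulled toward the text states.
import torch
import torch.nn.functional as F

def imtl_style_loss(st_loss, mt_loss, speech_states, text_states, w_mt=1.0, w_align=0.5):
    """speech_states: (B, T_speech, D); text_states: (B, T_text, D)."""
    # mitigate the length difference: pool speech frames down to the text length
    pooled = F.adaptive_avg_pool1d(speech_states.transpose(1, 2),
                                   text_states.size(1)).transpose(1, 2)
    # mitigate the representation difference: pull pooled speech toward text states
    align_loss = F.mse_loss(pooled, text_states.detach())
    return st_loss + w_mt * mt_loss + w_align * align_loss

# toy usage with placeholder losses and random encoder states
st, mt = torch.tensor(2.3), torch.tensor(1.7)
loss = imtl_style_loss(st, mt, torch.randn(4, 200, 256), torch.randn(4, 20, 256))
```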

* Accepted to EMNLP 2023 main conference 

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

Sep 21, 2023
Chen Xu, Xiaoqian Liu, Erfeng He, Yuhao Zhang, Qianqian Dong, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang

In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Using the transcript and the translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, which establishes new state-of-the-art performance on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
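
A minimal sketch of the dual-CTC idea follows: one CTC loss supervises the source transcript and another supervises the target translation, both computed over the speech frames; the loss weights and where each loss attaches in the encoder are assumptions for illustration, not the exact BiL-CTC+ configuration.

```python
# Hedged bilingual CTC objective: two CTC losses over the same speech frames,
# one targeting the source transcript and one targeting the target translation.
import torch
import torch.nn.functional as F

def bilingual_ctc_loss(src_logits, tgt_logits, src_ids, tgt_ids,
                       feat_lens, src_lens, tgt_lens, w_src=0.5, w_tgt=0.5):
    """logits: (T, B, V) over speech frames; *_ids: (B, L) label sequences."""
    src_lp = F.log_softmax(src_logits, dim=-1)
    tgt_lp = F.log_softmax(tgt_logits, dim=-1)
    loss_src = F.ctc_loss(src_lp, src_ids, feat_lens, src_lens, blank=0, zero_infinity=True)
    loss_tgt = F.ctc_loss(tgt_lp, tgt_ids, feat_lens, tgt_lens, blank=0, zero_infinity=True)
    return w_src * loss_src + w_tgt * loss_tgt

# toy usage: 50 frames, batch 2, vocab 100, label length 10 on each side
T, B, V, L = 50, 2, 100, 10
loss = bilingual_ctc_loss(torch.randn(T, B, V), torch.randn(T, B, V),
                          torch.randint(1, V, (B, L)), torch.randint(1, V, (B, L)),
                          torch.full((B,), T, dtype=torch.long),
                          torch.full((B,), L, dtype=torch.long),
                          torch.full((B,), L, dtype=torch.long))
```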

* Submitted to ICASSP 2024 

Channel sensing for holographic MIMO surfaces based on interference principle

Aug 20, 2023
Jindiao Huang, Yuyao Wu, Haifan Yin, Yuhao Zhang, Ruikun Zhang

Holographic Multiple-Input Multiple-Output (HMIMO) provides a new paradigm for building a more cost-effective wireless communication architecture. In this paper, we derive the principles of holographic interference theory for electromagnetic wave reception and transmission, whereby optical holography is extended to communication holography, and we establish a channel sensing architecture for holographic MIMO surfaces. Unlike traditional pilot-based MIMO channel estimation approaches, the proposed architecture circumvents complicated processes such as filtering, analog-to-digital conversion (ADC), and down-conversion. Instead, it relies on interfering the object waves with a pre-designed reference wave, thereby reducing hardware complexity and requiring fewer time-frequency resources for channel estimation. To address the self-interference problem in the holographic recording process, we propose a phase shifting-based interference suppression (PSIS) method based on the structural characteristics of the communication hologram and the interference composition. We then propose a Prony-based multi-user channel segmentation (PMCS) algorithm to acquire the channel state information (CSI). Our theoretical analysis shows that the estimation error of the PMCS algorithm converges to zero as the number of HMIMO surface antennas grows large. Simulation results show that, under the holographic architecture, our proposed algorithm can accurately estimate the CSI in multi-user scenarios.
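
For intuition about interference-based sensing with a pre-designed reference wave, the sketch below shows the textbook four-step phase-shifting reconstruction, in which four intensity recordings taken with shifted reference phases cancel the self-interference terms and recover the object wave; this is offered only as background intuition and is not the paper's PSIS or PMCS algorithm.

```python
# Classic four-step phase-shifting reconstruction: combine intensity recordings
# taken with reference phase shifts of 0, pi/2, pi, 3pi/2 to recover the object wave.
import numpy as np

rng = np.random.default_rng(0)
obj = rng.normal(size=8) + 1j * rng.normal(size=8)       # unknown "object wave" (channel)
ref = 2.0 * np.exp(1j * 0.3)                              # known reference wave

shifts = [0, np.pi / 2, np.pi, 3 * np.pi / 2]
I = [np.abs(obj + ref * np.exp(1j * d)) ** 2 for d in shifts]   # recorded intensities

# combining the four interferograms cancels the self-interference terms
recovered = ((I[0] - I[2]) + 1j * (I[1] - I[3])) / (4 * np.conj(ref))

print(np.allclose(recovered, obj))   # True: object wave recovered up to numerics
```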

Falsification of a Vision-based Automatic Landing System

Jul 04, 2023
Sara Shoouri, Shayan Jalili, Jiahong Xu, Isabelle Gallagher, Yuhao Zhang, Joshua Wilhelm, Necmiye Ozay, Jean-Baptiste Jeannin

At smaller airports without an instrument approach or advanced equipment, automatic landing of aircraft is a safety-critical task that requires the use of sensors present on the aircraft. In this paper, we study falsification of an automatic landing system for fixed-wing aircraft using a camera as its main sensor. We first present an architecture for vision-based automatic landing, including a vision-based runway distance and orientation estimator and an associated PID controller. We then outline landing specifications that we validate with actual flight data. Using these specifications, we propose the use of the falsification tool Breach to find counterexamples to the specifications in the automatic landing system. Our experiments are implemented using a Beechcraft Baron 58 in the X-Plane flight simulator communicating with MATLAB Simulink.
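
In the same spirit as the falsification setup above, the toy loop below randomly samples initial conditions, simulates a simple point-mass descent under a proportional glide-slope controller, and reports any trace whose glide-slope deviation exceeds a bound; the dynamics, gains, and threshold are assumptions for illustration, whereas the paper uses Breach against an X-Plane/Simulink model.

```python
# Toy falsification-by-sampling loop over initial altitude and distance-to-runway.
import random

def simulate(alt0, dist0, kp=0.08, dt=0.5, speed=40.0, slope=0.0524):
    """Point-mass descent under a proportional glide-slope controller; returns
    the worst glide-slope deviation (m) over the last 1000 m of the approach."""
    alt, dist, worst = alt0, dist0, 0.0
    while dist > 0:
        err = alt - slope * dist                  # deviation from a ~3-degree glideslope
        if dist < 1000:
            worst = max(worst, abs(err))
        alt -= (slope * speed + kp * err) * dt    # nominal sink rate plus correction
        dist -= speed * dt
    return worst

def falsify(trials=2000, bound=15.0, seed=0):
    """Random-sampling falsifier: return an initial condition violating the spec, if any."""
    rng = random.Random(seed)
    for _ in range(trials):
        alt0, dist0 = rng.uniform(100, 400), rng.uniform(1500, 6000)
        if simulate(alt0, dist0) > bound:         # specification |deviation| <= bound violated
            return {"alt0_m": alt0, "dist0_m": dist0}
    return None

print(falsify())
```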

* AIAA Scitech 2021 Forum 

Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

May 30, 2023
Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Dan Roth, Bing Xiang

The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLMs) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias (the tendency to generate certain tokens over others regardless of prompt changes) and from a high dependency on PLM quality (only models using GPT-3 can achieve the best results). To address these challenges, we propose RASO, a new VQA pipeline that, for the first time, deploys a generate-then-select strategy guided by world knowledge. Rather than following the de facto standard of training a multi-modal model that directly generates the VQA answer, RASO first adopts a PLM to generate all possible answers, and then trains a lightweight answer selection model to pick the correct answer. As shown in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. We provide extensive experimentation and show the effectiveness of our pipeline by advancing the state of the art by 4.1% on OK-VQA, without additional computation cost. Code and models are released at http://cogcomp.org/page/publication_view/1010
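
The generate-then-select pattern can be sketched as follows: a (stubbed) PLM proposes candidate answers and a lightweight selector scores each candidate against the question context; the stub generator, scorer architecture, and embedding dimensions are assumptions for illustration, not RASO's released pipeline.

```python
# Hedged generate-then-select sketch: candidates from a stubbed PLM are scored
# by a small selector and the highest-scoring answer is returned.
import torch
import torch.nn as nn

def generate_candidates(question: str, caption: str) -> list:
    # stand-in for prompting a PLM with the question and an image caption
    return ["surfing", "skateboarding", "swimming"]

class AnswerSelector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, context_emb, candidate_embs):
        # score each candidate against the shared question/image context embedding
        ctx = context_emb.expand(candidate_embs.size(0), -1)
        return self.scorer(torch.cat([ctx, candidate_embs], dim=-1)).squeeze(-1)

# toy usage with random embeddings standing in for encoded text
cands = generate_candidates("What is the man doing?", "a man on a wave")
selector = AnswerSelector()
scores = selector(torch.randn(1, 256), torch.randn(len(cands), 256))
print(cands[scores.argmax().item()])
```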

* Accepted to ACL 2023 Findings 

CTC-based Non-autoregressive Speech Translation

May 27, 2023
Chen Xu, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma, Jingbo Zhu

Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for its advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST). In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides poses obvious challenges: 1) the conditionally independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve training convergence. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$, which is comparable to the autoregressive counterpart and even outperforms the previous best result by 0.9 BLEU points.
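
On the target side, non-autoregressive output can be read off the CTC predictions in a single parallel pass; the sketch below shows standard greedy CTC decoding (framewise argmax, collapse repeats, drop blanks), with the vocabulary size and blank index as illustrative assumptions.

```python
# Standard greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
import torch

def ctc_greedy_decode(logits, blank=0):
    """logits: (T, V) target-side CTC outputs for one utterance."""
    ids = logits.argmax(dim=-1).tolist()
    out, prev = [], None
    for i in ids:
        if i != blank and i != prev:   # collapse repeats, skip blanks
            out.append(i)
        prev = i
    return out

# toy usage: 30 frames over a 50-token vocabulary
print(ctc_greedy_decode(torch.randn(30, 50)))
```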

* ACL 2023 Main Conference 