Yu Wu

Wuhan University

Accelerating Transducers through Adjacent Token Merging

Jun 28, 2023

Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Abstract:Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embeddings at a high frame rate. However, this design is inefficient, particularly for long speech signals, due to the quadratic computation of self-attention. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. In this way, the total number of time steps is reduced, and the inference of both the encoder and joint network is accelerated. Experiments on LibriSpeech show that our method can reduce the number of tokens by 57% and improve the inference speed on GPU by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective solution to reduce tokens in long-form ASR, where the input speech consists of multiple utterances.

* Interspeech 2023
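
The merging idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `merge_adjacent_tokens`, the greedy left-to-right pairing, the cosine-similarity threshold, and the simple averaging are all assumptions chosen for illustration.

```python
import numpy as np

def merge_adjacent_tokens(tokens, keys, threshold=0.9):
    """Greedily merge neighbouring tokens whose attention keys are similar.

    tokens, keys: arrays of shape (T, D). Both the threshold and the
    averaging rule are illustrative assumptions, not values from the paper.
    """
    tokens = np.asarray(tokens, dtype=float)
    keys = np.asarray(keys, dtype=float)
    # Unit-normalize keys so that dot products are cosine similarities.
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-12)

    merged = []
    t = 0
    while t < len(tokens):
        is_similar = (
            t + 1 < len(tokens)
            and float(unit[t] @ unit[t + 1]) > threshold
        )
        if is_similar:
            # Replace the pair with its average and skip both tokens.
            merged.append((tokens[t] + tokens[t + 1]) / 2.0)
            t += 2
        else:
            merged.append(tokens[t])
            t += 1
    return np.stack(merged)

x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
shorter = merge_adjacent_tokens(x, x)  # the first two tokens collapse into one
```

Because each token takes part in at most one merge per pass, a single pass can at most halve the sequence; applying such a step in several layers would shorten it gradually.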

Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition

Jun 28, 2023

Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Abstract:The integration of Language Models (LMs) has proven to be an effective way to address domain shifts in speech recognition. However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM). The LLM is used in two ways: 1) second-pass rescoring: reranking N-best hypotheses of a given ASR system with LLaMA; 2) deep LLM-fusion: incorporating the LLM into the decoder of an encoder-decoder based ASR system. Experiments show that, with only one domain prompt, both methods can effectively reduce word error rates (WER) on out-of-domain TedLium-2 and SPGISpeech datasets. In particular, deep LLM-fusion has the advantage of better recall of entities and out-of-vocabulary words.
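
Second-pass rescoring, the first of the two methods named above, can be sketched roughly as follows. Everything here is a hypothetical stand-in: `rescore_nbest`, the interpolation weight `lm_weight`, and the toy word-overlap scorer `toy_lm_score` are assumptions for illustration, and a real system would query the LLM for the log-probability of each hypothesis given the domain prompt.

```python
def rescore_nbest(hypotheses, lm_score, prompt, lm_weight=0.5):
    """Pick the hypothesis with the best combined ASR + LM score.

    hypotheses: list of (text, asr_log_prob) pairs.
    lm_score:   callable (prompt, text) -> score; a stand-in for an LLM
                log-probability conditioned on the domain prompt.
    """
    def combined(item):
        text, asr_log_prob = item
        return asr_log_prob + lm_weight * lm_score(prompt, text)

    return max(hypotheses, key=combined)[0]

def toy_lm_score(prompt, text):
    # Hypothetical scorer: counts hypothesis words that occur in the prompt.
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in text.lower().split()))

nbest = [("the sell divides", -1.0), ("the cell divides", -1.2)]
prompt = "biology lecture about cell division"
best = rescore_nbest(nbest, toy_lm_score, prompt)
```

With the toy scorer, the in-domain word "cell" is enough to promote the second hypothesis over the acoustically preferred first one.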

Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

Jun 17, 2023

Yongqi Yang, Ruoyu Wang, Zhihao Qian, Ye Zhu, Yu Wu

Abstract:Text-to-Image (T2I) generation with diffusion models allows users to control the semantic content in the synthesized images given text conditions. As a further step toward a more customized image creation application, we introduce a new multi-modality generation setting that synthesizes images based on not only the semantic-level textual input but also on the pixel-level visual conditions. Existing literature first converts the given visual information to semantic-level representation by connecting it to languages, and then incorporates it into the original denoising process. Seemingly intuitive, such methodological design loses the pixel values during the semantic transition, thus failing to fulfill the task scenario where the preservation of low-level vision is desired (e.g., ID of a given face image). To this end, we propose Cyclic One-Way Diffusion (COW), a training-free framework for creating customized images with respect to semantic text and pixel-visual conditioning. Notably, we observe that sub-regions of an image impose mutual interference, just like physical diffusion, to achieve ultimate harmony along the denoising trajectory. Thus we propose to repetitively utilize the given visual condition in a cyclic way, by planting the visual condition as a high-concentration "seed" at the initialization step of the denoising process, and "diffuse" it into a harmonious picture by controlling a one-way information flow from the visual condition. We repeat the destroy-and-construct process multiple times to gradually but steadily impose the internal diffusion process within the image. Experiments on the challenging one-shot face and text-conditioned image synthesis task demonstrate our superiority in terms of speed, image quality, and conditional fidelity compared to learning-based text-vision conditional methods. Project page is available at: https://bigaandsmallq.github.io/COW/

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Jun 14, 2023

Yingying Fan, Yu Wu, Yutian Lin, Bo Du

Abstract:We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook the segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events in the segment is challenging because its label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, where the event of the most similar prompt is regarded as the segment-level label. In addition, to deal with the mislabeled segments, we propose to perform dynamic re-weighting on the unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.
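
The prompt-matching step described above (pick, for each segment, the event set of the most similar language prompt) can be sketched as below. The embeddings are assumed to come from some shared audio/visual-text encoder that is not shown; `label_segments` and the toy vectors are hypothetical.

```python
import numpy as np

def label_segments(segment_emb, prompt_emb, prompt_events):
    """Assign each segment the events of its most similar prompt.

    segment_emb:   (S, D) segment embeddings (assumed precomputed).
    prompt_emb:    (P, D) embeddings of the language prompts.
    prompt_events: list of P event tuples, one per prompt.
    """
    seg = np.asarray(segment_emb, dtype=float)
    pr = np.asarray(prompt_emb, dtype=float)
    # Cosine similarity between every segment and every prompt.
    seg = seg / np.maximum(np.linalg.norm(seg, axis=1, keepdims=True), 1e-12)
    pr = pr / np.maximum(np.linalg.norm(pr, axis=1, keepdims=True), 1e-12)
    similarity = seg @ pr.T                # shape (S, P)
    best_prompt = similarity.argmax(axis=1)
    return [prompt_events[i] for i in best_prompt]

events = [("speech",), ("dog",), ("speech", "dog")]
prompts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
segments = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
labels = label_segments(segments, prompts, events)
```

The dynamic re-weighting of unreliable segments mentioned in the abstract is not covered by this sketch.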

1st Place Solution for PVUW Challenge 2023: Video Panoptic Segmentation

Jun 08, 2023

Tao Zhang, Xingye Tian, Haoran Wei, Yu Wu, Shunping Ji, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan

Abstract:Video panoptic segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. We believe that the decoupling strategy proposed by DVIS enables more effective utilization of temporal information for both "thing" and "stuff" objects. In this report, we successfully validated the effectiveness of the decoupling strategy in video panoptic segmentation. Finally, our method achieved a VPQ score of 51.4 and 53.7 in the development and test phases, respectively, and ultimately ranked 1st in the VPS track of the 2nd PVUW Challenge. The code is available at https://github.com/zhang-tao-whu/DVIS

DVIS: Decoupled Video Instance Segmentation Framework

Jun 08, 2023

Tao Zhang, Xingye Tian, Yu Wu, Shunping Ji, Xuebo Wang, Yuan Zhang, Pengfei Wan

Abstract:Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long videos in the real world, primarily due to two factors. Firstly, offline methods are limited by the tightly-coupled modeling paradigm, which treats all frames equally and disregards the interdependencies between adjacent frames. Consequently, this leads to the introduction of excessive noise during long-term temporal alignment. Secondly, online methods suffer from inadequate utilization of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment outcomes via frame-by-frame association during tracking, and 2) the effective utilization of temporal information predicated on the aforementioned accurate alignment outcomes during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are super light-weight (only 1.69% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11G memory. The code is available at https://github.com/zhang-tao-whu/DVIS.

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

May 31, 2023

Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu

Abstract:Automatic Speech Recognition (ASR) has seen remarkable advancements with deep neural networks, such as Transformer and Conformer. However, these models typically have large model sizes and high inference costs, posing a challenge to deploy on resource-limited devices. In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance. Our approach utilizes a set of binary masks to indicate whether to retain or prune each Conformer module, and employs L0 regularization to learn the optimal mask values. To further enhance pruning performance, we use a layerwise distillation strategy to transfer knowledge from unpruned to pruned models. Our method outperforms all pruning baselines on the widely used LibriSpeech benchmark, achieving a 50% reduction in model size and a 28% reduction in inference cost with minimal performance loss.

* Accepted at INTERSPEECH 2023
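
The abstract above does not say how the binary masks are relaxed for training. One widely used choice for L0 regularization is the hard-concrete gate, sketched below as an assumption rather than as the paper's method; the constants `BETA`, `GAMMA`, and `ZETA` are the commonly used defaults for that relaxation, and the function names are hypothetical.

```python
import math

# Commonly used hard-concrete constants (an assumption, not from the paper).
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def deterministic_gate(log_alpha):
    """Inference-time mask value in [0, 1] for one prunable module."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))

def expected_l0(log_alphas):
    """Differentiable surrogate for the number of modules kept."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(a - shift) for a in log_alphas)

# One learnable parameter per module; values here are made up.
log_alphas = [10.0, -10.0, 0.0]
masks = [deterministic_gate(a) for a in log_alphas]
penalty = expected_l0(log_alphas)
```

Modules whose gate ends at exactly zero can be removed from the network, which is what makes the pruning structured.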

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

May 25, 2023

Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei

Abstract:Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language model task via a multi-task learning framework. To accomplish this, we first convert all the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In such a way, all these tasks are converted to token-based sequence conversion problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks. Experimental results demonstrate that the proposed VioLA model can support both single-modal and cross-modal tasks well, and the decoder-only model achieves comparable or even better performance than the strong baselines.

* Work in progress

Click-Feedback Retrieval

Apr 28, 2023

Zeyu Wang, Yu Wu

Abstract:Retrieving target information based on an input query is of fundamental importance in many real-world applications. In practice, it is not uncommon for the initial search to fail, in which case additional feedback information is needed to guide the search process. In this work, we study a setting where the feedback is provided through users clicking liked and disliked search results. We believe this form of feedback is of great practical interest for its convenience and efficiency. To facilitate future work in this direction, we construct a new benchmark termed click-feedback retrieval based on a large-scale dataset in the fashion domain. We demonstrate that incorporating click-feedback can drastically improve the retrieval performance, which validates the value of the proposed setting. We also introduce several methods to utilize click-feedback during training, and show that click-feedback-guided training can significantly enhance the retrieval quality. We hope further exploration in this direction can bring new insights on building more efficient and user-friendly search engines.

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

Mar 23, 2023

Qifan Yu, Juncheng Li, Yu Wu, Siliang Tang, Wei Ji, Yueting Zhuang

Abstract:Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from long-tail distribution issues: tail predicates are more costly to train and harder to distinguish because they have far less annotated data than frequent predicates. Existing re-balancing strategies try to handle this via prior rules but are still confined to pre-defined conditions, which are not scalable for various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.

* 21 pages, 16 figures
