Linjie Yang

Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Dec 19, 2023

Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang

Abstract: A short video clip may contain a progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate the events with one another to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question-answering, promoting an under-explored setting of video understanding with detailed summaries.

* See https://mingfei.info/shot2story for updates and more information

Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling

Oct 11, 2023

Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

Abstract: This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions. To effectively integrate visual and auditory information, Video-Teller builds upon the image-based BLIP-2 model and introduces a cascaded Q-Former which fuses information across frames and ASR texts. To better guide video summarization, we introduce a fine-grained modality alignment objective, where the cascaded Q-Former's output embedding is trained to align with the caption/summary embedding created by a pretrained text auto-encoder. Experimental results demonstrate the efficacy of our proposed video-language foundation model in accurately comprehending videos and generating coherent and precise language descriptions. It is worth noting that the fine-grained alignment enhances the model's capabilities (4% improvement of CIDEr score on MSR-VTT) with only 13% extra parameters in training and zero additional cost in inference.

Selective Feature Adapter for Dense Vision Transformers

Oct 03, 2023

Xueqing Deng, Qi Fan, Xiaojie Jin, Linjie Yang, Peng Wang

Abstract: Fine-tuning pre-trained transformer models, e.g., Swin Transformer, is successful in numerous downstream dense prediction vision tasks. However, one major issue is the cost/storage of their huge number of parameters, which becomes increasingly challenging to handle as the number of vision tasks grows. In this paper, we propose an effective approach to alleviate the issue, namely the selective feature adapter (SFA). It achieves state-of-the-art (SoTA) performance under any given budget of trainable parameters, and demonstrates comparable or better performance than fully fine-tuned models across various dense tasks. Specifically, SFA consists of external adapters and internal adapters which are sequentially operated over a transformer model. For external adapters, we properly select the places and amount of additional multilayer perceptrons (MLPs). For internal adapters, we transform a few task-important parameters inside the transformer, which are automatically discovered through a simple yet effective lottery ticket algorithm. Our experiments show that the dual adapter module, a.k.a. SFA, is essential to achieve the best trade-off on dense vision tasks, such as segmentation, detection and depth estimation, outperforming other adapters with a single module.

The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

Sep 27, 2023

Haichao Yu, Yu Tian, Sateesh Kumar, Linjie Yang, Heng Wang

Abstract: The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes the lessons we learned and our solution from participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, and rebalancing the data distribution to improve the efficiency of allocating the computational budget. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.

* 12 pages, 10 figures

Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation

Jul 27, 2023

Yiming Cui, Linjie Yang, Haichao Yu

Abstract: Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network and learn to predict the location and category of one specific object from each query. We empirically find that random convex combinations of the learned queries are still good for the corresponding models. We then propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image. The generated dynamic queries, named modulated queries, better capture the prior of object locations and categories in the different images. Equipped with our modulated queries, a wide range of DETR-based models achieve consistent and superior performance across multiple tasks including object detection, instance segmentation, panoptic segmentation, and video instance segmentation.

* 12 pages, 4 figures, ICML 2023, code is available at https://github.com/bytedance/DQ-Det
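
The abstract above describes blending learned queries with convex coefficients that depend on the image. The following is a minimal plain-Python sketch of that idea only, not the released DQ-Det code. The softmax parameterization, the `projection` weights, and all toy values are assumptions made for illustration; the paper may compute its coefficients differently.

```python
import math


def softmax(logits):
    """Map arbitrary scores to non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def convex_combination(queries, coefficients):
    """Blend query vectors using convex coefficients (non-negative, summing to 1)."""
    dim = len(queries[0])
    return [
        sum(c * q[i] for c, q in zip(coefficients, queries))
        for i in range(dim)
    ]


def modulated_queries(base_queries, image_feature, projection):
    """Build one blended query per entry of `projection`.

    `projection[j][k]` is a hypothetical weight vector that scores base query k
    for output query j from the image feature; it is an assumption of this
    sketch, not the layer used in the paper.
    """
    blended = []
    for per_query_weights in projection:
        logits = [
            sum(w * f for w, f in zip(weights, image_feature))
            for weights in per_query_weights
        ]
        blended.append(convex_combination(base_queries, softmax(logits)))
    return blended


# Toy data: three 2-D base queries and a 2-D image feature.
base_queries = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
image_feature = [0.5, -1.0]
projection = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # image-dependent coefficients
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # all-zero logits give uniform weights
]
result = modulated_queries(base_queries, image_feature, projection)
```

Because the coefficients are convex, every blended query stays inside the convex hull of the base queries; the second output here is simply their mean.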

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

Jul 22, 2023

Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang

Abstract: Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noise. This motivates us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conduct extensive experiments to explore this property and find that the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL.

* Accepted by ICCV2023

Exploring the Role of Audio in Video Captioning

Jun 21, 2023

Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang

Abstract: Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches have achieved significant improvement, the audio modality is often ignored in video captioning. In this work, we present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. Instead of relying on text transcripts extracted via automatic speech recognition (ASR), we argue that learning with raw audio signals can be more beneficial, as audio has additional information including acoustic events, speaker identity, etc. Our contributions are twofold. First, we observe that the model overspecializes to the audio modality when pre-training with both video and audio modalities, since the ground truth (i.e., text transcripts) can be solely predicted using audio. We propose a Modality Balanced Pre-training (MBP) loss to mitigate this issue and significantly improve the performance on downstream tasks. Second, we slice and dice different design choices of the cross-modal module, which may become an information bottleneck and generate inferior results. We propose new local-global fusion mechanisms to improve information exchange across audio and video. We demonstrate significant improvements by leveraging the audio modality on four datasets, and even outperform the state of the art on some metrics without relying on the text modality as the input.

$R^{2}$Former: Unified $R$etrieval and $R$eranking Transformer for Place Recognition

Apr 06, 2023

Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, Heng Wang

Abstract: Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information and ignores other information that could be useful for reranking, e.g., local feature correlations and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named $R^{2}$Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, $R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves state-of-the-art results on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show that vision transformer tokens are comparable to, and sometimes better than, CNN local features on local matching. The code is released at https://github.com/Jeff-Zilence/R2Former.

* CVPR
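
The two-stage pipeline the abstract describes (global retrieval, then reranking of the shortlisted candidates) can be illustrated with a small plain-Python sketch. This is not the R2Former implementation: the cosine-similarity retrieval, the `pair_score` callable standing in for a learned reranking module, and all toy vectors and scores are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_global, db_globals, k):
    """Stage 1: indices of the k database entries most similar to the query."""
    ranked = sorted(
        range(len(db_globals)),
        key=lambda i: cosine(query_global, db_globals[i]),
        reverse=True,
    )
    return ranked[:k]


def rerank(candidates, pair_score):
    """Stage 2: reorder the shortlist by a pairwise score (higher is better).

    `pair_score` is a hypothetical stand-in for a learned module that judges
    whether the query and a candidate show the same place.
    """
    return sorted(candidates, key=pair_score, reverse=True)


# Toy global descriptors for four reference images and one query.
db_globals = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_global = [1.0, 0.0]

shortlist = retrieve(query_global, db_globals, k=3)

# Made-up pairwise scores for the shortlisted candidates.
toy_pair_scores = {0: 0.2, 1: 0.9, 3: 0.5}
final_order = rerank(shortlist, toy_pair_scores.get)
```

Only the shortlisted candidates reach the second stage, which is why a more expensive pairwise scorer can be affordable there.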

FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors

Mar 20, 2023

Yiming Cui, Linjie Yang

Abstract: Video object detection must handle feature degradation that rarely occurs in the image domain. One solution is to use the temporal information and fuse the features from the neighboring frames. With Transformer-based object detectors achieving better performance on image-domain tasks, recent works have begun to extend those methods to video object detection. However, existing Transformer-based video object detectors still follow the same pipeline as classical object detectors, such as enhancing the object feature representations by aggregation. In this work, we take a different perspective on video object detection. In detail, we improve the quality of the queries for Transformer-based models by aggregation. To achieve this goal, we first propose a vanilla query aggregation module that computes a weighted average of the queries according to the features of the neighboring frames. Then, we extend the vanilla module to a more practical version, which generates and aggregates queries according to the features of the input frames. Extensive experimental results validate the effectiveness of our proposed methods: on the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.

* 12 pages, 4 figures

Revisiting Training-free NAS Metrics: An Efficient Training-based Method

Nov 16, 2022

Taojiannan Yang, Linjie Yang, Xiaojie Jin, Chen Chen

Abstract: Recent neural architecture search (NAS) works have proposed training-free metrics to rank networks, which largely reduce the search cost in NAS. In this paper, we revisit these training-free metrics and find that: (1) the number of parameters (#Param), which is the most straightforward training-free metric, is overlooked in previous works but is surprisingly effective, (2) recent training-free metrics largely rely on the #Param information to rank networks. Our experiments show that the performance of recent training-free metrics drops dramatically when the #Param information is not available. Motivated by these observations, we argue that metrics less correlated with the #Param are desired to provide additional information for NAS. We propose a light-weight training-based metric which has a weak correlation with the #Param while achieving better performance than training-free metrics at a lower search cost. Specifically, on DARTS search space, our method completes searching directly on ImageNet in only 2.6 GPU hours and achieves a top-1/top-5 error rate of 24.1%/7.1%, which is competitive among state-of-the-art NAS methods. Code is available at https://github.com/taoyang1122/Revisit_TrainingFree_NAS.

* Accepted to WACV2023
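
The abstract argues for metrics that are only weakly correlated with #Param. One common way to quantify how strongly a ranking metric tracks parameter count is a rank correlation such as Kendall's tau; the sketch below implements the simple tau-a variant in plain Python. The choice of Kendall's tau and every number in the example are assumptions for illustration, not measurements or the evaluation protocol from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up values for four hypothetical candidate networks.
params_millions = [1.2, 3.4, 2.1, 5.0]
metric_tracking_params = [10, 30, 20, 50]  # orders networks exactly like #Param
metric_independent = [3, 1, 4, 2]          # ordering mostly unrelated to #Param

tau_tracking = kendall_tau(params_millions, metric_tracking_params)
tau_independent = kendall_tau(params_millions, metric_independent)
```

A tau near 1 means the metric adds little ranking information beyond #Param itself, while a value near 0 suggests it captures something different.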
