Abstract:Vision-language pre-training (VLP) models have shown vulnerability to adversarial examples in multimodal tasks. Furthermore, malicious adversaries can be deliberately transferred to attack other black-box models. However, existing work has mainly focused on investigating white-box attacks. In this paper, we present the first study to investigate the adversarial transferability of recent VLP models. We observe that existing methods exhibit much lower transferability, compared to the strong attack performance in white-box settings. The transferability degradation is partly caused by the under-utilization of cross-modal interactions. Particularly, unlike unimodal learning, VLP models rely heavily on cross-modal interactions and the multimodal alignments are many-to-many, e.g., an image can be described in various natural languages. To this end, we propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance. Experimental results demonstrate that SGA could generate adversarial examples that can strongly transfer across different VLP models on multiple downstream vision-language tasks. On image-text retrieval, SGA significantly enhances the attack success rate for transfer attacks from ALBEF to TCL by a large margin (at least 9.78% and up to 30.21%), compared to the state-of-the-art.
Abstract:Launchpad is a musical instrument that allows users to create and perform music by pressing illuminated buttons. To assist and inspire the design of the Launchpad light effect, and provide a more accessible approach for beginners to create music visualization with this instrument, we proposed the LaunchpadGPT model to generate music visualization designs on Launchpad automatically. Based on the language model with excellent generation ability, our proposed LaunchpadGPT takes an audio piece of music as input and outputs the lighting effects of Launchpad-playing in the form of a video (Launchpad-playing video). We collect Launchpad-playing videos and process them to obtain music and corresponding video frame of Launchpad-playing as prompt-completion pairs, to train the language model. The experiment result shows the proposed method can create better music visualization than random generation methods and hold the potential for a broader range of music visualization applications. Our code is available at https://github.com/yunlong10/LaunchpadGPT/.
Abstract:The problem of how to assess cross-modality medical image synthesis has been largely unexplored. The most used measures like PSNR and SSIM focus on analyzing the structural features but neglect the crucial lesion location and fundamental k-space speciality of medical images. To overcome this problem, we propose a new metric K-CROSS to spur progress on this challenging problem. Specifically, K-CROSS uses a pre-trained multi-modality segmentation network to predict the lesion location, together with a tumor encoder for representing features, such as texture details and brightness intensities. To further reflect the frequency-specific information from the magnetic resonance imaging principles, both k-space features and vision features are obtained and employed in our comprehensive encoders with a frequency reconstruction penalty. The structure-shared encoders are designed and constrained with a similarity loss to capture the intrinsic common structural information for both modalities. As a consequence, the features learned from lesion regions, k-space, and anatomical structures are all captured, which serve as our quality evaluators. We evaluate the performance by constructing a large-scale cross-modality neuroimaging perceptual similarity (NIRPS) dataset with 6,000 radiologist judgments. Extensive experiments demonstrate that the proposed method outperforms other metrics, especially in comparison with the radiologists on NIRPS.
Abstract:Our winning entry for the CVPR 2023 Generic Event Boundary Captioning (GEBC) competition is detailed in this paper. Unlike conventional video captioning tasks, GEBC demands that the captioning model possess an understanding of immediate changes in status around the designated video boundary, making it a difficult task. This paper proposes an effective model LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) We utilize a pretrained LLM for generating human-like captions with high quality. (2) To adapt the model to the GEBC task, we take the video Q-former as an adapter and train it with the frozen visual feature extractors and LLM. Our proposed method achieved a 76.14 score on the test set and won the first place in the challenge. Our code is available at https://github.com/zjr2000/LLMVA-GEBC .
Abstract:Extremely large-scale Array (ELAA) promises to deliver ultra-high data rates with more antenna elements. Meanwhile, the increase of antenna elements leads to a wider realm of near-field, which challenges the traditional design of codebooks. In this paper, we propose novel codebook design schemes which provide better quantized correlation with limited overhead. First, we analyze the correlation between codewords and channel vectors uniform linear array (ULA) and uniform planar array (UPA). The correlation formula for the ULA channel can be expressed as an elliptic function, and the correlation formula for the UPA channel can be represented as an ellipsoid formula. Based on the analysis, we design a uniform sampling codebook to maximize the minimum quantized correlation and a dislocation ULA codebook to reduce the number of quantized bits further. Besides, we give a better sampling interval for the codebook of the UPA channel. Numerical results demonstrate the appealing advantages of the proposed codebook over existing methods in quantization bit number and quantization accuracy.
Abstract:Recently, Transformers have emerged as the go-to architecture for both vision and language modeling tasks, but their computational efficiency is limited by the length of the input sequence. To address this, several efficient variants of Transformers have been proposed to accelerate computation or reduce memory consumption while preserving performance. This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation. Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens, bringing several technical contributions: 1) Convolutional activation is used to pre-process the token after patchifying the image to select and rearrange the major tokens and minor tokens, which substantially reduces the computation cost through an additional fusion layer. 2) Instead of using the class activation map of the convolutional model directly, we design a new weighted class activation to lower the model requirements. 3) To facilitate communication between major tokens and fusion tokens, Gated Linear SRA is proposed to further integrate fusion tokens into the attention mechanism. We perform a comprehensive validation of CageViT on the image classification challenge. Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency, while maintaining a comparable level of accuracy (e.g. a moderate-sized 43.35M model trained solely on 224 x 224 ImageNet-1K can achieve Top-1 accuracy of 83.4% accuracy).
Abstract:Recently, the Segment Anything Model (SAM) gains lots of attention rapidly due to its impressive segmentation performance on images. Regarding its strong ability on image segmentation and high interactivity with different prompts, we found that it performs poorly on consistent segmentation in videos. Therefore, in this report, we propose Track Anything Model (TAM), which achieves high-performance interactive tracking and segmentation in videos. To be detailed, given a video sequence, only with very little human participation, i.e., several clicks, people can track anything they are interested in, and get satisfactory results in one-pass inference. Without additional training, such an interactive design performs impressively on video object tracking and segmentation. All resources are available on {https://github.com/gaomingqi/Track-Anything}. We hope this work can facilitate related research.
Abstract:Decentralized minimax optimization has been actively studied in the past few years due to its application in a wide range of machine learning models. However, the current theoretical understanding of its convergence rate is far from satisfactory since existing works only focus on the nonconvex-strongly-concave problem. This motivates us to study decentralized minimax optimization algorithms for the nonconvex-nonconcave problem. To this end, we develop two novel decentralized stochastic variance-reduced gradient descent ascent algorithms for the finite-sum nonconvex-nonconcave problem that satisfies the Polyak-{\L}ojasiewicz (PL) condition. In particular, our theoretical analyses demonstrate how to conduct local updates and perform communication to achieve the linear convergence rate. To the best of our knowledge, this is the first work achieving linear convergence rates for decentralized nonconvex-nonconcave problems. Finally, we verify the performance of our algorithms on both synthetic and real-world datasets. The experimental results confirm the efficacy of our algorithms.
Abstract:Data augmentation is a promising technique for unsupervised anomaly detection in industrial applications, where the availability of positive samples is often limited due to factors such as commercial competition and sample collection difficulties. In this paper, how to effectively select and apply data augmentation methods for unsupervised anomaly detection is studied. The impact of various data augmentation methods on different anomaly detection algorithms is systematically investigated through experiments. The experimental results show that the performance of different industrial image anomaly detection (termed as IAD) algorithms is not significantly affected by the specific data augmentation method employed and that combining multiple data augmentation methods does not necessarily yield further improvements in the accuracy of anomaly detection, although it can achieve excellent results on specific methods. These findings provide useful guidance on selecting appropriate data augmentation methods for different requirements in IAD.
Abstract:Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.