Yuan Zhao

Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Dec 02, 2024

Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Huchuan Lu, Jinsong Ouyang, Georges El Fakhri, Xiaofeng Liu

Abstract:As a foundational model, SAM has significantly influenced multiple fields within computer vision, and its upgraded version, SAM 2, enhances capabilities in video segmentation, poised to make a substantial impact once again. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, product defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough quantitative evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks. This work aims to provide valuable insights to guide future research in both context-independent and context-dependent concepts segmentation, potentially informing the development of the next version - SAM 3.
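The abstract does not say how imperfect prompts are simulated. As a minimal sketch, assuming a box prompt is perturbed by shifting each edge by a random fraction of the box size, the following shows one way to generate noisy prompts and measure how far they drift from the clean one. The names `jitter_box` and `box_iou` and the `max_shift` parameter are hypothetical and are not taken from the paper.

```python
import random


def jitter_box(box, max_shift, rng):
    """Shift each edge of (x0, y0, x1, y1) by up to max_shift * width/height.

    Assumption: uniform edge jitter stands in for "imperfect prompts";
    the paper's actual perturbation scheme is not given in the abstract.
    Keep max_shift below 0.5 so the box cannot invert.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    dx0 = rng.uniform(-max_shift, max_shift) * w
    dx1 = rng.uniform(-max_shift, max_shift) * w
    dy0 = rng.uniform(-max_shift, max_shift) * h
    dy1 = rng.uniform(-max_shift, max_shift) * h
    return (x0 + dx0, y0 + dy0, x1 + dx1, y1 + dy1)


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


rng = random.Random(0)
clean = (10.0, 20.0, 110.0, 70.0)
noisy = jitter_box(clean, max_shift=0.1, rng=rng)
overlap = box_iou(clean, noisy)  # 1.0 would mean the prompt is unchanged
```

In a robustness test of this kind, the segmentation score obtained with `noisy` would be compared against the score obtained with `clean` over a range of `max_shift` values.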

Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models

Oct 15, 2024

Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant popularity for adapting pre-trained Large Language Models (LLMs) to downstream tasks, primarily due to their potential to significantly reduce memory and computational overheads. However, a common limitation in most PEFT approaches is their application of a uniform architectural design across all layers. This uniformity involves identical trainable modules and ignores the varying importance of each layer, leading to sub-optimal fine-tuning results. To overcome the above limitation and obtain better performance, we develop a novel approach, Importance-aware Sparse Tuning (IST), to fully utilize the inherent sparsity and select the most important subset of full layers with effective layer-wise importance scoring. The proposed IST is a versatile and plug-and-play technique compatible with various PEFT methods that operate on a per-layer basis. By leveraging the estimated importance scores, IST dynamically updates these selected layers in PEFT modules, leading to reduced memory demands. We further provide theoretical proof of convergence and empirical evidence of superior performance to demonstrate the advantages of IST over uniform updating strategies. Extensive experiments on a range of LLMs, PEFTs, and downstream tasks substantiate the effectiveness of our proposed method, showcasing IST's capacity to enhance existing layer-based PEFT methods. Our code is available at https://github.com/Kaiseem/IST.

* EMNLP 2024
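The idea of updating only an importance-selected subset of layers can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that): the abstract does not state how importance scores are estimated or how layers are chosen, so the code below assumes the scores are given as positive numbers and that layers are sampled without replacement in proportion to them. The function names `select_layers` and `sparse_update` are hypothetical.

```python
import random


def select_layers(importance, k, rng):
    """Sample k distinct layer indices, weighted by importance score.

    Assumption: scores are positive and sampling is proportional to them;
    the paper's actual selection rule may differ.
    """
    candidates = list(range(len(importance)))
    weights = list(importance)
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


def sparse_update(params, grads, active_layers, lr):
    """Take a gradient step on the selected layers only; the rest stay frozen."""
    updated = []
    for i, (layer_p, layer_g) in enumerate(zip(params, grads)):
        if i in active_layers:
            updated.append([p - lr * g for p, g in zip(layer_p, layer_g)])
        else:
            updated.append(list(layer_p))
    return updated


rng = random.Random(0)
scores = [0.1, 0.6, 0.2, 0.1]  # hypothetical per-layer importance
active = select_layers(scores, k=2, rng=rng)

params = [[1.0, 2.0], [3.0], [4.0], [5.0]]
grads = [[0.5, 0.5], [1.0], [1.0], [1.0]]
new_params = sparse_update(params, grads, set(active), lr=0.5)
```

Because gradients and optimizer state are needed only for the active layers in each step, a scheme like this is where the reduced memory demand described in the abstract would come from.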

MCDubber: Multimodal Context-Aware Expressive Video Dubbing

Aug 21, 2024

Yuan Zhao, Zhenqi Jia, Rui Liu, De Hu, Feilong Bao, Guanglai Gao

Abstract:Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed MCDubber, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.

Large-scale variational Gaussian state-space models

Mar 03, 2024

Matthew Dowling, Yuan Zhao, Il Memming Park

Abstract:We introduce an amortized variational inference algorithm and structured variational approximation for state-space models with nonlinear dynamics driven by Gaussian noise. Importantly, the proposed framework allows for efficient evaluation of the ELBO and low-variance stochastic gradient estimates without resorting to diagonal Gaussian approximations by exploiting (i) the low-rank structure of Monte-Carlo approximations to marginalize the latent state through the dynamics; (ii) an inference network that approximates the update step with low-rank precision matrix updates; and (iii) encoding current and future observations into pseudo observations -- transforming the approximate smoothing problem into an (easier) approximate filtering problem. Overall, the necessary statistics and ELBO can be computed in $O(TL(Sr + S^2 + r^2))$ time, where $T$ is the series length, $L$ is the state-space dimensionality, $S$ is the number of samples used to approximate the predict step statistics, and $r$ is the rank of the approximate precision matrix update in the update step (which can be made of much lower dimension than $L$).
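To get a feel for the stated complexity, the sketch below evaluates the term $TL(Sr + S^2 + r^2)$ for example sizes and compares it with a cubic-in-$L$ cost per time step. The cubic baseline is an assumption added here for comparison (it is the usual scaling of dense $L \times L$ covariance algebra), not a figure from the abstract, and both counts ignore constant factors. The function names and the example sizes are hypothetical.

```python
def structured_cost(T, L, S, r):
    """Operation count implied by O(T L (S r + S^2 + r^2)), constants dropped."""
    return T * L * (S * r + S**2 + r**2)


def dense_cost(T, L):
    """Assumed baseline: dense L x L matrix algebra at each of T steps."""
    return T * L**3


# Hypothetical problem sizes: 1000 time steps, 100 latent dimensions,
# 10 Monte Carlo samples, rank-5 precision updates.
T, L, S, r = 1000, 100, 10, 5
low_rank = structured_cost(T, L, S, r)
dense = dense_cost(T, L)
speedup = dense / low_rank
```

The structured count grows linearly in $L$ when $S$ and $r$ are held fixed, which is why keeping both much smaller than $L$ matters.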

MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms

Jan 25, 2024

Patrick Lee, Alain Chirino Trujillo, Diana Cuevas Plancarte, Olumide Ebenezer Ojo, Xinyi Liu, Iyanuoluwa Shode, Yuan Zhao, Jing Peng, Anna Feldman

Abstract:This study investigates the computational processing of euphemisms, a universal linguistic phenomenon, across multiple languages. We train a multilingual transformer model (XLM-RoBERTa) to disambiguate potentially euphemistic terms (PETs) in multilingual and cross-lingual settings. In line with current trends, we demonstrate that zero-shot learning across languages takes place. We also show cases where multilingual models perform better on the task compared to monolingual models by a statistically significant margin, indicating that multilingual data presents additional opportunities for models to learn about cross-lingual, computational properties of euphemisms. In a follow-up analysis, we focus on universal euphemistic "categories" such as death and bodily functions among others. We test to see whether cross-lingual data of the same domain is more important than within-language data of other domains to further understand the nature of the cross-lingual transfer.

FEED PETs: Further Experimentation and Expansion on the Disambiguation of Potentially Euphemistic Terms

Jun 06, 2023

Patrick Lee, Iyanuoluwa Shode, Alain Chirino Trujillo, Yuan Zhao, Olumide Ebenezer Ojo, Diana Cuevas Plancarte, Anna Feldman, Jing Peng

Abstract:Transformers have been shown to work well for the task of English euphemism disambiguation, in which a potentially euphemistic term (PET) is classified as euphemistic or non-euphemistic in a particular context. In this study, we expand on the task in two ways. First, we annotate PETs for vagueness, a linguistic property associated with euphemisms, and find that transformers are generally better at classifying vague PETs, suggesting linguistic differences in the data that impact performance. Second, we present novel euphemism corpora in three different languages: Yoruba, Spanish, and Mandarin Chinese. We perform euphemism disambiguation experiments in each language using multilingual transformer models mBERT and XLM-RoBERTa, establishing preliminary results from which to launch future work.

Linear Time GPs for Inferring Latent Trajectories from Neural Spike Trains

Jun 01, 2023

Matthew Dowling, Yuan Zhao, Il Memming Park

Abstract:Latent Gaussian process (GP) models are widely used in neuroscience to uncover hidden state evolutions from sequential observations, mainly in neural activity recordings. While latent GP models provide a principled and powerful solution in theory, the intractable posterior in non-conjugate settings necessitates approximate inference schemes, which may lack scalability. In this work, we propose cvHM, a general inference framework for latent GP models leveraging Hida-Matérn kernels and conjugate computation variational inference (CVI). With cvHM, we are able to perform variational inference of latent neural trajectories with linear time complexity for arbitrary likelihoods. The reparameterization of stationary kernels using Hida-Matérn GPs helps us connect the latent variable models that encode prior assumptions through dynamical systems to those that encode trajectory assumptions through GPs. In contrast to previous work, we use bidirectional information filtering, leading to a more concise implementation. Furthermore, we employ the Whittle approximate likelihood to achieve highly efficient hyperparameter learning.

* Published at ICML 2023

Real-Time Variational Method for Learning Neural Trajectory and its Dynamics

May 18, 2023

Matthew Dowling, Yuan Zhao, Il Memming Park

Abstract:Latent variable models have become instrumental in computational neuroscience for reasoning about neural computation. This has fostered the development of powerful offline algorithms for extracting latent neural trajectories from neural recordings. However, despite the potential of real time alternatives to give immediate feedback to experimentalists, and enhance experimental design, they have received markedly less attention. In this work, we introduce the exponential family variational Kalman filter (eVKF), an online recursive Bayesian method aimed at inferring latent trajectories while simultaneously learning the dynamical system generating them. eVKF works for arbitrary likelihoods and utilizes the constant base measure exponential family to model the latent state stochasticity. We derive a closed-form variational analogue to the predict step of the Kalman filter which leads to a provably tighter bound on the ELBO compared to another online variational method. We validate our method on synthetic and real-world data, and, notably, show that it achieves competitive performance

* Published at ICLR 2023

Low-frequency Image Deep Steganography: Manipulate the Frequency Distribution to Hide Secrets with Tenacious Robustness

Mar 23, 2023

Huajie Chen, Tianqing Zhu, Yuan Zhao, Bo Liu, Xin Yu, Wanlei Zhou

Abstract:Image deep steganography (IDS) is a technique that utilizes deep learning to embed a secret image invisibly into a cover image to generate a container image. However, the container images generated by convolutional neural networks (CNNs) are vulnerable to attacks that distort their high-frequency components. To address this problem, we propose a novel method called Low-frequency Image Deep Steganography (LIDS) that allows frequency distribution manipulation in the embedding process. LIDS extracts a feature map from the secret image and adds it to the cover image to yield the container image. The container image is not directly output by the CNNs, and thus, it does not contain high-frequency artifacts. The extracted feature map is regulated by a frequency loss to ensure that its frequency distribution mainly concentrates on the low-frequency domain. To further enhance robustness, an attack layer is inserted to damage the container image. The retrieval network then retrieves a recovered secret image from a damaged container image. Our experiments demonstrate that LIDS outperforms state-of-the-art methods in terms of robustness, while maintaining high fidelity and specificity. By avoiding high-frequency artifacts and manipulating the frequency distribution of the embedded feature map, LIDS achieves improved robustness against attacks that distort the high-frequency components of container images.
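The two mechanisms described above, adding a feature map to the cover and penalizing high-frequency content in that feature map, can be sketched in one dimension. This is a simplified illustration and not the LIDS loss itself: the abstract does not give the form of the frequency loss, so the code assumes a penalty equal to the share of spectral energy above a cutoff bin, and it works on short 1D signals instead of images. The names `make_container` and `high_frequency_ratio` and the `cutoff` parameter are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def high_frequency_ratio(signal, cutoff):
    """Share of spectral energy in bins whose frequency index exceeds cutoff.

    Assumption: used here as a stand-in frequency penalty (0 = all energy
    at low frequencies, 1 = all energy at high frequencies).
    """
    n = len(signal)
    energy = [m * m for m in dft_magnitudes(signal)]
    total = sum(energy)
    if total == 0:
        return 0.0
    # Bins k and n - k carry the same frequency, so index by min(k, n - k).
    high = sum(e for k, e in enumerate(energy) if min(k, n - k) > cutoff)
    return high / total


def make_container(cover, feature):
    """Container = cover + feature map, added element by element."""
    return [c + f for c, f in zip(cover, feature)]


n = 16
smooth = [math.cos(2 * math.pi * t / n) for t in range(n)]  # one slow cycle
rough = [(-1.0) ** t for t in range(n)]                     # fastest oscillation
penalty_smooth = high_frequency_ratio(smooth, cutoff=2)
penalty_rough = high_frequency_ratio(rough, cutoff=2)
container = make_container([0.5] * n, smooth)
```

A feature map like `smooth` incurs almost no penalty while one like `rough` incurs nearly the maximum, which matches the stated goal of keeping the embedded signal out of the frequency band that attacks tend to distort.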

OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection

Jan 13, 2023

Xiaomeng Chu, Jiajun Deng, Yuan Zhao, Jianmin Ji, Yu Zhang, Houqiang Li, Yanyong Zhang

Abstract:The recent trend for multi-camera 3D object detection is through the unified bird's-eye view (BEV) representation. However, directly transforming features extracted from the image-plane view to BEV inevitably results in feature distortion, especially around the objects of interest, making the objects blur into the background. To this end, we propose OA-BEV, a network that can be plugged into the BEV-based 3D object detection framework to bring out the objects by incorporating object-aware pseudo-3D features and depth features. Such features contain information about the object's position and 3D structures. First, we explicitly guide the network to learn the depth distribution by object-level supervision from each 3D object's center. Then, we select the foreground pixels by a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset to validate the merits of our proposed OA-BEV. Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score. Our codes will be published.
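The step of projecting detected foreground pixels into 3D space can be sketched with the standard pinhole camera model. This is an illustration under stated assumptions and not the OA-BEV code: it assumes a per-pixel depth value is available, that 2D detections arrive as integer pixel boxes `(u0, v0, u1, v1)`, and that the camera intrinsics `fx, fy, cx, cy` are known. The function names are hypothetical, and the pseudo-voxel encoding that follows in the paper is not shown.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with known depth to camera-frame (x, y, z).

    Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def foreground_points(depth_map, boxes, intrinsics):
    """Back-project every pixel inside the given 2D boxes.

    depth_map is indexed as depth_map[v][u]; boxes are half-open pixel
    ranges (u0, v0, u1, v1). Pixels covered by overlapping boxes are
    emitted once per box in this sketch.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for u0, v0, u1, v1 in boxes:
        for v in range(v0, v1):
            for u in range(u0, u1):
                points.append(backproject(u, v, depth_map[v][u], fx, fy, cx, cy))
    return points


# Hypothetical 2 x 3 depth map (metres) and a single detection box.
depth_map = [
    [10.0, 10.0, 12.0],
    [10.0, 11.0, 12.0],
]
intrinsics = (100.0, 100.0, 1.0, 1.0)  # fx, fy, cx, cy
points = foreground_points(depth_map, [(0, 0, 2, 2)], intrinsics)
```

Restricting the projection to detector boxes keeps background pixels out of the point set, which is the motivation the abstract gives for selecting foreground pixels first.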
