Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yan Yang

Growing a Twig to Accelerate Large Vision-Language Models

Mar 18, 2025

Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu

Figure 1 for Growing a Twig to Accelerate Large Vision-Language Models

Figure 2 for Growing a Twig to Accelerate Large Vision-Language Models

Figure 3 for Growing a Twig to Accelerate Large Vision-Language Models

Figure 4 for Growing a Twig to Accelerate Large Vision-Language Models

Abstract:Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM -- a simple and general architecture by growing a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Code will be made publicly available.

* 17 pages, 8 figures

Via

Access Paper or Ask Questions

ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

Mar 10, 2025

Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li

Abstract:Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.

Via

Access Paper or Ask Questions

Spatial Transcriptomics Analysis of Spatially Dense Gene Expression Prediction

Mar 03, 2025

Ruikun Zhang, Yan Yang, Liyuan Pan

Abstract:Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction usually crop spots of interest from pathology tissue slide images, and learn a model that maps each spot to a single gene expression profile. However, it fundamentally loses spatial resolution of gene expression: 1) each spot often contains multiple cells with distinct gene expression; 2) spots are cropped at fixed resolutions, limiting the ability to predict gene expression at varying spatial scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from pathology images. Different from previous methods that map individual spots to gene expression values, we generate a dense continuous gene expression map from the pathology image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on 3 common ST datasets, while showing superior performance in predicting gene expression across multiple spatial scales. The source code will be publicly available.

Via

Access Paper or Ask Questions

A space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints

Jan 23, 2025

Yan Yang, Bin Gao, Ya-xiang Yuan

Figure 1 for A space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints

Figure 2 for A space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints

Figure 3 for A space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints

Figure 4 for A space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints

Abstract:Imposing additional constraints on low-rank optimization has garnered growing interest. However, the geometry of coupled constraints hampers the well-developed low-rank structure and makes the problem intricate. To this end, we propose a space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints. The ``space-decoupling" is reflected in several ways. We show that the tangent cone of coupled constraints is the intersection of tangent cones of each constraint. Moreover, we decouple the intertwined bounded-rank and orthogonally invariant constraints into two spaces, leading to optimization on a smooth manifold. Implementing Riemannian algorithms on this manifold is painless as long as the geometry of additional constraints is known. In addition, we unveil the equivalence between the reformulated problem and the original problem. Numerical experiments on real-world applications -- spherical data fitting, graph similarity measuring, low-rank SDP, model reduction of Markov processes, reinforcement learning, and deep learning -- validate the superiority of the proposed framework.

* 48 pages, 12 figures, 6 tables

Via

Access Paper or Ask Questions

High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model

Jan 17, 2025

Fabien H Wagner, Ricardo Dalagnol, Griffin Carter, Mayumi CM Hirye, Shivraj Gill, Le Bienfaiteur Sagang Takougoum, Samuel Favrichon, Michael Keller, Jean PHB Ometto, Lorena Alves(+12 more)

Figure 1 for High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model

Figure 2 for High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model

Figure 3 for High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model

Figure 4 for High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model

Abstract:Tree canopy height is one of the most important indicators of forest biomass, productivity, and ecosystem structure, but it is challenging to measure accurately from the ground and from space. Here, we used a U-Net model adapted for regression to map the mean tree canopy height in the Amazon forest from Planet NICFI images at ~4.78 m spatial resolution for the period 2020-2024. The U-Net model was trained using canopy height models computed from aerial LiDAR data as a reference, along with their corresponding Planet NICFI images. Predictions of tree heights on the validation sample exhibited a mean error of 3.68 m and showed relatively low systematic bias across the entire range of tree heights present in the Amazon forest. Our model successfully estimated canopy heights up to 40-50 m without much saturation, outperforming existing canopy height products from global models in this region. We determined that the Amazon forest has an average canopy height of ~22 m. Events such as logging or deforestation could be detected from changes in tree height, and encouraging results were obtained to monitor the height of regenerating forests. These findings demonstrate the potential for large-scale mapping and monitoring of tree height for old and regenerating Amazon forests using Planet NICFI imagery.

* will be submitted to the journal Remote Sensing of Environment in February 2025

Via

Access Paper or Ask Questions

Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Dec 16, 2024

Zhuyang Xie, Yan Yang, Yankai Yu, Jie Wang, Yongquan Jiang, Xiao Wu

Figure 1 for Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Figure 2 for Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Figure 3 for Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Figure 4 for Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Abstract:Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator's event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.

* Accepted at AAAI 2025

Via

Access Paper or Ask Questions

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Dec 02, 2024

Shukang Yin, Chaoyou Fu, Sirui Zhao, Yunhang Shen, Chunjiang Ge, Yan Yang, Zuwei Long, Yuhan Dai, Tong Xu, Xing Sun(+3 more)

Figure 1 for T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Figure 2 for T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Figure 3 for T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Figure 4 for T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Abstract:The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending the success to the video understanding realms. Apart from training from scratch, an efficient way is to utilize the pre-trained image-LLMs, leading to two mainstream approaches, i.e. zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches harvests an effective data augmentation method. We first make a deeper inspection of the zero-shot inference way and identify two limitations, i.e. limited generalization and lack of temporal understanding capabilities. Thus, we further investigate the fine-tuning approach and find a low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets by training with just 15% the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training with long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and curation of high-quality data. The code is released at https://github.com/xjtupanda/T2Vid.

* Project page: https://github.com/xjtupanda/T2Vid

Via

Access Paper or Ask Questions

SpikMamba: When SNN meets Mamba in Event-based Human Action Recognition

Oct 22, 2024

Jiaqi Chen, Yan Yang, Shizhuo Deng, Da Teng, Liyuan Pan

Figure 1 for SpikMamba: When SNN meets Mamba in Event-based Human Action Recognition

Figure 2 for SpikMamba: When SNN meets Mamba in Event-based Human Action Recognition

Figure 3 for SpikMamba: When SNN meets Mamba in Event-based Human Action Recognition

Figure 4 for SpikMamba: When SNN meets Mamba in Event-based Human Action Recognition

Abstract:Human action recognition (HAR) plays a key role in various applications such as video analysis, surveillance, autonomous driving, robotics, and healthcare. Most HAR algorithms are developed from RGB images, which capture detailed visual information. However, these algorithms raise concerns in privacy-sensitive environments due to the recording of identifiable features. Event cameras offer a promising solution by capturing scene brightness changes sparsely at the pixel level, without capturing full images. Moreover, event cameras have high dynamic ranges that can effectively handle scenarios with complex lighting conditions, such as low light or high contrast environments. However, using event cameras introduces challenges in modeling the spatially sparse and high temporal resolution event data for HAR. To address these issues, we propose the SpikMamba framework, which combines the energy efficiency of spiking neural networks and the long sequence modeling capability of Mamba to efficiently capture global features from spatially sparse and high a temporal resolution event data. Additionally, to improve the locality of modeling, a spiking window-based linear attention mechanism is used. Extensive experiments show that SpikMamba achieves remarkable recognition performance, surpassing the previous state-of-the-art by 1.45%, 7.22%, 0.15%, and 3.92% on the PAF, HARDVS, DVS128, and E-FAction datasets, respectively. The code is available at https://github.com/Typistchen/SpikMamba.

* 8 pages, 4 figures

Via

Access Paper or Ask Questions

LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset

Oct 21, 2024

Ruikun Zhang, Hao Yang, Yan Yang, Ying Fu, Liyuan Pan

Figure 1 for LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset

Figure 2 for LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset

Figure 3 for LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset

Figure 4 for LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset

Abstract:Image dehazing has drawn a significant attention in recent years. Learning-based methods usually require paired hazy and corresponding ground truth (haze-free) images for training. However, it is difficult to collect real-world image pairs, which prevents developments of existing methods. Although several works partially alleviate this issue by using synthetic datasets or small-scale real datasets. The haze intensity distribution bias and scene homogeneity in existing datasets limit the generalization ability of these methods, particularly when encountering images with previously unseen haze intensities. In this work, we present LMHaze, a large-scale, high-quality real-world dataset. LMHaze comprises paired hazy and haze-free images captured in diverse indoor and outdoor environments, spanning multiple scenarios and haze intensities. It contains over 5K high-resolution image pairs, surpassing the size of the biggest existing real-world dehazing dataset by over 25 times. Meanwhile, to better handle images with different haze intensities, we propose a mixture-of-experts model based on Mamba (MoE-Mamba) for dehazing, which dynamically adjusts the model parameters according to the haze intensity. Moreover, with our proposed dataset, we conduct a new large multimodal model (LMM)-based benchmark study to simulate human perception for evaluating dehazed images. Experiments demonstrate that LMHaze dataset improves the dehazing performance in real scenarios and our dehazing method provides better results compared to state-of-the-art methods.

Via

Access Paper or Ask Questions

Storyboard guided Alignment for Fine-grained Video Action Recognition

Oct 18, 2024

Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, Liu Liu

Abstract:Fine-grained video action recognition can be conceptualized as a video-text matching problem. Previous approaches often rely on global video semantics to consolidate video embeddings, which can lead to misalignment in video-text pairs due to a lack of understanding of action semantics at an atomic granularity level. To tackle this challenge, we propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics. Inspired by the concept of storyboarding, which disassembles a script into individual shots, we enhance global video semantics by generating fine-grained descriptions using a pre-trained large language model. These detailed descriptions capture common atomic actions depicted in videos. A filtering metric is proposed to select the descriptions that correspond to the atomic actions present in both the videos and the descriptions. By employing global semantics and fine-grained descriptions, we can identify key frames in videos and utilize them to aggregate embeddings, thereby making the embedding more accurate. Extensive experiments on various video action recognition datasets demonstrate superior performance of our proposed method in supervised, few-shot, and zero-shot settings.

Via

Access Paper or Ask Questions