Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ying Liu

Tsinghua University

Stateful Cross-layer Vision Modulation

Feb 28, 2026

Ying Liu, Yudong Han, Kean Shi, Liyuan Pan

Abstract:Recent multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. However, existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. As a result, fine-grained details from early layers may be progressively suppressed during hierarchical abstraction. Moreover, directly introducing shallow-layer features into the language model often leads to semantic distribution mismatch with the visual feature space that the LLM's cross-attention layers were pretrained on, which typically requires additional adaptation or fine-tuning of the LLM. To address these limitations, we revisit visual representation learning from the perspective of representation evolution control and propose a cross-layer memory-modulated vision framework(SCVM). Specifically, we introduce a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies. We further design a layer-wise feedback modulation mechanism that refreshes token representations at each layer based on the accumulated memory, thereby structurally regulating the representation evolution trajectory. In addition, we incorporate an auxiliary semantic alignment objective that explicitly supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information. Experimental results on multiple visual question answering and hallucination evaluation benchmarks demonstrate that SCVM achieves consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.

Via

Access Paper or Ask Questions

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Feb 06, 2026

Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi(+3 more)

Abstract:Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.

* 20 pages, 8 figures. Technical report

Via

Access Paper or Ask Questions

Gesture Recognition from body-Worn RFID under Missing Data

Jan 22, 2026

Sahar Golipoor, Richard T. Brophy, Ying Liu, Reza Ghazalian, Stephan Sigg

Abstract:We explore hand-gesture recognition through the use of passive body-worn reflective tags. A data processing pipeline is proposed to address the issue of missing data. Specifically, missing information is recovered through linear and exponential interpolation and extrapolation. Furthermore, imputation and proximity-based inference are employed. We represent tags as nodes in a temporal graph, with edges formed based on correlations between received signal strength (RSS) and phase values across successive timestamps, and we train a graph-based convolutional neural network that exploits graph-based self-attention. The system outperforms state-of-the-art methods with an accuracy of 98.13% for the recognition of 21 gestures. We achieve 89.28% accuracy under leave-one-person-out cross-validation. We further investigate the contribution of various body locations on the recognition accuracy. Removing tags from the arms reduces accuracy by more than 10%, while removing the wrist tag only reduces accuracy by around 2%. Therefore, tag placements on the arms are more expressive for gesture recognition than on the wrist.

Via

Access Paper or Ask Questions

SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Dec 25, 2025

Zhiyuan Liu, Daocheng Fu, Pinlong Cai, Lening Wang, Ying Liu, Yilong Ren, Botian Shi, Jianqiang Wang

Figure 1 for SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Figure 2 for SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Figure 3 for SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Figure 4 for SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Abstract:High-fidelity and controllable 3D simulation is essential for addressing the long-tail data scarcity in Autonomous Driving (AD), yet existing methods struggle to simultaneously achieve photorealistic rendering and interactive traffic editing. Current approaches often falter in large-angle novel view synthesis and suffer from geometric or lighting artifacts during asset manipulation. To address these challenges, we propose SymDrive, a unified diffusion-based framework capable of joint high-quality rendering and scene editing. We introduce a Symmetric Auto-regressive Online Restoration paradigm, which constructs paired symmetric views to recover fine-grained details via a ground-truth-guided dual-view formulation and utilizes an auto-regressive strategy for consistent lateral view generation. Furthermore, we leverage this restoration capability to enable a training-free harmonization mechanism, treating vehicle insertion as context-aware inpainting to ensure seamless lighting and shadow consistency. Extensive experiments demonstrate that SymDrive achieves state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion.

Via

Access Paper or Ask Questions

Bridging the Gap Between Bayesian Deep Learning and Ensemble Weather Forecasts

Nov 18, 2025

Xinlei Xiong, Wenbo Hu, Shuxun Zhou, Kaifeng Bi, Lingxi Xie, Ying Liu, Richang Hong, Qi Tian

Abstract:Weather forecasting is fundamentally challenged by the chaotic nature of the atmosphere, necessitating probabilistic approaches to quantify uncertainty. While traditional ensemble prediction (EPS) addresses this through computationally intensive simulations, recent advances in Bayesian Deep Learning (BDL) offer a promising but often disconnected alternative. We bridge these paradigms through a unified hybrid Bayesian Deep Learning framework for ensemble weather forecasting that explicitly decomposes predictive uncertainty into epistemic and aleatoric components, learned via variational inference and a physics-informed stochastic perturbation scheme modeling flow-dependent atmospheric dynamics, respectively. We further establish a unified theoretical framework that rigorously connects BDL and EPS, providing formal theorems that decompose total predictive uncertainty into epistemic and aleatoric components under the hybrid BDL framework. We validate our framework on the large-scale 40-year ERA5 reanalysis dataset (1979-2019) with 0.25° spatial resolution. Experimental results show that our method not only improves forecast accuracy and yields better-calibrated uncertainty quantification but also achieves superior computational efficiency compared to state-of-the-art probabilistic diffusion models. We commit to making our code open-source upon acceptance of this paper.

Via

Access Paper or Ask Questions

RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Nov 13, 2025

Wenzhe He, Xiaojun Chen, Wentang Chen, Hongyu Wang, Ying Liu, Ruihui Li

Figure 1 for RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Figure 2 for RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Figure 3 for RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Figure 4 for RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Abstract:Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce a RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise feature of the point cloud is progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18$\times$ and improves memory efficiency by 1.37$\times$ compared to state-of-the-art methods PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene dataset, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).

* Proc. 33rd ACM Int. Conf. Multimedia (MM '25), Dublin, Ireland, 2025, pp. 161-170
* 13 pages, 8 figures, published to ACM MM

Via

Access Paper or Ask Questions

Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark

Mar 20, 2025

Ying Liu, Yijing Hua, Haojiang Chai, Yanbo Wang, TengQi Ye

Abstract:Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors can not make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.

* 8 pages, 4 figures

Via

Access Paper or Ask Questions

Exploring the Small World of Word Embeddings: A Comparative Study on Conceptual Spaces from LLMs of Different Scales

Feb 17, 2025

Zhu Liu, Ying Liu, KangYang Luo, Cunliang Kong, Maosong Sun

Abstract:A conceptual space represents concepts as nodes and semantic relatedness as edges. Word embeddings, combined with a similarity metric, provide an effective approach to constructing such a space. Typically, embeddings are derived from traditional distributed models or encoder-only pretrained models, whose objectives directly capture the meaning of the current token. In contrast, decoder-only models, including large language models (LLMs), predict the next token, making their embeddings less directly tied to the current token's semantics. Moreover, comparative studies on LLMs of different scales remain underexplored. In this paper, we construct a conceptual space using word embeddings from LLMs of varying scales and comparatively analyze their properties. We establish a network based on a linguistic typology-inspired connectivity hypothesis, examine global statistical properties, and compare LLMs of varying scales. Locally, we analyze conceptual pairs, WordNet relations, and a cross-lingual semantic network for qualitative words. Our results indicate that the constructed space exhibits small-world properties, characterized by a high clustering coefficient and short path lengths. Larger LLMs generate more intricate spaces, with longer paths reflecting richer relational structures and connections. Furthermore, the network serves as an efficient bridge for cross-lingual semantic mapping.

* Paper under review

Via

Access Paper or Ask Questions

Semi-Implicit Neural Ordinary Differential Equations

Dec 15, 2024

Hong Zhang, Ying Liu, Romit Maulik

Abstract:Classical neural ODEs trained with explicit methods are intrinsically limited by stability, crippling their efficiency and robustness for stiff learning problems that are common in graph learning and scientific machine learning. We present a semi-implicit neural ODE approach that exploits the partitionable structure of the underlying dynamics. Our technique leads to an implicit neural network with significant computational advantages over existing approaches because of enhanced stability and efficient linear solves during time integration. We show that our approach outperforms existing approaches on a variety of applications including graph classification and learning complex dynamical systems. We also demonstrate that our approach can train challenging neural ODEs where both explicit methods and fully implicit methods are intractable.

Via

Access Paper or Ask Questions

A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs

Dec 02, 2024

Zhu Liu, Cunliang Kong, Ying Liu, Maosong Sun

Figure 1 for A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs

Figure 2 for A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs

Figure 3 for A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs

Figure 4 for A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs

Abstract:Semantic map models (SMMs) construct a network-like conceptual space from cross-linguistic instances or forms, based on the connectivity hypothesis. This approach has been widely used to represent similarity and entailment relationships in cross-linguistic concept comparisons. However, most SMMs are manually built by human experts using bottom-up procedures, which are often labor-intensive and time-consuming. In this paper, we propose a novel graph-based algorithm that automatically generates conceptual spaces and SMMs in a top-down manner. The algorithm begins by creating a dense graph, which is subsequently pruned into maximum spanning trees, selected according to metrics we propose. These evaluation metrics include both intrinsic and extrinsic measures, considering factors such as network structure and the trade-off between precision and coverage. A case study on cross-linguistic supplementary adverbs demonstrates the effectiveness and efficiency of our model compared to human annotations and other automated methods. The tool is available at \url{https://github.com/RyanLiut/SemanticMapModel}.

* Paper under review

Via

Access Paper or Ask Questions