Ping Jian

Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

Jan 07, 2026

Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian

Abstract: Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or from the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench .
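
As a rough illustration of what a "coordinate-aware textual description" might look like, the sketch below renders a tiny scene as text and pairs it with a spatial question. The `SceneObject` class, the sentence template, and the observer-centred axes are all assumptions made for this example; the abstract does not specify SiT-Bench's actual format.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Hypothetical scene record; not the SiT-Bench schema.
    name: str
    x: float  # metres to the right of the observer (assumed convention)
    y: float  # metres in front of the observer (assumed convention)


def describe_scene(objects):
    """Render each object as one sentence that states its coordinates."""
    lines = []
    for obj in objects:
        lines.append(f"The {obj.name} is at (x={obj.x:.1f}, y={obj.y:.1f}) metres.")
    return "\n".join(lines)


def left_of(a, b):
    """Reference answer from the observer's viewpoint: is a left of b?"""
    return a.x < b.x


scene = [SceneObject("chair", -1.0, 2.0), SceneObject("table", 0.5, 2.5)]
prompt = describe_scene(scene) + "\nQuestion: Is the chair to the left of the table?"
answer = "yes" if left_of(scene[0], scene[1]) else "no"
print(prompt)
print("Reference answer:", answer)
```

A text-only model given `prompt` has to compare the stated coordinates symbolically, which is the kind of reasoning the benchmark is described as probing.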

Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression

Nov 18, 2025

Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian

Abstract: Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and an output-stage misalignment where discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.

TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting

Jun 23, 2025

Zhongbin Guo, Yuhao Wang, Ping Jian, Xinyue Chen, Wei Peng, Ertai E

Abstract: Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.

* Submitted to the 33rd ACM International Conference on Multimedia. Our dataset can be found at https://huggingface.co/datasets/IceInPot/TAMMs

Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

Sep 24, 2024

Chenxu Wang, Ping Jian, Zhen Yang

Abstract: Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Additionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual contexts. To help the model better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at https://github.com/lalalamdbf/TPReasoner.

Prompt-based Logical Semantics Enhancement for Implicit Discourse Relation Recognition

Nov 01, 2023

Chenxu Wang, Ping Jian, Mu Huang

Abstract: Implicit Discourse Relation Recognition (IDRR), which infers discourse relations without the help of explicit connectives, is still a crucial and challenging task for discourse parsing. Recent works tend to exploit the hierarchical structure information from the annotated senses, demonstrating that enhanced discourse relation representations can be obtained by integrating the sense hierarchy. Nevertheless, the performance and robustness of IDRR are significantly constrained by the availability of annotated data. Fortunately, there is a wealth of unannotated utterances with explicit connectives, which can be utilized to acquire enriched discourse relation features. In light of this motivation, we propose a Prompt-based Logical Semantics Enhancement (PLSE) method for IDRR. Essentially, our method seamlessly injects knowledge relevant to discourse relations into pre-trained language models through prompt-based connective prediction. Furthermore, considering that prompt-based connective prediction exhibits local dependencies due to the deficiency of the masked language model (MLM) in capturing global semantics, we design a novel self-supervised learning objective based on mutual information maximization to derive enhanced representations of logical semantics for IDRR. Experimental results on the PDTB 2.0 and CoNLL16 datasets demonstrate that our method achieves outstanding and consistent performance against the current state-of-the-art models.

* This paper is accepted by the EMNLP 2023 Main Conference

Teach model to answer questions after comprehending the document

Jul 18, 2023

Ruiqing Sun, Ping Jian

Abstract: Multi-choice Machine Reading Comprehension (MRC) is a challenging extension of Natural Language Processing (NLP) that requires the ability to comprehend the semantics and logical relationships between entities in a given text. The MRC task has traditionally been viewed as a process of answering questions based on the given text. This single-stage approach has often led the network to concentrate on generating the correct answer, potentially neglecting the comprehension of the text itself. As a result, many prevalent models have faced challenges in performing well on this task when dealing with longer texts. In this paper, we propose a two-stage knowledge distillation method that teaches the model to better comprehend the document by dividing the MRC task into two separate stages. Our experimental results show that the student model, when equipped with our method, achieves significant improvements, demonstrating the effectiveness of our method.

MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection

Oct 10, 2020

Yingxue Zhang, Fandong Meng, Peng Li, Ping Jian, Jie Zhou

Abstract: As conventional answer selection (AS) methods generally match the question with each candidate answer independently, they suffer from a lack of matching information between the question and the candidate. To address this problem, we propose a novel reinforcement learning (RL) based multi-step ranking model, named MS-Ranker, which accumulates information from potentially correct candidate answers as extra evidence for matching the question with a candidate. Specifically, we explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism. Moreover, as we use a listwise ranking reward, our model learns to pay more attention to the overall performance. Experiments on two benchmarks, namely WikiQA and SemEval-2016 CQA, show that our model significantly outperforms existing methods that do not rely on external resources.
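
The idea of updating evidence through a gate can be sketched in a few lines. This is only a generic gated blend under stated assumptions: the `update_evidence` function, the use of a sigmoid over a scalar correctness score, and the convex-combination form are all hypothetical and are not taken from the paper, whose gate is presumably learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def update_evidence(evidence, candidate, correctness):
    """Blend a candidate vector into the running evidence.

    `correctness` is an assumed scalar score for how likely the candidate
    is to be a correct answer; a higher score opens the gate wider.
    """
    gate = sigmoid(correctness)
    return [(1.0 - gate) * e + gate * c for e, c in zip(evidence, candidate)]


evidence = [0.0, 0.0]
# Each step pairs a toy candidate vector with a toy correctness score.
steps = [([2.0, 4.0], 0.0), ([9.0, 9.0], -20.0)]
for candidate, score in steps:
    evidence = update_evidence(evidence, candidate, score)
print(evidence)
```

With a score of 0 the gate is exactly 0.5, so the first candidate is averaged into the evidence; the second candidate has a very low score and leaves the evidence almost unchanged.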

Tag Recommendation by Word-Level Tag Sequence Modeling

Nov 30, 2019

Xuewen Shi, Heyan Huang, Shuyang Zhao, Ping Jian, Yi-Kun Tang

Abstract: In this paper, we transform tag recommendation into a word-based text generation problem and introduce a sequence-to-sequence model. The model inherits the advantages of an LSTM-based encoder for sequential modeling and an attention-based decoder with local positional encodings for learning relations globally. Experimental results on Zhihu datasets show that the proposed model outperforms other state-of-the-art text classification based methods.

* This is a full length version of the paper in DASFAA 2019

Neural Chinese Word Segmentation as Sequence to Sequence Translation

Nov 29, 2019

Xuewen Shi, Heyan Huang, Ping Jian, Yuhang Guo, Xiaochi Wei, Yi-Kun Tang

Abstract: Recently, Chinese word segmentation (CWS) methods using neural networks have made impressive progress. Most of them regard CWS as a sequence labeling problem and construct models based on local features rather than considering global information of the input sequence. In this paper, we cast CWS as a sequence translation problem and propose a novel sequence-to-sequence CWS model with an attention-based encoder-decoder framework. The model captures the global information from the input and directly outputs the segmented sequence. It can also tackle other NLP tasks jointly with CWS in an end-to-end mode. Experiments on Weibo, PKU and MSRA benchmark datasets show that our approach achieves competitive performance compared with state-of-the-art methods. Meanwhile, we successfully applied our proposed model to jointly learning CWS and Chinese spelling correction, which demonstrates its applicability to multi-task fusion.
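
To make the "segmentation as translation" framing concrete, the sketch below builds a source/target pair for a sequence-to-sequence model and decodes a target sequence back into words. The `<sep>` boundary token and the helper names are assumptions for illustration; the abstract does not state the paper's actual output vocabulary.

```python
SEP = "<sep>"  # hypothetical word-boundary token, not taken from the paper


def to_source(sentence):
    """Source side: the unsegmented sentence as a list of characters."""
    return list(sentence)


def to_target(words):
    """Target side: the same characters with a boundary token between words."""
    target = []
    for i, word in enumerate(words):
        if i > 0:
            target.append(SEP)
        target.extend(word)
    return target


def decode(target):
    """Turn a generated target sequence back into a list of words."""
    words, current = [], []
    for token in target:
        if token == SEP:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(token)
    if current:
        words.append("".join(current))
    return words


words = ["我", "喜欢", "北京"]
source = to_source("".join(words))
target = to_target(words)
print(source)
print(target)
print(decode(target))
```

Because the decoder emits characters as well as boundaries, the same target format could in principle carry corrected characters, which is one way to read the abstract's claim about joint learning with spelling correction.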

* In proceedings of SMP 2017 (Chinese National Conference on Social Media Processing)

Semantic Graph Convolutional Network for Implicit Discourse Relation Classification

Oct 21, 2019

Yingxue Zhang, Ping Jian, Fandong Meng, Ruiying Geng, Wei Cheng, Jie Zhou

Abstract: Implicit discourse relation classification is of great importance for discourse parsing, but remains a challenging problem due to the absence of explicit discourse connectives communicating these relations. Modeling the semantic interactions between the two arguments of a relation has proven useful for detecting implicit discourse relations. However, most previous approaches model such semantic interactions at a shallow interaction level, which is inadequate for capturing enough semantic information. In this paper, we propose a novel and effective Semantic Graph Convolutional Network (SGCN) to enhance the modeling of inter-argument semantics on a deeper interaction level for implicit discourse relation classification. We first build an interaction graph over representations of the two arguments, and then automatically extract in-depth semantic interactive information through graph convolution. Experimental results on the English corpus PDTB and the Chinese corpus CDTB both demonstrate the superiority of our model over previous state-of-the-art systems.
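
The two steps named in the abstract (build an interaction graph over the two arguments, then apply graph convolution) can be sketched in plain Python. Everything specific here is an assumption: edge weights from clipped dot products, self-loops, row normalization, and the omission of a learned weight matrix are simplifications for illustration, not the SGCN design.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def interaction_graph(arg1, arg2):
    """Connect every token vector in arg1 to every token vector in arg2."""
    nodes = arg1 + arg2
    n1, n = len(arg1), len(nodes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loop keeps each node's own features
    for i in range(n1):
        for j in range(n1, n):
            weight = max(0.0, dot(nodes[i], nodes[j]))  # assumed edge weight
            adj[i][j] = adj[j][i] = weight
    return nodes, adj


def graph_conv(nodes, adj):
    """One propagation step: row-normalized neighbour average, then ReLU."""
    dim = len(nodes[0])
    out = []
    for row in adj:
        total = sum(row)  # at least 1.0 because of the self-loop
        agg = [sum(row[j] * nodes[j][d] for j in range(len(nodes))) / total
               for d in range(dim)]
        out.append([max(0.0, v) for v in agg])
    return out


arg1 = [[1.0, 1.0]]              # toy token vectors for the first argument
arg2 = [[1.0, 0.0], [0.0, 1.0]]  # toy token vectors for the second argument
nodes, adj = interaction_graph(arg1, arg2)
updated = graph_conv(nodes, adj)
print(updated)
```

After one step, each token's vector mixes in features from the tokens of the other argument that it is connected to, which is the cross-argument interaction the abstract refers to.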

* 8 pages, 4 figures
