Abstract:Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
Abstract:AI systems increasingly produce fluent, correct, end-to-end outcomes. Over time, this erodes users' ability to explain, verify, or intervene. We define this divergence as the Capability-Comprehension Gap: a decoupling where assisted performance improves while users' internal models deteriorate. This paper argues that prevailing approaches to transparency, user control, literacy, and governance do not define the foundational understanding humans must retain for oversight under sustained AI delegation. To formalize this, we define the Cognitive Integrity Threshold (CIT) as the minimum comprehension required to preserve oversight, autonomy, and accountable participation under AI assistance. CIT does not require full reasoning reconstruction, nor does it constrain automation. It identifies the threshold beyond which oversight becomes procedural and contestability fails. We operatinalize CIT through three functional dimensions: (i) verification capacity, (ii) comprehension-preserving interaction, and (iii) institutional scaffolds for governance. This motivates a design and governance agenda that aligns human-AI interaction with cognitive sustainability in responsibility-critical settings.




Abstract:Multivariate time series forecasting is a challenging task because the data involves a mixture of long- and short-term patterns, with dynamic spatio-temporal dependencies among variables. Existing graph neural networks (GNN) typically model multivariate relationships with a pre-defined spatial graph or learned fixed adjacency graph. It limits the application of GNN and fails to handle the above challenges. In this paper, we propose a novel framework, namely static- and dynamic-graph learning-neural network (SDGL). The model acquires static and dynamic graph matrices from data to model long- and short-term patterns respectively. Static matric is developed to capture the fixed long-term association pattern via node embeddings, and we leverage graph regularity for controlling the quality of the learned static graph. To capture dynamic dependencies among variables, we propose dynamic graphs learning method to generate time-varying matrices based on changing node features and static node embeddings. And in the method, we integrate the learned static graph information as inductive bias to construct dynamic graphs and local spatio-temporal patterns better. Extensive experiments are conducted on two traffic datasets with extra structural information and four time series datasets, which show that our approach achieves state-of-the-art performance on almost all datasets. If the paper is accepted, I will open the source code on github.