Abstract:While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.
Abstract:Trajectory prediction for multi-agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large-scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. To address these issues, we propose the Continual Causal Intervention (C$^{2}$INet) method for generalizable multi-agent trajectory prediction within a continual learning framework. Using variational inference, we align environment-related prior with posterior estimator of confounding factors in the latent space, thereby intervening in causal correlations that affect trajectory representation. Furthermore, we store optimal variational priors across various scenarios using a memory queue, ensuring continuous debiasing during incremental task training. The proposed C$^{2}$INet enhances adaptability to diverse tasks while preserving previous task information to prevent catastrophic forgetting. It also incorporates pruning strategies to mitigate overfitting. Comparative evaluations on three real and synthetic complex datasets against state-of-the-art methods demonstrate that our proposed method consistently achieves reliable prediction performance, effectively mitigating confounding factors unique to different scenarios. This highlights the practical value of our method for real-world applications.




Abstract:Process models are frequently used in software engineering to describe business requirements, guide software testing and control system improvement. However, traditional process modeling methods often require the participation of numerous experts, which is expensive and time-consuming. Therefore, the exploration of a more efficient and cost-effective automated modeling method has emerged as a focal point in current research. This article explores a framework for automatically generating process models with multi-agent orchestration (MAO), aiming to enhance the efficiency of process modeling and offer valuable insights for domain experts. Our framework MAO leverages large language models as the cornerstone for multi-agent, employing an innovative prompt strategy to ensure efficient collaboration among multi-agent. Specifically, 1) generation. The first phase of MAO is to generate a slightly rough process model from the text description; 2) refinement. The agents would continuously refine the initial process model through multiple rounds of dialogue; 3) reviewing. Large language models are prone to hallucination phenomena among multi-turn dialogues, so the agents need to review and repair semantic hallucinations in process models; 4) testing. The representation of process models is diverse. Consequently, the agents utilize external tools to test whether the generated process model contains format errors, namely format hallucinations, and then adjust the process model to conform to the output paradigm. The experiments demonstrate that the process models generated by our framework outperform existing methods and surpass manual modeling by 89%, 61%, 52%, and 75% on four different datasets, respectively.