Abstract:Presentation slides are a primary medium for data-driven reporting, yet keeping complex, analytics-style decks up to date remains labor-intensive. Existing automation methods mostly follow fixed template filling and cannot support dynamic updates for diverse, user-authored slide decks. We therefore define "Dynamic Slide Update via Natural Language Instructions on User-provided Templates" and introduce DynaSlide, a large-scale benchmark with 20,036 real-world instruction-execution triples (source slide, user instruction, target slide) grounded in a shared external database and built from business reporting slides under bring-your-own-template (BYO-template) conditions. To tackle this task, we propose SlideAgent, an agent-based framework that combines multimodal slide parsing, natural language instruction grounding, and tool-augmented reasoning for tables, charts, and textual conclusions. SlideAgent updates content while preserving layout and style, providing a strong reference baseline on DynaSlide. We further design end-to-end and component-level evaluation protocols that reveal key challenges and opportunities for future research. The dataset and code are available at https://github.com/XiaoZhou2024/SlideAgent.
Abstract:Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, the existing methods lack cross-modality clustering or excessively pursue cluster-level association, which makes it difficult to perform reliable modality-invariant features learning. To deal with this issue, we propose a Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, we design ECUL to naturally integrates intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters the neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge in terms of clustering algorithm. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experiments results on SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.