Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Matvey Skripkin

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Jun 17, 2026

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina(+3 more)

Abstract:Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

* Project page: https://tttonyalpha.github.io/act2answer/

Via

Access Paper or Ask Questions

Simple Vision-Language Math Reasoning via Rendered Text

Nov 12, 2025

Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov

Figure 1 for Simple Vision-Language Math Reasoning via Rendered Text

Figure 2 for Simple Vision-Language Math Reasoning via Rendered Text

Figure 3 for Simple Vision-Language Math Reasoning via Rendered Text

Figure 4 for Simple Vision-Language Math Reasoning via Rendered Text

Abstract:We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence - showing gains on tasks such as MMMU, ChartQA, and DocVQA of up to 20%.

Via

Access Paper or Ask Questions

Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Jun 05, 2025

Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets

Abstract:Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as reward signals-eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 8 samples per question and 4 training epochs, RLSC improves accuracy by +20.10% on AIME2024, +49.40% on MATH500, and +52.50% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models with minimal supervision.

Via

Access Paper or Ask Questions

MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Feb 21, 2025

Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

Figure 1 for MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Figure 2 for MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Figure 3 for MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Figure 4 for MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Abstract:Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through the specific adapter. While existing approaches commonly rely on a single pre-trained vision encoder, there is a great variability of specialized encoders that can boost model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders) a simple yet effective approach to leverage multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.

* 10 pages, 6 figures, 4 tables

Via

Access Paper or Ask Questions

UniDet3D: Multi-dataset Indoor 3D Object Detection

Sep 06, 2024

Maksim Kolodiazhnyi, Anna Vorontsova, Matvey Skripkin, Danila Rukhovich, Anton Konushin

Figure 1 for UniDet3D: Multi-dataset Indoor 3D Object Detection

Figure 2 for UniDet3D: Multi-dataset Indoor 3D Object Detection

Figure 3 for UniDet3D: Multi-dataset Indoor 3D Object Detection

Figure 4 for UniDet3D: Multi-dataset Indoor 3D Object Detection

Abstract:Growing customer demand for smart solutions in robotics and augmented reality has attracted considerable attention to 3D object detection from point clouds. Yet, existing indoor datasets taken individually are too small and insufficiently diverse to train a powerful and general 3D object detection model. In the meantime, more general approaches utilizing foundation models are still inferior in quality to those based on supervised training for a specific task. In this work, we propose \ours{}, a simple yet effective 3D object detection model, which is trained on a mixture of indoor datasets and is capable of working in various indoor environments. By unifying different label spaces, \ours{} enables learning a strong representation across multiple datasets through a supervised joint training scheme. The proposed network architecture is built upon a vanilla transformer encoder, making it easy to run, customize and extend the prediction pipeline for practical use. Extensive experiments demonstrate that \ours{} obtains significant gains over existing 3D object detection methods in 6 indoor benchmarks: ScanNet (+1.1 mAP50), ARKitScenes (+19.4 mAP25), S3DIS (+9.1 mAP50), MultiScan (+9.3 mAP50), 3RScan (+3.2 mAP50), and ScanNet++ (+2.7 mAP50). Code is available at https://github.com/filapro/unidet3d .

Via

Access Paper or Ask Questions

OmniFusion Technical Report

Apr 09, 2024

Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

Figure 1 for OmniFusion Technical Report

Figure 2 for OmniFusion Technical Report

Figure 3 for OmniFusion Technical Report

Figure 4 for OmniFusion Technical Report

Abstract:Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLM). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.), and their fusing approach, image encoding method (whole image or tiles encoding) and two 7B LLMs (the proprietary one and open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup in terms of different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also propose a variety of situations, where OmniFusion provides highly-detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equations recognition, etc. Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.

* 17 pages, 4 figures, 9 tables, 2 appendices

Via

Access Paper or Ask Questions