Abstract:It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.




Abstract:Robot Operating System (ROS) has brought the excellent potential for automation in various fields involving production tasks, productivity enhancement, and the simplification of human operations. However, ROS highly relies on communication but lacks secure data sharing mechanisms. Securing confidential data exchange between multi-robots presents significant challenges in multi-robot interactions. In this paper, we introduce AuthROS, a secure and convenient authorization framework for ROS nodes with absolute security and high availability based on a private Ethereum network and SM algorithms. To our best knowledge, AuthROS is the first secure data-sharing framework for robots loaded with ROS. This framework can meet the requirements for immutability and security of confidential data exchanged between ROS nodes. In addition, an authority-granting and identity-verification mechanism are proposed to execute atomically to ensure trustworthy data exchange without third-party. Both an SM2 key exchange and an SM4 plaintext encryption mechanism are proposed for data transmission security. A data digest uploading scheme is also implemented to improve the efficiency of data querying and uploading on the Ethereum network. Experimental results demonstrate that it can generate a digest from 800KB encrypted data in 6.34ms. Through security analysis, AuthROS achieves secure data exchange, data operations detection, and Node Forging attack protection.