Abstract:Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remain challenges due to a lack of in-depth comprehension of model behavior as well as the need for expertise and iterative efforts to explore data patterns. Given the achievement of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruption on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework designed to evaluate VL model robustness against various corruption types and identify underperformed samples to guide the development of effective DA strategies. Grounded in the literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and corresponding data slice. Unlike conventional works, VisMoDAl enables users to reason about the effects of corruption on VL models, facilitating both model behavior understanding and DA strategy formulation. The utility of our system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.
Abstract:With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.