Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ben Wiesel

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Mar 28, 2026

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar(+17 more)

Abstract:Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet

* Accepted at CVPR 2026

Via

Access Paper or Ask Questions

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Mar 19, 2026

Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz

Abstract:Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.

Via

Access Paper or Ask Questions

Space-Time Encoded Modulation for High-Fidelity Diffuse Optical Imaging

Apr 04, 2025

Ben Wiesel, Shlomi Arnon

Figure 1 for Space-Time Encoded Modulation for High-Fidelity Diffuse Optical Imaging

Figure 2 for Space-Time Encoded Modulation for High-Fidelity Diffuse Optical Imaging

Figure 3 for Space-Time Encoded Modulation for High-Fidelity Diffuse Optical Imaging

Figure 4 for Space-Time Encoded Modulation for High-Fidelity Diffuse Optical Imaging

Abstract:Diffuse optical imaging (DOI) offers valuable insights into scattering mediums, but the quest for high-resolution imaging often requires dense sampling strategies, leading to higher imaging errors and lengthy acquisition times. This work introduces Space-Time Encoded Modulation (STEM), a novel light modulation scheme enabling low-noise, high-resolution imaging with single-pixel detectors. In STEM, a laser illuminates the sample, and the transmitted light is detected using a single pixel detector. The detected image is partitioned into a two-dimensional array of sub-images, each encoded with a unique quasi-orthogonal code. These coded sub-images represent light transmission at specific locations along the sample boundary. A single-pixel detector then measures their combined transmission. By virtue of their quasi-orthogonality, the relative strength of each sub-image can be measured, enabling image formation. In this paper, we present a comprehensive mathematical description and experimental validation of the STEM method. Compared to traditional raster scanning, STEM significantly enhances imaging quality, reducing imaging errors by up to 60% and yielding a 3.5-fold increase in reconstruction contrast.

Via

Access Paper or Ask Questions

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Oct 10, 2024

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov

Figure 1 for ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Figure 2 for ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Figure 3 for ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Figure 4 for ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Abstract:Recent advancements in LLM-based web agents have introduced novel architectures and benchmarks showcasing progress in autonomous web navigation and interaction. However, most existing benchmarks prioritize effectiveness and accuracy, overlooking crucial factors like safety and trustworthiness which are essential for deploying web agents in enterprise settings. The risks of unsafe web agent behavior, such as accidentally deleting user accounts or performing unintended actions in critical business operations, pose significant barriers to widespread adoption. In this paper, we present ST-WebAgentBench, a new online benchmark specifically designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. This benchmark is grounded in a detailed framework that defines safe and trustworthy (ST) agent behavior, outlines how ST policies should be structured and introduces the Completion under Policies metric to assess agent performance. Our evaluation reveals that current SOTA agents struggle with policy adherence and cannot yet be relied upon for critical business applications. Additionally, we propose architectural principles aimed at improving policy awareness and compliance in web agents. We open-source this benchmark and invite the community to contribute, with the goal of fostering a new generation of safer, more trustworthy AI agents. All code, data, environment reproduction resources, and video demonstrations are available at https://sites.google.com/view/st-webagentbench/home.

Via

Access Paper or Ask Questions