Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Madison Lee

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Mar 16, 2026

Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels

Abstract:We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.

* 9 pages, 4 figures, 4 tables, plus 12-page supplementary. Dataset: https://huggingface.co/datasets/ibm-research/VAREX Code: https://github.com/udibarzi/varex-bench

Via

Access Paper or Ask Questions

Initial Risk Probing and Feasibility Testing of Glow: a Generative AI-Powered Dialectical Behavior Therapy Skills Coach for Substance Use Recovery and HIV Prevention

Feb 08, 2026

Liying Wang, Madison Lee, Yunzhang Jiang, Steven Chen, Kewei Sha, Yunhe Feng, Frank Wong, Lisa Hightow-Weidman, Weichao Yuwen

Abstract:Background: HIV and substance use represent interacting epidemics with shared psychological drivers - impulsivity and maladaptive coping. Dialectical behavior therapy (DBT) targets these mechanisms but faces scalability challenges. Generative artificial intelligence (GenAI) offers potential for delivering personalized DBT coaching at scale, yet rapid development has outpaced safety infrastructure. Methods: We developed Glow, a GenAI-powered DBT skills coach delivering chain and solution analysis for individuals at risk for HIV and substance use. In partnership with a Los Angeles community health organization, we conducted usability testing with clinical staff (n=6) and individuals with lived experience (n=28). Using the Helpful, Honest, and Harmless (HHH) framework, we employed user-driven adversarial testing wherein participants identified target behaviors and generated contextually realistic risk probes. We evaluated safety performance across 37 risk probe interactions. Results: Glow appropriately handled 73% of risk probes, but performance varied by agent. The solution analysis agent demonstrated 90% appropriate handling versus 44% for the chain analysis agent. Safety failures clustered around encouraging substance use and normalizing harmful behaviors. The chain analysis agent fell into an "empathy trap," providing validation that reinforced maladaptive beliefs. Additionally, 27 instances of DBT skill misinformation were identified. Conclusions: This study provides the first systematic safety evaluation of GenAI-delivered DBT coaching for HIV and substance use risk reduction. Findings reveal vulnerabilities requiring mitigation before clinical trials. The HHH framework and user-driven adversarial testing offer replicable methods for evaluating GenAI mental health interventions.

Via

Access Paper or Ask Questions

Automatic speech recognition for launch control center communication using recurrent neural networks with data augmentation and custom language model

Apr 24, 2018

Kyongsik Yun, Joseph Osborne, Madison Lee, Thomas Lu, Edward Chow

Abstract:Transcribing voice communications in NASA's launch control center is important for information utilization. However, automatic speech recognition in this environment is particularly challenging due to the lack of training data, unfamiliar words in acronyms, multiple different speakers and accents, and conversational characteristics of speaking. We used bidirectional deep recurrent neural networks to train and test speech recognition performance. We showed that data augmentation and custom language models can improve speech recognition accuracy. Transcribing communications from the launch control center will help the machine analyze information and accelerate knowledge generation.

* SPIE 2018

Via

Access Paper or Ask Questions