Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Abhijit Ramesh

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

May 20, 2026

Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, Tyler Sorensen

Abstract:Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama$.$cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama$.$cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

* 19 pages, 11 figures, 5 tables

Via

Access Paper or Ask Questions

Automatic Endoscopic Ultrasound Station Recognition with Limited Data

Sep 22, 2023

Abhijit Ramesh, Anantha Nandanan, Nikhil Boggavarapu, Priya Nair MD, Gilad Gressel

Abstract:Pancreatic cancer is a lethal form of cancer that significantly contributes to cancer-related deaths worldwide. Early detection is essential to improve patient prognosis and survival rates. Despite advances in medical imaging techniques, pancreatic cancer remains a challenging disease to detect. Endoscopic ultrasound (EUS) is the most effective diagnostic tool for detecting pancreatic cancer. However, it requires expert interpretation of complex ultrasound images to complete a reliable patient scan. To obtain complete imaging of the pancreas, practitioners must learn to guide the endoscope into multiple "EUS stations" (anatomical locations), which provide different views of the pancreas. This is a difficult skill to learn, involving over 225 proctored procedures with the support of an experienced doctor. We build an AI-assisted tool that utilizes deep learning techniques to identify these stations of the stomach in real time during EUS procedures. This computer-assisted diagnostic (CAD) will help train doctors more efficiently. Historically, the challenge faced in developing such a tool has been the amount of retrospective labeling required by trained clinicians. To solve this, we developed an open-source user-friendly labeling web app that streamlines the process of annotating stations during the EUS procedure with minimal effort from the clinicians. Our research shows that employing only 43 procedures with no hyperparameter fine-tuning obtained a balanced accuracy of 90%, comparable to the current state of the art. In addition, we employ Grad-CAM, a visualization technology that provides clinicians with interpretable and explainable visualizations.

Via

Access Paper or Ask Questions