Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dheeraj Mekala

The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes

Jan 15, 2026

Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Abhinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey(+1295 more)

Abstract:This document consolidates publicly reported technical details about Metas Llama 4 model family. It summarizes (i) released variants (Scout and Maverick) and the broader herd context including the previewed Behemoth teacher model, (ii) architectural characteristics beyond a high-level MoE description covering routed/shared-expert structure, early-fusion multimodality, and long-context design elements reported for Scout (iRoPE and length generalization strategies), (iii) training disclosures spanning pre-training, mid-training for long-context extension, and post-training methodology (lightweight SFT, online RL, and lightweight DPO) as described in release materials, (iv) developer-reported benchmark results for both base and instruction-tuned checkpoints, and (v) practical deployment constraints observed across major serving environments, including provider-specific context limits and quantization packaging. The manuscript also summarizes licensing obligations relevant to redistribution and derivative naming, and reviews publicly described safeguards and evaluation practices. The goal is to provide a compact technical reference for researchers and practitioners who need precise, source-backed facts about Llama 4.

* 15 pages

Via

Access Paper or Ask Questions

When is the consistent prediction likely to be a correct prediction?

Jul 08, 2024

Alex Nguyen, Dheeraj Mekala, Chengyu Dong, Jingbo Shang

Figure 1 for When is the consistent prediction likely to be a correct prediction?

Figure 2 for When is the consistent prediction likely to be a correct prediction?

Figure 3 for When is the consistent prediction likely to be a correct prediction?

Figure 4 for When is the consistent prediction likely to be a correct prediction?

Abstract:Self-consistency (Wang et al., 2023) suggests that the most consistent answer obtained through large language models (LLMs) is more likely to be correct. In this paper, we challenge this argument and propose a nuanced correction. Our observations indicate that consistent answers derived through more computation i.e. longer reasoning texts, rather than simply the most consistent answer across all outputs, are more likely to be correct. This is predominantly because we demonstrate that LLMs can autonomously produce chain-of-thought (CoT) style reasoning with no custom prompts merely while generating longer responses, which lead to consistent predictions that are more accurate. In the zero-shot setting, by sampling Mixtral-8x7B model multiple times and considering longer responses, we achieve 86% of its self-consistency performance obtained through zero-shot CoT prompting on the GSM8K and MultiArith datasets. Finally, we demonstrate that the probability of LLMs generating a longer response is quite low, highlighting the need for decoding strategies conditioned on output length.

Via

Access Paper or Ask Questions

DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

Mar 30, 2024

Alex Nguyen, Zilong Wang, Jingbo Shang, Dheeraj Mekala

Figure 1 for DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

Figure 2 for DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

Figure 3 for DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

Figure 4 for DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

Abstract:The application of natural language processing models to PDF documents is pivotal for various business applications yet the challenge of training models for this purpose persists in businesses due to specific hurdles. These include the complexity of working with PDF formats that necessitate parsing text and layout information for curating training data and the lack of privacy-preserving annotation tools. This paper introduces DOCMASTER, a unified platform designed for annotating PDF documents, model training, and inference, tailored to document question-answering. The annotation interface enables users to input questions and highlight text spans within the PDF file as answers, saving layout information and text spans accordingly. Furthermore, DOCMASTER supports both state-of-the-art layout-aware and text models for comprehensive training purposes. Importantly, as annotations, training, and inference occur on-device, it also safeguards privacy. The platform has been instrumental in driving several research prototypes concerning document analysis such as the AI assistant utilized by University of California San Diego's (UCSD) International Services and Engagement Office (ISEO) for processing a substantial volume of PDF documents.

Via

Access Paper or Ask Questions

TOOLVERIFIER: Generalization to New Tools via Self-Verification

Feb 21, 2024

Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, Jane Dwivedi-Yu

Figure 1 for TOOLVERIFIER: Generalization to New Tools via Self-Verification

Figure 2 for TOOLVERIFIER: Generalization to New Tools via Self-Verification

Figure 3 for TOOLVERIFIER: Generalization to New Tools via Self-Verification

Figure 4 for TOOLVERIFIER: Generalization to New Tools via Self-Verification

Abstract:Teaching language models to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, language models still struggle with learning how to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during (1) tool selection; and (2) parameter generation. We construct synthetic, high-quality, self-generated data for this goal using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced.

Via

Access Paper or Ask Questions

MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization

Feb 18, 2024

Yasaman Jafari, Dheeraj Mekala, Rose Yu, Taylor Berg-Kirkpatrick

Abstract:RL-based techniques can be used to search for prompts that when fed into a target language model maximize a set of user-specified reward functions. However, in many target applications, the natural reward functions are in tension with one another -- for example, content preservation vs. style matching in style transfer tasks. Current techniques focus on maximizing the average of reward functions, which does not necessarily lead to prompts that achieve balance across rewards -- an issue that has been well-studied in the multi-objective and robust optimization literature. In this paper, we adapt several techniques for multi-objective optimization to RL-based discrete prompt optimization -- two that consider volume of the Pareto reward surface, and another that chooses an update direction that benefits all rewards simultaneously. We conduct an empirical analysis of these methods on two NLP tasks: style transfer and machine translation, each using three competing reward functions. Our experiments demonstrate that multi-objective methods that directly optimize volume perform better and achieve a better balance of all rewards than those that attempt to find monotonic update directions.

Via

Access Paper or Ask Questions

Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Feb 16, 2024

Dheeraj Mekala, Alex Nguyen, Jingbo Shang

Figure 1 for Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Figure 2 for Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Figure 3 for Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Figure 4 for Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Abstract:Instruction-tuning language models has become a crucial step in aligning them for general use. Typically, this process involves extensive training on large datasets, incurring high training costs. In this paper, we introduce a novel training data selection based on the learning percentage of the samples. We assert that current language models possess the capability to autonomously select high-quality training data, leading to comparable or improved performance compared to training on the entire dataset. Our experiments span different-sized models, revealing that this characteristic holds for models ranging from 1B (small) to 13B (large) in size. Moreover, we demonstrate an interesting finding that the data hardness transfers across model sizes, and a smaller 350M model can effectively curate high-quality training data with hard samples for a larger 13B model, resulting in an equally or superior instruction-tuned model compared to training on the complete dataset. Utilizing open-sourced OPT and Llama-2 models up to 13B in size, two publicly available instruction-tuning training datasets and evaluated by both automatic metrics & humans, our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.

Via

Access Paper or Ask Questions

DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase

Nov 06, 2023

Dawei Li, Yaxuan Li, Dheeraj Mekala, Shuyao Li, Yulin wang, Xueqi Wang, William Hogan, Jingbo Shang

Figure 1 for DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase

Figure 2 for DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase

Figure 3 for DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase

Figure 4 for DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase

Abstract:In-Context Learning (ICL) combined with pre-trained large language models has achieved promising results on various NLP tasks. However, ICL requires high-quality annotated demonstrations which might not be available in real-world scenarios. To overcome this limitation, we propose \textbf{D}ata \textbf{A}ugmentation for \textbf{I}n-Context \textbf{L}earning (\textbf{DAIL}). DAIL leverages the intuition that large language models are more familiar with the content generated by themselves. It first utilizes the language model to generate paraphrases of the test sample and employs majority voting to determine the final result based on individual predictions. Our extensive empirical evaluation shows that DAIL outperforms the standard ICL method and other ensemble-based methods in the low-resource scenario. Additionally, we explore the use of voting consistency as a confidence score of the model when the logits of predictions are inaccessible. We believe our work will stimulate further research on ICL in low-resource settings.

* Course project for DSC 253 (Advanced Data-Driven Text Mining) at UCSD

Via

Access Paper or Ask Questions

SELFOOD: Self-Supervised Out-Of-Distribution Detection via Learning to Rank

May 24, 2023

Dheeraj Mekala, Adithya Samavedhi, Chengyu Dong, Jingbo Shang

Figure 1 for SELFOOD: Self-Supervised Out-Of-Distribution Detection via Learning to Rank

Figure 2 for SELFOOD: Self-Supervised Out-Of-Distribution Detection via Learning to Rank

Figure 3 for SELFOOD: Self-Supervised Out-Of-Distribution Detection via Learning to Rank

Figure 4 for SELFOOD: Self-Supervised Out-Of-Distribution Detection via Learning to Rank

Abstract:Deep neural classifiers trained with cross-entropy loss (CE loss) often suffer from poor calibration, necessitating the task of out-of-distribution (OOD) detection. Traditional supervised OOD detection methods require expensive manual annotation of in-distribution and OOD samples. To address the annotation bottleneck, we introduce SELFOOD, a self-supervised OOD detection method that requires only in-distribution samples as supervision. We cast OOD detection as an inter-document intra-label (IDIL) ranking problem and train the classifier with our pairwise ranking loss, referred to as IDIL loss. Specifically, given a set of in-distribution documents and their labels, for each label, we train the classifier to rank the softmax scores of documents belonging to that label to be higher than the scores of documents that belong to other labels. Unlike CE loss, our IDIL loss function reaches zero when the desired confidence ranking is achieved and gradients are backpropagated to decrease probabilities associated with incorrect labels rather than continuously increasing the probability of the correct label. Extensive experiments with several classifiers on multiple classification datasets demonstrate the effectiveness of our method in both coarse- and fine-grained settings.

Via

Access Paper or Ask Questions

A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

May 22, 2023

Zihan Wang, Tianle Wang, Dheeraj Mekala, Jingbo Shang

Figure 1 for A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

Figure 2 for A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

Figure 3 for A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

Figure 4 for A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

Abstract:Etremely Weakly Supervised Text Classification (XWS-TC) refers to text classification based on minimal high-level human guidance, such as a few label-indicative seed words or classification instructions. There are two mainstream approaches for XWS-TC, however, never being rigorously compared: (1) training classifiers based on pseudo-labels generated by (softly) matching seed words (SEED) and (2) prompting (and calibrating) language models using classification instruction (and raw texts) to decode label words (PROMPT). This paper presents the first XWS-TC benchmark to compare the two approaches on fair grounds, where the datasets, supervisions, and hyperparameter choices are standardized across methods. Our benchmarking results suggest that (1) Both SEED and PROMPT approaches are competitive and there is no clear winner; (2) SEED is empirically more tolerant than PROMPT to human guidance (e.g., seed words, classification instructions, and label words) changes; (3) SEED is empirically more selective than PROMPT to the pre-trained language models; (4) Recent SEED and PROMPT methods have close connections and a clustering post-processing step based on raw in-domain texts is a strong performance booster to both. We hope this benchmark serves as a guideline in selecting XWS-TC methods in different scenarios and stimulate interest in developing guidance- and model-robust XWS-TC methods. We release the repo at https://github.com/ZihanWangKi/x-TC.

* ACL 2023 Findings

Via

Access Paper or Ask Questions

ZEROTOP: Zero-Shot Task-Oriented Semantic Parsing using Large Language Models

Dec 21, 2022

Dheeraj Mekala, Jason Wolfe, Subhro Roy

Figure 1 for ZEROTOP: Zero-Shot Task-Oriented Semantic Parsing using Large Language Models

Figure 2 for ZEROTOP: Zero-Shot Task-Oriented Semantic Parsing using Large Language Models

Figure 3 for ZEROTOP: Zero-Shot Task-Oriented Semantic Parsing using Large Language Models

Figure 4 for ZEROTOP: Zero-Shot Task-Oriented Semantic Parsing using Large Language Models

Abstract:We explore the use of large language models (LLMs) for zero-shot semantic parsing. Semantic parsing involves mapping natural language utterances to task-specific meaning representations. Language models are generally trained on the publicly available text and code and cannot be expected to directly generalize to domain-specific parsing tasks in a zero-shot setting. In this work, we propose ZEROTOP, a zero-shot task-oriented parsing method that decomposes a semantic parsing problem into a set of abstractive and extractive question-answering (QA) problems, enabling us to leverage the ability of LLMs to zero-shot answer reading comprehension questions. For each utterance, we prompt the LLM with questions corresponding to its top-level intent and a set of slots and use the LLM generations to construct the target meaning representation. We observe that current LLMs fail to detect unanswerable questions; and as a result, cannot handle questions corresponding to missing slots. To address this problem, we fine-tune a language model on public QA datasets using synthetic negative samples. Experimental results show that our QA-based decomposition paired with the fine-tuned LLM can correctly parse ~16% of utterances in the MTOP dataset without requiring any annotated data.

Via

Access Paper or Ask Questions