Abstract: This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. Our approach addresses the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. We decompose the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts; the model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. For column segmentation, we use YOLOv11x to separate the columns of text to further enhance performance; this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. Gemini-2.5-Pro achieves the lowest word error rate (WER), 0.133.
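To make the WER figure in the OCR abstract above concrete, here is a minimal sketch of the standard word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The whitespace tokenization and the function name `word_error_rate` are assumptions for illustration; the abstract does not state how the authors tokenize or normalize Urdu text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace. The paper's actual
    tokenization and text normalization are not given in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (d)
# against four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 0.133 corresponds to roughly 13 word-level errors per 100 reference words.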
Abstract: This paper presents and evaluates multi-agent workflows for synthetic Preference Optimization (PO) dataset generation. PO dataset generation requires two modules: (1) response evaluation and (2) response generation. In the response evaluation module, the responses from Large Language Models (LLMs) are evaluated and ranked, a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a two-step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. In each step, we measure inter-rater agreement between human annotators and LLMs using Cohen's Kappa. For the response generation module, we compare different configurations of the LLM Feedback Loop using the identified LLM evaluator configuration. We use the win rate (the fraction of times a generation framework is selected as the best by an LLM evaluator) to determine the best multi-agent configuration for generation. After identifying the best configurations for both modules, we use models from the GPT, Gemma, and Llama families to generate our PO datasets with this pipeline. We generate two types of PO datasets: one to improve the generation capabilities of individual LLMs and the other to improve the multi-agent workflow. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across datasets when the candidate responses do not include responses from the GPT family. Additionally, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves win rates of 71.8% and 73.8% over single-agent Llama and Gemma, respectively.
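The two metrics named in the preference-optimization abstract above can be sketched in a few lines. The block below computes Cohen's Kappa between two raters and a win rate as defined in the abstract. The function names, the label values, and the toy data are all hypothetical, and the convention of returning 1.0 when chance agreement is total is an assumption, since Kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Assumed convention: both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def win_rate(selections, framework):
    """Fraction of comparisons in which `framework` was picked as best."""
    if not selections:
        raise ValueError("selections must be non-empty")
    return sum(choice == framework for choice in selections) / len(selections)


# Toy data (hypothetical): which of two responses, "A" or "B", each rater prefers.
human = ["A", "A", "B", "B", "A", "B"]
llm_judge = ["A", "B", "B", "B", "A", "A"]
print(cohens_kappa(human, llm_judge))

# Toy data (hypothetical): evaluator picks per prompt.
picks = ["feedback_loop", "single_agent", "feedback_loop", "feedback_loop"]
print(win_rate(picks, "feedback_loop"))
```

In the toy example the raters agree on 4 of 6 items while chance agreement is 0.5, so Kappa is well below the raw agreement rate; this correction for chance is the reason to report Kappa rather than raw agreement.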
Abstract: This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the better translation model of two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results: XLM-RoBERTa-XL achieves an F1 score of 85.99 and an exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.
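The UQA abstract above names EATS (Enclose to Anchor, Translate, Seek) but does not spell out its details, so the following is only one plausible reading of the name, not the authors' implementation: enclose the answer span in marker tokens, translate the marked context, then seek the markers in the translation to recover the answer span and its new offset. The marker strings, the function names, and the lookup-table `toy_translate` (a stand-in for a real translation system) are all hypothetical.

```python
OPEN, CLOSE = "[[", "]]"  # hypothetical anchor markers


def enclose(context, answer_start, answer_text):
    """Wrap the answer span in markers so it can be found after translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated):
    """Locate the markers; return (clean_context, answer_start, answer_text)."""
    start = translated.find(OPEN)
    end = translated.find(CLOSE, start + len(OPEN)) if start != -1 else -1
    if start == -1 or end == -1:
        return None  # markers lost in translation; caller may drop the example
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, start, answer


def eats(context, answer_start, answer_text, translate):
    """Enclose, translate with the supplied callable, then seek."""
    return seek(translate(enclose(context, answer_start, answer_text)))


def toy_translate(text):
    """Hypothetical stand-in for a machine translation system (lookup table)."""
    table = {
        "The capital of France is [[Paris]].": "فرانس کا دارالحکومت [[پیرس]] ہے۔",
    }
    return table[text]


context = "The capital of France is Paris."
result = eats(context, context.index("Paris"), "Paris", toy_translate)
clean_context, new_start, new_answer = result
print(clean_context, new_start, new_answer)
```

The point of the markers is that character offsets from the English context are meaningless after translation, since word order and lengths change; anchoring lets the new offset be read directly from the translated text. A real translation model may alter or drop the markers, which is why `seek` returns `None` rather than guessing.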