Haq Nawaz Malik

SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages - Breaking the Data Barrier

Jan 22, 2026
Via arXiv

600K-KS-OCR: A Large-Scale Synthetic Dataset for Optical Character Recognition in Kashmiri Script

Jan 03, 2026
Via arXiv

KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for Large Language Model Pretraining

Jan 03, 2026
Via arXiv