Muhammad Umer Ramzan

Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Mar 10, 2026

Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi

Abstract: Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
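The idea of fusing component-wise features back into one spatial layout can be illustrated with a toy example. The sketch below is not the paper's CGF module: the function name `fuse_components`, the scalar `gate_logit` per component, and the tiny 4x4 canvas are all assumptions made for illustration. It only shows the general pattern of blending each component patch into a canvas at the patch's original coordinates, with a sigmoid gate controlling how much of the patch is kept.

```python
import math


def sigmoid(z):
    """Standard logistic function, used here as the gate activation."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse_components(canvas, components):
    """Blend component patches into a copy of `canvas` at their original positions.

    components: list of (patch, top, left, gate_logit) tuples (hypothetical format).
    Each patch is a 2D list; (top, left) is where it sat in the full sketch.
    """
    fused = [row[:] for row in canvas]  # leave the input canvas untouched
    for patch, top, left, gate_logit in components:
        g = sigmoid(gate_logit)  # in a real model this gate would be learned
        for i, patch_row in enumerate(patch):
            for j, value in enumerate(patch_row):
                r, c = top + i, left + j  # coordinates are preserved, not re-estimated
                fused[r][c] = g * value + (1.0 - g) * fused[r][c]
    return fused


# Toy usage: one 2x2 "eye" patch placed at row 1, column 2 of an empty canvas.
canvas = [[0.0] * 4 for _ in range(4)]
eye_patch = [[1.0, 1.0], [1.0, 1.0]]
fused = fuse_components(canvas, [(eye_patch, 1, 2, 0.0)])
```

With a gate logit of 0 the gate is 0.5, so the patch and the canvas contribute equally inside the patch region, and cells outside it are unchanged.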

Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection

Mar 10, 2026

Ali Zia, Usman Ali, Muhammad Umer Ramzan, Hamza Abid, Abdul Rehman, Wei Xiang

Abstract: Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
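Two of the geometric ingredients named in this abstract have standard textbook forms, sketched below. This is not MM-VAD's implementation: the function names are hypothetical, the Poincaré ball with curvature -1 is an assumed choice of hyperbolic model, and the precision (inverse covariance) matrix is assumed to be supplied by the caller rather than estimated from features.

```python
import math


def exp_map_origin(v):
    """Map a Euclidean (tangent) vector into the unit Poincare ball at the origin."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm  # resulting point has norm tanh(||v||) < 1
    return [scale * x for x in v]


def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def mahalanobis(x, mean, precision):
    """Mahalanobis distance of x from mean, given the inverse covariance matrix."""
    diff = [a - m for a, m in zip(x, mean)]
    n = len(diff)
    # Quadratic form diff^T * precision * diff
    quad = sum(diff[i] * precision[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


# Toy usage with made-up vectors (not data from the paper).
point = exp_map_origin([0.3, 0.4])
dist_from_origin = poincare_distance([0.0, 0.0], point)
```

Under this map a tangent vector of Euclidean length t lands at hyperbolic distance 2t from the origin, so points far from the origin are pushed towards the boundary of the ball, which is the property usually cited for representing hierarchies.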

Leveraging Deep Learning with Multi-Head Attention for Accurate Extraction of Medicine from Handwritten Prescriptions

Dec 24, 2024

Usman Ali, Sahil Ranmbail, Muhammad Nadeem, Hamid Ishfaq, Muhammad Umer Ramzan, Waqas Ali

Abstract: Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
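For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between the reference and the transcription, divided by the reference length. The sketch below shows that definition together with one plausible way to match a noisy transcription against a list of known names. The medicine list and the use of `difflib.get_close_matches` are assumptions for illustration; the abstract does not say how the authors perform the database lookup.

```python
import difflib


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance / number of characters in the reference."""
    return edit_distance(ref, hyp) / len(ref)


def match_medicine(transcribed, known_names, cutoff=0.6):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(transcribed, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy usage with a hypothetical three-entry database.
known = ["paracetamol", "ibuprofen", "amoxicillin"]
cer = character_error_rate("aspirin", "asprin")
best = match_medicine("paracetmol", known)
```

Here a single dropped character in a seven-character word gives a CER of 1/7, and the misspelled "paracetmol" still resolves to "paracetamol".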

Gated-Attention Feature-Fusion Based Framework for Poverty Prediction

Nov 29, 2024

Muhammad Umer Ramzan, Wahab Khaddim, Muhammad Ehsan Rana, Usman Ali, Manohar Ali, Fiaz ul Hassan, Fatima Mehmood

Abstract: This research paper addresses the significant challenge of accurately estimating poverty levels using deep learning, particularly in developing regions where traditional methods like household surveys are often costly, infrequent, and quickly become outdated. To address these issues, we propose a state-of-the-art Convolutional Neural Network (CNN) architecture, extending the ResNet50 model by incorporating a Gated-Attention Feature-Fusion Module (GAFM). Our architecture is designed to improve the model's ability to capture and combine both global and local features from satellite images, leading to more accurate poverty estimates. The model achieves an R² score of 75%, significantly outperforming existing leading methods in poverty mapping. This improvement is due to the model's capacity to focus on and refine the most relevant features, filtering out unnecessary data, which makes it a powerful tool for remote sensing and poverty estimation.
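The R² figure quoted above is the coefficient of determination, which compares the model's squared error with that of always predicting the mean. A minimal sketch of the standard formula follows; the function name and the sample values are illustrative and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean-baseline error
    return 1.0 - ss_res / ss_tot


# Toy usage with made-up poverty-index values.
actual = [1.0, 2.0, 3.0]
perfect = r2_score(actual, [1.0, 2.0, 3.0])
baseline = r2_score(actual, [2.0, 2.0, 2.0])
```

A perfect predictor scores 1 and a predictor that always outputs the mean scores 0, so a value of 0.75 means the model removes three quarters of the squared error left by the mean baseline.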

* The paper has been accepted for publication at the 5th International Conference on Data Engineering and Communication Technology (ICDECT)

Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement

Nov 28, 2024

Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis, Ayman Elgharabawy, Ahmad Liaqat, Usman Ali

Abstract: This paper presents a novel deep-learning framework that significantly enhances the transformation of rudimentary face sketches into high-fidelity colour images. Employing a Convolutional Block Attention-based Auto-encoder Network (CA2N), our approach effectively captures and enhances critical facial features through a block attention mechanism within an encoder-decoder architecture. Subsequently, the framework utilises a noise-induced conditional Generative Adversarial Network (cGAN) process that allows the system to maintain high performance even on domains unseen during training. These enhancements lead to considerable improvements in image realism and fidelity, with our model outperforming the best competing method by FID margins of 17, 23, and 38 on the CelebAMask-HQ, CUHK, and CUFSF datasets, respectively. The model sets a new state-of-the-art in sketch-to-image generation, can generalize across sketch types, and offers a robust solution for applications such as criminal identification in law enforcement.

* Paper accepted for publication in the 25th International Conference on Digital Image Computing: Techniques & Applications (DICTA) 2024

Enhancing Vehicle Entrance and Parking Management: Deep Learning Solutions for Efficiency and Security

Dec 05, 2023

Muhammad Umer Ramzan, Usman Ali, Syed Haider Abbas Naqvi, Zeeshan Aslam, Tehseen, Husnain Ali, Muhammad Faheem

Abstract: The auto-management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and time-consuming. To solve this problem, we have used state-of-the-art deep learning models to automate vehicle entrance and parking for any organization. To ensure security, our system integrates vehicle detection, license number plate verification, and face detection and recognition models to confirm that the person and vehicle are registered with the organization. We trained multiple deep-learning models for vehicle detection, license number plate detection, face detection, and recognition; however, the YOLOv8n model outperformed all the other models. Furthermore, license plate recognition is facilitated by Google's Tesseract-OCR Engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.

* Accepted for publication in the 25th International Multitopic Conference (INMIC) IEEE 2023, 6 Pages, 3 figures
