Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Umapada Pal

Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity

Jan 14, 2026

Ritabrata Chakraborty, Hrishit Mitra, Shivakumara Palaiahnakote, Umapada Pal

Abstract:Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train--test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at \href{[https://github.com/Ritabrata04/cdod-icpr.git}{https://github.com/Ritabrata04/cdod-icpr}.

* 15 pages, 4 figures, 6 tables

Via

Access Paper or Ask Questions

TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

Sep 04, 2025

Ayan Banerjee, Josep Lladós, Umapada Pal, Anjan Dutta

Abstract:Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.

Via

Access Paper or Ask Questions

Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Mar 31, 2025

Swarnava Bhattacharyya, Umapada Pal, Tapabrata Chakraborti

Figure 1 for Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Figure 2 for Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Figure 3 for Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Figure 4 for Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Abstract:Deep learning based diagnostic AI systems based on medical images are starting to provide similar performance as human experts. However these data hungry complex systems are inherently black boxes and therefore slow to be adopted for high risk applications like healthcare. This problem of lack of transparency is exacerbated in the case of recent large foundation models, which are trained in a self supervised manner on millions of data points to provide robust generalisation across a range of downstream tasks, but the embeddings generated from them happen through a process that is not interpretable, and hence not easily trustable for clinical applications. To address this timely issue, we deploy conformal analysis to quantify the predictive uncertainty of a vision transformer (ViT) based foundation model across patient demographics with respect to sex, age and ethnicity for the tasks of skin lesion classification using several public benchmark datasets. The significant advantage of this method is that conformal analysis is method independent and it not only provides a coverage guarantee at population level but also provides an uncertainty score for each individual. We used a model-agnostic dynamic F1-score-based sampling during model training, which helped to stabilize the class imbalance and we investigate the effects on uncertainty quantification (UQ) with or without this bias mitigation step. Thus we show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model (Google DermFoundation) and thus advance the trustworthiness and fairness of clinical AI.

Via

Access Paper or Ask Questions

A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Mar 19, 2025

Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu

Figure 1 for A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Figure 2 for A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Figure 3 for A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Figure 4 for A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Abstract:Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection that follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context.Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.

Via

Access Paper or Ask Questions

d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Feb 19, 2025

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Figure 1 for d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Figure 2 for d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Figure 3 for d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Figure 4 for d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Abstract:Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.

* Accepted in The International Conference on Pattern Recognition (ICPR) 2024

Via

Access Paper or Ask Questions

Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Feb 19, 2025

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Figure 1 for Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Figure 2 for Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Figure 3 for Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Figure 4 for Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Abstract:Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations make the problem challenging and non-trivial. However, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.

* 11 pages

Via

Access Paper or Ask Questions

Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Oct 02, 2024

Arkadip Maitra, Shree Mitra, Siladittya Manna, Saumik Bhattacharya, Umapada Pal

Figure 1 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Figure 2 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Figure 3 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Figure 4 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Abstract:Self-supervised learning has developed rapidly over the last decade and has been applied in many areas of computer vision. Decorrelation-based self-supervised pretraining has shown great promise among non-contrastive algorithms, yielding performance at par with supervised and contrastive self-supervised baselines. In this work, we explore the decorrelation-based paradigm of self-supervised learning and apply the same to learning disentangled stroke features for writer identification. Here we propose a modified formulation of the decorrelation-based framework named SWIS which was proposed for signature verification by standardizing the features along each dimension on top of the existing framework. We show that the proposed framework outperforms the contemporary self-supervised learning framework on the writer identification benchmark and also outperforms several supervised methods as well. To the best of our knowledge, this work is the first of its kind to apply self-supervised learning for learning representations for writer verification tasks.

Via

Access Paper or Ask Questions

FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Aug 27, 2024

Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós, Saumik Bhattacharya

Figure 1 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Figure 2 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Figure 3 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Figure 4 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Abstract:The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speeds while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released in our Github.

* Accepted in ICPR 2024

Via

Access Paper or Ask Questions

Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images

Aug 12, 2024

Siladittya Manna, Saumik Bhattacharya, Umapada Pal

Figure 1 for Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images

Figure 2 for Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images

Figure 3 for Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images

Figure 4 for Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images

Abstract:Medical image segmentation is one of the domains where sufficient annotated data is not available. This necessitates the application of low-data frameworks like few-shot learning. Contemporary prototype-based frameworks often do not account for the variation in features within the support and query images, giving rise to a large variance in prototype alignment. In this work, we adopt a prototype-based self-supervised one-way one-shot learning framework using pseudo-labels generated from superpixels to learn the semantic segmentation task itself. We use a correlation-based probability score to generate a dynamic prototype for each query pixel from the bag of prototypes obtained from the support feature map. This weighting scheme helps to give a higher weightage to contextually related prototypes. We also propose a quadrant masking strategy in the downstream segmentation task by utilizing prior domain information to discard unwanted false positives. We present extensive experimentations and evaluations on abdominal CT and MR datasets to show that the proposed simple but potent framework performs at par with the state-of-the-art methods.

* Accepted to ICPR 2024

Via

Access Paper or Ask Questions

MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification

May 29, 2024

Miguel A. Ferrer, Abhijit Das, Moises Diaz, Aythami Morales, Cristina Carmona-Duarte, Umapada Pal

Abstract:Script identification plays a vital role in applications that involve handwriting and document analysis within a multi-script and multi-lingual environment. Moreover, it exhibits a profound connection with human cognition. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 and 86,655 lines and words, respectively, in the dataset. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten letters are also given. The new multi-lingual database is expected to create new script identifiers, present various challenges, including identifying handwritten and printed samples and serve as a foundation for future research in script identification based on the reported results of the three benchmarks.

* Cognitive Computation, Volume 16, pages 131 to 157,(2024)

Via

Access Paper or Ask Questions