Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gyanendra Das

DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

Apr 09, 2026

Gyanendra Das, Sai Satyam Jena

Abstract:Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA) which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision language representations. This process structurally isolates concepts, enabling precise, non interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi term loss function for maintaining task fidelity, edit locality, and cross modal alignment. With the base model frozen, our method achieves 98 percent single edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA state of the art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.

* Accepted at CVPR 2026

Via

Access Paper or Ask Questions

MABNet: Master Assistant Buddy Network with Hybrid Learning for Image Retrieval

Mar 06, 2023

Rohit Agarwal, Gyanendra Das, Saksham Aggarwal, Alexander Horsch, Dilip K. Prasad

Figure 1 for MABNet: Master Assistant Buddy Network with Hybrid Learning for Image Retrieval

Figure 2 for MABNet: Master Assistant Buddy Network with Hybrid Learning for Image Retrieval

Figure 3 for MABNet: Master Assistant Buddy Network with Hybrid Learning for Image Retrieval

Figure 4 for MABNet: Master Assistant Buddy Network with Hybrid Learning for Image Retrieval

Abstract:Image retrieval has garnered growing interest in recent times. The current approaches are either supervised or self-supervised. These methods do not exploit the benefits of hybrid learning using both supervision and self-supervision. We present a novel Master Assistant Buddy Network (MABNet) for image retrieval which incorporates both learning mechanisms. MABNet consists of master and assistant blocks, both learning independently through supervision and collectively via self-supervision. The master guides the assistant by providing its knowledge base as a reference for self-supervision and the assistant reports its knowledge back to the master by weight transfer. We perform extensive experiments on public datasets with and without post-processing.

* Accepted at International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023

Via

Access Paper or Ask Questions

MAViC: Multimodal Active Learning for Video Captioning

Dec 11, 2022

Gyanendra Das, Xavier Thomas, Anant Raj, Vikram Gupta

Figure 1 for MAViC: Multimodal Active Learning for Video Captioning

Figure 2 for MAViC: Multimodal Active Learning for Video Captioning

Figure 3 for MAViC: Multimodal Active Learning for Video Captioning

Figure 4 for MAViC: Multimodal Active Learning for Video Captioning

Abstract:A large number of annotated video-caption pairs are required for training video captioning models, resulting in high annotation costs. Active learning can be instrumental in reducing these annotation requirements. However, active learning for video captioning is challenging because multiple semantically similar captions are valid for a video, resulting in high entropy outputs even for less-informative samples. Moreover, video captioning algorithms are multimodal in nature with a visual encoder and language decoder. Further, the sequential and combinatorial nature of the output makes the problem even more challenging. In this paper, we introduce MAViC which leverages our proposed Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition function to address the challenges of active learning approaches for video captioning. Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function. Our detailed experiments empirically demonstrate the efficacy of M-SASE for active learning for video captioning and improve on the baselines by a large margin.

Via

Access Paper or Ask Questions

GPTs at Factify 2022: Prompt Aided Fact-Verification

Jun 29, 2022

Pawan Kumar Sahu, Saksham Aggarwal, Taneesh Gupta, Gyanendra Das

Figure 1 for GPTs at Factify 2022: Prompt Aided Fact-Verification

Figure 2 for GPTs at Factify 2022: Prompt Aided Fact-Verification

Figure 3 for GPTs at Factify 2022: Prompt Aided Fact-Verification

Abstract:One of the most pressing societal issues is the fight against false news. The false claims, as difficult as they are to expose, create a lot of damage. To tackle the problem, fact verification becomes crucial and thus has been a topic of interest among diverse research communities. Using only the textual form of data we propose our solution to the problem and achieve competitive results with other approaches. We present our solution based on two approaches - PLM (pre-trained language model) based method and Prompt based method. The PLM-based approach uses the traditional supervised learning, where the model is trained to take 'x' as input and output prediction 'y' as P(y|x). Whereas, Prompt-based learning reflects the idea to design input to fit the model such that the original objective may be re-framed as a problem of (masked) language modeling. We may further stimulate the rich knowledge provided by PLMs to better serve downstream tasks by employing extra prompts to fine-tune PLMs. Our experiments showed that the proposed method performs better than just fine-tuning PLMs. We achieved an F1 score of 0.6946 on the FACTIFY dataset and a 7th position on the competition leader-board.

* Accepted in AAAI'22: First Workshop on Multimodal Fact-Checking and Hate Speech Detection, Februrary 22 - March 1, 2022,Vancouver, BC, Canada

Via

Access Paper or Ask Questions