Text classification is the process of categorizing text documents into predefined categories or labels.
Multimodal Attributed Graphs (MAGs) have been widely adopted for modeling complex systems by integrating multi-modal information, such as text and images, on nodes. However, we identify a discrepancy between the implicit semantic structure induced by different modality embeddings and the explicit graph structure. For instance, neighbors in the explicit graph structure may be close in one modality but distant in another. Since existing methods typically perform message passing over the fixed explicit graph structure, they inadvertently aggregate dissimilar features, introducing modality-specific noise and impeding effective node representation learning. To address this, we propose OptiMAG, an Unbalanced Optimal Transport-based regularization framework. OptiMAG employs the Fused Gromov-Wasserstein distance to explicitly guide cross-modal structural consistency within local neighborhoods, effectively mitigating structural-semantic conflicts. Moreover, a KL divergence penalty enables adaptive handling of cross-modal inconsistencies. This framework can be seamlessly integrated into existing multimodal graph models, acting as an effective drop-in regularizer. Experiments demonstrate that OptiMAG consistently outperforms baselines across multiple tasks, ranging from graph-centric tasks (e.g., node classification, link prediction) to multimodal-centric generation tasks (e.g., graph2text, graph2image). The source code will be available upon acceptance.
Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1). The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances, yielding a relatively balanced class distribution. By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage than existing Bengali sentiment datasets and provides a strong foundation for training and benchmarking deep learning models. Baseline experimental results are also reported to demonstrate the practical usability of the dataset. The dataset is publicly available at https://www.kaggle.com/datasets/akifislam/bengalisent140/
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
Graph Foundation Models (GFMs) have emerged as a frontier in graph learning, which are expected to deliver transferable representations across diverse tasks. However, GFMs remain constrained by in-memory bottlenecks: they attempt to encode knowledge into model parameters, which limits semantic capacity, introduces heavy lossy compression with conflicts, and entangles graph representation with the knowledge in ways that hinder efficient adaptation, undermining scalability and interpretability. In this work,we propose RAG-GFM, a Retrieval-Augmented Generation aided Graph Foundation Model that offloads knowledge from parameters and complements parameterized learning. To externalize graph knowledge, we build a dual-modal unified retrieval module, where a semantic store from prefix-structured text and a structural store from centrality-based motif. To preserve heterogeneous information, we design a dual-view alignment objective that contrasts both modalities to capture both content and relational patterns. To enable efficient downstream adaptation, we perform in-context augmentation to enrich supporting instances with retrieved texts and motifs as contextual evidence. Extensive experiments on five benchmark graph datasets demonstrate that RAG-GFM consistently outperforms 13 state-of-the-art baselines in both cross-domain node and graph classification, achieving superior effectiveness and efficiency.
Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short and fixed duration, which constrains their usage in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
Learning representative embeddings for different types of speaking styles, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In this work, we introduce ParaMETA, a unified and flexible framework for learning and controlling speaking styles directly from speech. Unlike existing methods that rely on single-task models or cross-modal alignment, ParaMETA learns disentangled, task-specific embeddings by projecting speech into dedicated subspaces for each type of style. This design reduces inter-task interference, mitigates negative transfer, and allows a single model to handle multiple paralinguistic tasks such as emotion, gender, age, and language classification. Beyond recognition, ParaMETA enables fine-grained style control in Text-To-Speech (TTS) generative models. It supports both speech- and text-based prompting and allows users to modify one speaking styles while preserving others. Extensive experiments demonstrate that ParaMETA outperforms strong baselines in classification accuracy and generates more natural and expressive speech, while maintaining a lightweight and efficient model suitable for real-world applications.
Knowledge Graphs~(KGs) often suffer from unreliable knowledge, which restricts their utility. Triple Classification~(TC) aims to determine the validity of triples from KGs. Recently, text-based methods learn entity and relation representations from natural language descriptions, significantly improving the generalization capabilities of TC models and setting new benchmarks in performance. However, there are still two critical challenges. First, existing methods often ignore the effective semantic interaction among different KG components. Second, most approaches adopt single binary classification training objective, leading to insufficient semantic representation learning. To address these challenges, we propose \textbf{SASA}, a novel framework designed to enhance TC models via separated attention mechanism and semantic-aware contrastive learning~(CL). Specifically, we first propose separated attention mechanism to encode triples into decoupled contextual representations and then fuse them through a more effective interactive way. Then, we introduce semantic-aware hierarchical CL as auxiliary training objective to guide models in improving their discriminative capabilities and achieving sufficient semantic learning, considering both local level and global level CL. Experimental results across two benchmark datasets demonstrate that SASA significantly outperforms state-of-the-art methods. In terms of accuracy, we advance the state-of-the-art by +5.9\% on FB15k-237 and +3.4\% on YAGO3-10.
Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31\% accuracy, 90.78\% AUC for classification, and 57.24\% mean Dice score for localization. The proposed method achieves an F1-score of 88.61\% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
From school playgrounds to corporate boardrooms, status hierarchies -- rank orderings based on respect and perceived competence -- are universal features of human social organization. Language models trained on human-generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they might reproduce such dynamics in multi-agent settings. This thesis investigates when and how language models form status hierarchies by adapting Berger et al.'s (1972) expectation states framework. I create multi-agent scenarios where separate language model instances complete sentiment classification tasks, are introduced with varying status characteristics (e.g., credentials, expertise), then have opportunities to revise their initial judgments after observing their partner's responses. The dependent variable is deference, the rate at which models shift their ratings toward their partner's position based on status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (35 percentage point asymmetry, p < .001), but capability differences dominate status cues, with the most striking effect being that high-status assignments reduce higher-capability models' deference rather than increasing lower-capability models' deference. The implications for AI safety are significant: status-seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.