Text classification is the process of categorizing text documents into predefined categories or labels.
Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., "with vocals" vs. "without vocals"), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.
From school playgrounds to corporate boardrooms, status hierarchies -- rank orderings based on respect and perceived competence -- are universal features of human social organization. Language models trained on human-generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they might reproduce such dynamics in multi-agent settings. This thesis investigates when and how language models form status hierarchies by adapting Berger et al.'s (1972) expectation states framework. I create multi-agent scenarios where separate language model instances complete sentiment classification tasks, are introduced with varying status characteristics (e.g., credentials, expertise), then have opportunities to revise their initial judgments after observing their partner's responses. The dependent variable is deference, the rate at which models shift their ratings toward their partner's position based on status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (35 percentage point asymmetry, p < .001), but capability differences dominate status cues, with the most striking effect being that high-status assignments reduce higher-capability models' deference rather than increasing lower-capability models' deference. The implications for AI safety are significant: status-seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.
This study investigates the feature representations produced by publicly available open source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations like classification accuracy do not fully reveal if they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the feature distribution of multiple image modalities extracted by some representative medical VLMs across lesion classification datasets on multiple modalities. These distributions were compared them with non-medical VLMs to assess the domain-specific medical training. Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, it was found that non-medical VLMs with recent improvement with contextual enrichment such as LLM2CLIP produce more refined feature representations. Our results imply that enhancing text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by overlaied text strings on images. These findings underscore the need for careful consideration on model selection according to downstream tasks besides potential risks in inference due to background biases such as textual information in images.
Graph Foundation Models (GFMs) have emerged as a frontier in graph learning, which are expected to deliver transferable representations across diverse tasks. However, GFMs remain constrained by in-memory bottlenecks: they attempt to encode knowledge into model parameters, which limits semantic capacity, introduces heavy lossy compression with conflicts, and entangles graph representation with the knowledge in ways that hinder efficient adaptation, undermining scalability and interpretability. In this work,we propose RAG-GFM, a Retrieval-Augmented Generation aided Graph Foundation Model that offloads knowledge from parameters and complements parameterized learning. To externalize graph knowledge, we build a dual-modal unified retrieval module, where a semantic store from prefix-structured text and a structural store from centrality-based motif. To preserve heterogeneous information, we design a dual-view alignment objective that contrasts both modalities to capture both content and relational patterns. To enable efficient downstream adaptation, we perform in-context augmentation to enrich supporting instances with retrieved texts and motifs as contextual evidence. Extensive experiments on five benchmark graph datasets demonstrate that RAG-GFM consistently outperforms 13 state-of-the-art baselines in both cross-domain node and graph classification, achieving superior effectiveness and efficiency.
While Large Language Models (LLMs) produce highly nuanced text simplifications, developers currently lack tools for a holistic, efficient, and reproducible diagnosis of their behavior. This paper introduces the Simplification Profiler, a diagnostic toolkit that generates a multidimensional, interpretable fingerprint of simplified texts. Multiple aggregated simplifications of a model result in a model's fingerprint. This novel evaluation paradigm is particularly vital for languages, where the data scarcity problem is magnified when creating flexible models for diverse target groups rather than a single, fixed simplification style. We propose that measuring a model's unique behavioral signature is more relevant in this context as an alternative to correlating metrics with human preferences. We operationalize this with a practical meta-evaluation of our fingerprints' descriptive power, which bypasses the need for large, human-rated datasets. This test measures if a simple linear classifier can reliably identify various model configurations by their created simplifications, confirming that our metrics are sensitive to a model's specific characteristics. The Profiler can distinguish high-level behavioral variations between prompting strategies and fine-grained changes from prompt engineering, including few-shot examples. Our complete feature set achieves classification F1-scores up to 71.9 %, improving upon simple baselines by over 48 percentage points. The Simplification Profiler thus offers developers a granular, actionable analysis to build more effective and truly adaptive text simplification systems.
Recognizing and navigating client resistance is critical for effective mental health counseling, yet detecting such behaviors is particularly challenging in text-based interactions. Existing NLP approaches oversimplify resistance categories, ignore the sequential dynamics of therapeutic interventions, and offer limited interpretability. To address these limitations, we propose PsyFIRE, a theoretically grounded framework capturing 13 fine-grained resistance behaviors alongside collaborative interactions. Based on PsyFIRE, we construct the ClientResistance corpus with 23,930 annotated utterances from real-world Chinese text-based counseling, each supported by context-specific rationales. Leveraging this dataset, we develop RECAP, a two-stage framework that detects resistance and fine-grained resistance types with explanations. RECAP achieves 91.25% F1 for distinguishing collaboration and resistance and 66.58% macro-F1 for fine-grained resistance categories classification, outperforming leading prompt-based LLM baselines by over 20 points. Applied to a separate counseling dataset and a pilot study with 62 counselors, RECAP reveals the prevalence of resistance, its negative impact on therapeutic relationships and demonstrates its potential to improve counselors' understanding and intervention strategies.
In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP DSA) framework to perform robust multimodal data fusion over heterogeneous sources, including text, audio, and images. The requested approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to boost the informative signals, and effective node embedding with Multi-scale Graph Convolutional Networks (GCNs). Semantic aware attention mechanism: each modality may dynamically contribute to the context with respect to contextual relevance. The experimental outcomes on three benchmark datasets, including CMU-MOSEI, AVE, and MM-IMDB, show that AGSP-DSA performs as the state of the art. More precisely, it achieves 95.3% accuracy, 0.936 F1-score, and 0.924 mAP on CMU-MOSEI, improving MM-GNN by 2.6 percent in accuracy. It gets 93.4% accuracy and 0.911 F1-score on AVE and 91.8% accuracy and 0.886 F1-score on MM-IMDB, which demonstrate good generalization and robustness in the missing modality setting. These findings verify the efficiency of AGSP-DSA in promoting multimodal learning in sentiment analysis, event recognition and multimedia classification.
Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.
Qualitative research often contains personal, contextual, and organizational details that pose privacy risks if not handled appropriately. Manual anonymization is time-consuming, inconsistent, and frequently omits critical identifiers. Existing automated tools tend to rely on pattern matching or fixed rules, which fail to capture context and may alter the meaning of the data. This study uses local LLMs to build a reliable, repeatable, and context-aware anonymization process for detecting and anonymizing sensitive data in qualitative transcripts. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that includes three steps: detection, classification, and adaptive anonymization. The SFAA incorporates four anonymization strategies: rule-based substitution, context-aware rewriting, generalization, and suppression. These strategies are applied based on the identifier type and the risk level. The identifiers handled by the SFAA are guided by major international privacy and research ethics standards, including the GDPR, HIPAA, and OECD guidelines. This study followed a dual-method evaluation that combined manual and LLM-assisted processing. Two case studies were used to support the evaluation. The first includes 82 face-to-face interviews on gamification in organizations. The second involves 93 machine-led interviews using an AI-powered interviewer to test LLM awareness and workplace privacy. Two local models, LLaMA and Phi were used to evaluate the performance of the proposed framework. The results indicate that the LLMs found more sensitive data than a human reviewer. Phi outperformed LLaMA in finding sensitive data, but made slightly more errors. Phi was able to find over 91% of the sensitive data and 94.8% kept the same sentiment as the original text, which means it was very accurate, hence, it does not affect the analysis of the qualitative data.
Interpretability is significant in computational pathology, leading to the development of multimodal information integration from histopathological image and corresponding text data.However, existing multimodal methods have limited interpretability due to the lack of high-quality dataset that support explicit reasoning and inference and simple reasoning process.To address the above problems, we introduce a novel multimodal pathology large language model with strong reasoning capabilities.To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization.We construct a high-quality pathology visual question answering (VQA) dataset, specifically designed to support complex reasoning tasks.Comprehensive experiments conducted on this dataset demonstrate that our method outperforms state-of-the-art methods, even when trained with only 20% of the data.Our method also achieves comparable performance on downstream zero-shot image classification task compared with CLIP.