Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only the linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify the writers of written texts (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text, with capitalization and punctuation, and a normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
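A minimal sketch of the kind of stylometric comparison described above, using simple character-level and word-level features; the feature choices and the two toy transcripts are illustrative assumptions, not the actual StyloSpeaker feature set or data.

```python
# Toy same-speaker comparison: character n-gram profiles plus two word-level
# style cues, compared with cosine similarity. Illustrative only.
from collections import Counter
import math

def char_ngrams(text, n=3):
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def style_vector(text):
    words = text.split()
    return {
        "mean_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
    }

def cosine(c1, c2):
    keys = set(c1) | set(c2)
    dot = sum(c1.get(k, 0) * c2.get(k, 0) for k in keys)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

t1 = "well i think we should head out before it rains"
t2 = "i think we should probably leave before the rain starts"
print(cosine(char_ngrams(t1), char_ngrams(t2)))  # character-level style similarity
print(style_vector(t1), style_vector(t2))        # simple word-level style cues
```

A same-speaker decision could then be made by thresholding or classifying such similarity scores across many feature families.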
Social bots are now deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot-detection systems still treat behavioural features as static, implicitly assuming that bot behaviour is stationary over time. We test that assumption for promotional Twitter bots, analysing change both in individual behavioural signals and in the relationships between them. Using 2,615 promotional bot accounts and 2.8M tweets, we build yearly time series for ten content-based meta-features. Augmented Dickey-Fuller and KPSS tests, together with linear trend fits, show that all ten are non-stationary: nine increase over time, while language diversity declines slightly. Stratifying by activation generation and account age reveals systematic differences: second-generation bots are the most active and link-heavy; short-lived bots show intense, repetitive activity with heavy hashtag/URL use; long-lived bots are less active but more linguistically diverse and use emojis more variably. We then analyse feature co-occurrence across generations using 18 interpretable binary features spanning actions, topic similarity, URLs, hashtags, sentiment, emojis, and media (153 feature pairs). Chi-square tests indicate that almost all pairs are dependent. Spearman correlations shift in strength and sometimes in polarity: many links (e.g. multiple hashtags with media; sentiment with URLs) strengthen, while others flip from weakly positive to weakly or moderately negative. Later generations show more structured combinations of cues. Taken together, these studies provide evidence that promotional social bots adapt over time at both the level of individual meta-features and the level of feature interdependencies, with direct implications for the design and evaluation of bot-detection systems trained on historical behavioural features.
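As a hedged illustration of the tests named above, the sketch below applies Augmented Dickey-Fuller and KPSS tests plus a linear trend fit to one synthetic meta-feature series, and checks pairwise dependence for two made-up binary cues with a chi-square test and Spearman correlation; the data, feature names, and series length are assumptions, not the paper's.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss
from scipy.stats import chi2_contingency, spearmanr

rng = np.random.default_rng(0)
t = np.arange(40)
feature = 0.02 * t + rng.normal(0, 0.05, t.size)  # synthetic meta-feature with upward drift

adf_stat, adf_p, *_ = adfuller(feature, autolag="AIC")               # H0: unit root (non-stationary)
kpss_stat, kpss_p, *_ = kpss(feature, regression="c", nlags="auto")  # H0: level-stationary
slope, _ = np.polyfit(t, feature, 1)
print(f"ADF p={adf_p:.3f}  KPSS p={kpss_p:.3f}  trend slope={slope:.4f} per step")

# Dependence between two hypothetical binary cues (e.g. has_url x has_hashtag):
table = np.array([[500, 120], [80, 300]])  # made-up co-occurrence counts
chi2, chi_p, dof, _ = chi2_contingency(table)
rho, rho_p = spearmanr([0, 1, 1, 0, 1, 1], [0, 1, 0, 0, 1, 1])
print(f"chi2 p={chi_p:.3g}  spearman rho={rho:.2f}")
```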
Encyclopedic knowledge platforms are key gateways through which users explore information online. The recent release of Grokipedia, a fully AI-generated encyclopedia, introduces a new alternative to traditional, well-established platforms like Wikipedia. In this context, search engine mechanisms play an important role in guiding users' exploratory paths, yet their behavior across different encyclopedic systems remains underexplored. In this work, we address this gap by providing the first comparative analysis of the search engines of Wikipedia and Grokipedia. Using nearly 10,000 neutral English words and their substrings as queries, we collect over 70,000 search engine results and examine their semantic alignment, overlap, and topical structure. We find that both platforms frequently generate results that are weakly related to the original query and, in many cases, surface unexpected content starting from innocuous queries. Despite these shared properties, the two systems often produce substantially different recommendation sets for the same query. Through topical annotation and trajectory analysis, we further identify systematic differences in how content categories are surfaced and how search engine results evolve over multiple stages of exploration. Overall, our findings show that unexpected search engine outcomes are a common feature of both platforms, even though the platforms exhibit discrepancies in terms of topical distribution and query suggestions.
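A minimal sketch of the overlap part of such an analysis, assuming hypothetical queries and result titles: Jaccard overlap between the result sets the two platforms return for the same query, with the embedding-based semantic-alignment step only indicated in a comment.

```python
# Compare result sets per query with Jaccard overlap. Queries and titles are made up.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

wikipedia_results = {"table": ["Table (furniture)", "Table (database)", "Periodic table"]}
grokipedia_results = {"table": ["Table (furniture)", "Table tennis", "Water table"]}

for query in wikipedia_results:
    print(query, jaccard(wikipedia_results[query], grokipedia_results[query]))

# Semantic alignment could then be scored by embedding the query and each result
# title with a sentence encoder and averaging query-result cosine similarity.
```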




Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods, Genetic Algorithm (GA) and Evolution Strategy (ES), and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
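A hedged sketch of the black-box objective described above, assuming scikit-learn's LDA implementation and a toy corpus: f(T) trains an LDA model with T topics and returns validation perplexity, and a simple budgeted random search stands in for the optimizers (GA, ES, PABBO, and SABBO are not reimplemented here).

```python
# Black-box objective over the discrete topic count T, with a toy budgeted search.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import random

docs = ["cats purr and sleep", "dogs bark at cats", "stocks rose on earnings",
        "markets fell after the report", "the goalie saved the penalty",
        "the striker scored twice"] * 5
X = CountVectorizer().fit_transform(docs)
X_train, X_val = X[:24], X[24:]

def objective(T):
    lda = LatentDirichletAllocation(n_components=T, random_state=0).fit(X_train)
    return lda.perplexity(X_val)  # lower is better

random.seed(0)
budget, best = 5, (None, float("inf"))
for _ in range(budget):              # stand-in for a budgeted black-box optimizer
    T = random.randint(2, 20)
    score = objective(T)
    if score < best[1]:
        best = (T, score)
print("best T and validation perplexity:", best)
```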
Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a task in video question answering (VideoQA) research, where questions are generated for given answers. Evaluation of VQG models typically focuses on whether the generated questions can be answered, rather than on the quality of the questions themselves. In contrast, we focus on question quality in eliciting unseen knowledge from human experts. To support continuous improvement of VQG models, we propose a protocol that evaluates this ability by simulating question-answering communication with experts through question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D's expert commentary annotations. The EgoExoAsk training set is used to train the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate that our metric aligns reasonably well with the question-generation settings: models with access to richer context are evaluated more favorably, supporting that our protocol works as intended. The EgoExoAsk dataset is available at https://github.com/omron-sinicx/VQG4ExpertKnowledge .
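A hedged sketch of question-to-answer retrieval as an evaluation signal, with an off-the-shelf sentence encoder standing in for the retriever trained on EgoExoAsk; the question, candidate answers, and encoder name are illustrative assumptions.

```python
# Embed a generated question, rank candidate expert answers by cosine similarity,
# and report the rank of the intended answer (placed first in the list).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder
question = "How does the climber keep their hips close to the wall?"
candidate_answers = [
    "They rotate the knee inward and drive from the toes to stay over the feet.",
    "The sauce should be reduced until it coats the back of a spoon.",
    "Keeping the elbows tucked makes the paddle stroke more efficient.",
]
q_emb = encoder.encode(question, convert_to_tensor=True)
a_emb = encoder.encode(candidate_answers, convert_to_tensor=True)
scores = util.cos_sim(q_emb, a_emb)[0]
rank = int((scores > scores[0]).sum()) + 1
print("retrieval rank of target answer:", rank)
```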
Many large language models (LLMs) are trained on a massive body of knowledge present on the Internet. Darth Vecdor (DV) was designed to extract this knowledge into a structured, terminology-mapped SQL database ("knowledge base" or "knowledge graph"). Knowledge graphs may be useful in many domains, including healthcare. Although one might query an LLM directly rather than a SQL-based knowledge graph, concerns such as cost, speed, safety, and confidence may arise, especially in high-volume operations. These may be mitigated when the information is pre-extracted from the LLM and becomes queryable through a standard database. However, the author found that several issues needed to be addressed, including erroneous, off-topic, free-text, overly general, and inconsistent LLM responses, as well as the need to allow for multi-element responses. DV was built with features intended to mitigate these issues. To facilitate ease of use, and to allow for prompt engineering by those with domain expertise but little technical background, DV provides a simple, browser-based graphical user interface. DV has been released as free, open-source, extensible software, on an "as is" basis, without warranties or conditions of any kind, either express or implied. Users need to be cognizant of the potential risks and benefits of using DV and its outputs, and users are responsible for ensuring any use is safe and effective. DV should be assumed to have bugs, potentially very serious ones. However, the author hopes that appropriate use of current and future versions of DV and its outputs can help improve healthcare.
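Purely as an illustration of what a pre-extracted, queryable knowledge base can look like, the sketch below builds a hypothetical triple-style SQLite table and runs a simple lookup; this is not DV's actual schema, terminology mapping, or output format.

```python
# Hypothetical triple-style knowledge-base table and query (illustrative only).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facts (subject TEXT, predicate TEXT, object TEXT, confidence REAL)")
con.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", [
    ("lisinopril", "drug_class", "ACE inhibitor", 0.97),
    ("lisinopril", "may_treat", "hypertension", 0.95),
    ("lisinopril", "adverse_effect", "dry cough", 0.90),
])
for row in con.execute(
    "SELECT predicate, object, confidence FROM facts WHERE subject = ? ORDER BY confidence DESC",
    ("lisinopril",),
):
    print(row)
```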




As generative artificial intelligence (GenAI) becomes increasingly capable of delivering personalized learning experiences and real-time feedback, a growing number of students are incorporating these tools into their academic workflows. They use GenAI to clarify concepts, solve complex problems, and, in some cases, complete assignments by copying and pasting model-generated content. While GenAI has the potential to enhance the learning experience, it also raises concerns around misinformation, hallucinated outputs, and its potential to undermine critical thinking and problem-solving skills. In response, many universities, colleges, departments, and instructors have begun to develop and adopt policies to guide the responsible integration of GenAI into learning environments. However, these policies vary widely across institutions and contexts, and their evolving nature often leaves students uncertain about expectations and best practices. To address this challenge, the authors designed and implemented an automated system for discovering and categorizing AI-related policies found in course syllabi and institutional policy websites. The system combines unsupervised topic modeling techniques to identify key policy themes with large language models (LLMs) to classify the level of GenAI allowance and other requirements in policy texts. The developed application achieved a coherence score of 0.73 for topic discovery. In addition, GPT-4.0-based classification of policy categories achieved precision between 0.92 and 0.97 and recall between 0.85 and 0.97 across eight identified topics. By providing structured and interpretable policy information, this tool promotes the safe, equitable, and pedagogically aligned use of GenAI technologies in education. Furthermore, the system can be integrated into educational technology platforms to help students understand and comply with relevant guidelines.
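A hedged sketch of the topic-discovery evaluation step, assuming gensim's LDA and c_v coherence on a toy set of policy sentences; the corpus, topic count, and preprocessing are illustrative and not the authors' pipeline.

```python
# Fit a small topic model on tokenized policy sentences and report c_v coherence,
# the kind of coherence score cited in the abstract (0.73). Toy data only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

policy_texts = [
    "students may use generative ai tools with citation".split(),
    "ai generated content must be disclosed in submissions".split(),
    "use of chatgpt on exams is prohibited".split(),
    "instructors decide whether ai assistance is permitted".split(),
] * 5
dictionary = Dictionary(policy_texts)
corpus = [dictionary.doc2bow(t) for t in policy_texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, random_state=0, passes=5)
coherence = CoherenceModel(model=lda, texts=policy_texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"c_v coherence: {coherence:.2f}")
```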
LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.




Large language models (LLMs) are increasingly consulted by parents for pediatric guidance, yet their safety under real-world adversarial pressures is poorly understood. Anxious parents often use urgent language that can compromise model safeguards, potentially leading to harmful advice. PediatricAnxietyBench is an open-source benchmark of 300 high-quality queries across 10 pediatric topics (150 patient-derived, 150 adversarial) enabling reproducible evaluation. Two Llama models (70B and 8B) were assessed using a multi-dimensional safety framework covering diagnostic restraint, referral adherence, hedging, and emergency recognition. Adversarial queries incorporated parental pressure patterns, including urgency, economic barriers, and challenges to disclaimers. The mean safety score was 5.50/15 (SD=2.41). The 70B model outperformed the 8B model (6.26 vs 4.95, p<0.001) with a lower critical-failure rate (4.8% vs 12.0%, p=0.02). Adversarial queries reduced safety by 8% (p=0.03), with urgency causing the largest drop (-1.40). Vulnerabilities appeared in seizure-related queries (33.3% inappropriate diagnosis) and post-vaccination queries. Hedging strongly correlated with safety (r=0.68, p<0.001), while emergency recognition was absent. Model scale influences safety, yet both models showed vulnerabilities to realistic parental pressures. PediatricAnxietyBench provides a reusable adversarial evaluation framework that reveals clinically significant failure modes overlooked by standard benchmarks.
Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves state-of-the-art segmentation accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for reliable monitoring and decision support in operational dialogues.
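A minimal sketch of similarity-guided example selection, assuming TF-IDF cosine similarity over a hypothetical bank of VHF snippets; the actual DASH-DTS similarity measure and prompts may differ.

```python
# Select the k stored dialogue snippets most similar to an incoming exchange,
# to serve as few-shot examples in the LLM prompt. Snippets are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_bank = [
    "vessel alpha this is port control switch to channel one two over",
    "tug bravo request permission to enter the fairway over",
    "ferry charlie we are altering course to starboard over",
]
incoming = "cargo delta this is harbour control proceed to channel one four over"

vec = TfidfVectorizer().fit(example_bank + [incoming])
sims = cosine_similarity(vec.transform([incoming]), vec.transform(example_bank))[0]
k = 2
top_k = sims.argsort()[::-1][:k]
print([example_bank[i] for i in top_k])  # examples to place in the prompt
```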