Consumer financial complaints provide a valuable source of information for identifying service failures, dispute frictions, and operational deficiencies in consumer-facing financial institutions. This paper proposes a hybrid machine learning framework for predicting monetary relief outcomes using Consumer Financial Protection Bureau complaint data. We formulate the task as an imbalanced binary classification problem, where complaints closed with monetary relief are treated as compensable outcomes. The proposed framework integrates multiple sources of predictive information, including complaint narrative text, LDA-based topic representations, interpretable text-engineered features, and structured categorical attributes such as company and state. An XGBoost classifier is trained using a temporal train-test split, with earlier complaints used for model development and more recent complaints reserved for out-of-sample evaluation. Compared with a TF-IDF baseline, the proposed framework substantially improves predictive performance, increasing AUC-ROC from 0.69 to 0.78 and improving PR-AUC under class imbalance. Feature importance analysis shows that textual signals, latent complaint topics, and company identity all contribute meaningful predictive information. In particular, company-level effects reveal systematic variation in complaint resolution patterns across financial institutions. These findings suggest that consumer complaint narratives can serve as alternative data for monitoring consumer harm, identifying firm-level operational weaknesses, and supporting early-stage risk surveillance in consumer finance.
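The temporal train-test split described above can be illustrated with a minimal sketch. The field names (`date_received`, `relief`), the cutoff date, and the toy records are assumptions made for illustration; they are not taken from the paper or from the CFPB schema.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records by date: train is strictly before cutoff, test is on or after it."""
    train = [r for r in records if r["date_received"] < cutoff]
    test = [r for r in records if r["date_received"] >= cutoff]
    return train, test


# Toy complaints; "relief" = 1 marks a complaint closed with monetary relief (hypothetical field).
complaints = [
    {"date_received": date(2021, 3, 1), "relief": 0},
    {"date_received": date(2021, 9, 15), "relief": 1},
    {"date_received": date(2022, 2, 10), "relief": 0},
    {"date_received": date(2022, 8, 5), "relief": 0},
]

train, test = temporal_split(complaints, cutoff=date(2022, 1, 1))
print(len(train), len(test))  # 2 2
```

Splitting by date rather than at random keeps every test complaint later than every training complaint, which is what makes the evaluation out-of-sample in time.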
This study explores gender differences in research topic choice and methodology among collaborating scholars. Previous studies have mostly examined gender differences in research topics or methods at the level of individual scholars, without considering collaborating groups, which limits their depth and practical guidance. Taking Library and Information Science (LIS) as an example, this study employs the Top2Vec method for topic identification and the CogFT model for research method classification. It systematically analyzes 25,204 papers published between 1990 and 2022 to investigate gender differences in the convergence of research topics and method choices among collaborating scholars in this field. The results show that female scholars exhibited lower convergence in their research methods and topic choices than male scholars. This study uses a relatively systematic methodology to address the difficulty of studying gender differences in academic publishing, and is expected to serve as a reference for other disciplines and research questions. It also highlights how gender differences manifest in collaborative research and provides insights into the convergence and diversity of research topics and methods chosen by scholars.
Academic age critically shapes career development, influencing research behavior, output volume, and methodological choices. Analyzing method variation across academic ages offers a new theoretical lens on scholarly evolution and provides early-career researchers with practical guidance for method selection. A corpus of 26,677 articles published 1990-2023 in 14 authoritative Library and Information Science journals was compiled. The CogFT model automatically classified the research methods embedded in these articles, and Top2Vec generated the topic model. This process resulted in a comprehensive dataset linking research methods with topics. Author-name disambiguation enabled calculation of each scholar's academic age. Popularity and Shannon diversity indices for methods, together with topic diversity, were compared across academic age groups. Results reveal dynamic methodological trends: the share of theoretical approaches declined gradually, whereas experimental and bibliometric methods gained ground. Method popularity differs significantly among cohorts. Mid-career scholars exhibit the highest method diversity; late-career scholars the lowest.
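As a point of reference for the Shannon diversity index mentioned above, the standard definition is H = -sum(p_i * ln p_i), where p_i is the share of method i within a group. The sketch below assumes the natural logarithm and uses invented method labels; the abstract does not state the log base or the exact grouping used.

```python
import math
from collections import Counter


def shannon_diversity(labels):
    """Shannon index H = -sum(p_i * ln(p_i)) over the label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical research-method labels for the papers of one academic age group.
methods = ["experiment", "survey", "bibliometrics", "experiment"]
print(round(shannon_diversity(methods), 4))  # 1.0397
```

A group that uses a single method scores 0, and the index grows as methods become more numerous and more evenly used.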
The open-source ecosystem on GitHub lacks a systematic hierarchical taxonomy of software repositories. GitHub Topics, the dominant organizational mechanism, is flat, inconsistent, and covers only 67% of projects. We present ATLAS, the first framework that automatically constructs a hierarchical taxonomy for software repositories and classifies projects into it end-to-end. By combining LLM global knowledge with real repository distributions, ATLAS proposes meaningful splitting dimensions and iteratively corrects those that fail to accommodate real projects. A Designer Agent proposes splitting dimensions while a Classifier Agent assigns repositories; a self-corrective refinement loop uses classification failures to drive dimension revision through escalating strategies. We evaluate ATLAS on 54,387 GitHub repositories against six baselines spanning four paradigms, two downstream tasks, and three model families. On a stratified 2,001-repository benchmark, ATLAS achieves a Taxonomy Quality F-score (TQF) of 83.13%, outperforming the best baseline by 15 percentage points (on the full 54k corpus the approximate TQF is 73.0%, a gap driven by Path Granularity's all-or-nothing scoring on longer paths rather than lower classification accuracy). It is the only method to simultaneously achieve high structural quality and high practical applicability. On downstream tasks, ATLAS enables alternative discovery with P@1 = 85.71%, surpassing even human-curated lists (62.34%), and achieves the highest P@1 for repository retrieval. The taxonomy further reveals structural ecosystem trends that are difficult to obtain from flat tags or similarity methods: the shift from libraries to AI/ML applications (now 61% of newly community-adopted projects) becomes visible only through hierarchical, type-based categorization. An interactive taxonomy explorer is available at https://atlas-taxonomy.netlify.app/
Meeting archives are difficult to search when users remember what was discussed but not when. We study topic-to-timestamp alignment: given a natural-language topic and a timestamped meeting transcript, the goal is to return the time at which the topic is discussed. A standard RAG setup can retrieve relevant transcript excerpts, but still asks the language model to generate a timestamp, which can produce unsupported or invalid timecodes. We therefore recast timestamp prediction as constrained temporal candidate selection: the system retrieves timestamped transcript chunks, and the model selects the candidate that best grounds the topic instead of generating a timecode. On 420 topic-timestamp queries from 200 municipal meeting transcripts, this increases Recall@5 from 31.9% to 50.0%, reduces MAE from 837.0 seconds to 761.0 seconds with Mistral-7B-Instruct, and increases the number of parseable outputs from 373 to 419 of 420 queries. The results suggest that temporal grounding in long transcripts depends strongly on retrieval quality and output design, not only on the choice of the language model.
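The idea of selecting among retrieved candidates instead of generating a timecode can be sketched as follows. This is an illustration under stated assumptions: the token-overlap scorer stands in for the language model's choice, and the field names (`start_sec`, `text`) and the toy transcript are hypothetical.

```python
def select_timestamp(topic, candidates, score):
    """Return the start time of the best-scoring candidate chunk.

    The result is always the timestamp of a retrieved chunk, so it cannot be
    an unparseable or unsupported timecode.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(topic, c["text"]))
    return best["start_sec"]


def token_overlap(topic, text):
    """Stand-in scorer (assumption): number of lowercase tokens shared by topic and text."""
    return len(set(topic.lower().split()) & set(text.lower().split()))


def mean_absolute_error(predicted, gold):
    """MAE in seconds between predicted and reference timestamps."""
    pairs = list(zip(predicted, gold))
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


# Toy retrieved chunks from one meeting transcript (hypothetical data).
chunks = [
    {"start_sec": 120, "text": "roll call and approval of minutes"},
    {"start_sec": 1845, "text": "discussion of the downtown parking ordinance"},
    {"start_sec": 3010, "text": "public comment on park funding"},
]

picked = select_timestamp("downtown parking ordinance", chunks, token_overlap)
print(picked)                                 # 1845
print(mean_absolute_error([picked], [1800]))  # 45.0
```

Because the output space is restricted to the candidate timestamps, every answer is grounded in a retrieved chunk by construction; the quality of the answer then depends on retrieval and on the selection step.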
Advanced Air Mobility (AAM) is an emerging low-altitude air transportation system whose successful deployment depends not only on technological advancement but also on public acceptance. This acceptance will drive government support, regulations, noise standards, and willingness to fly, and in turn the overall commercial viability of AAM. Understanding public sentiment toward AAM is therefore essential for identifying its societal barriers and informing its adoption strategies. This study analyzes 306,009 human-generated texts collected from Reddit and Quora to examine public discourse on AAM using AI-based models. Because multiple sentiment analysis models exist, identifying the most accurate model is critical for reliable AAM sentiment prediction and trustworthy public opinion analysis. Accordingly, seven models spanning lexicon-based, machine learning, deep learning, and transformer-based approaches are evaluated for AAM-specific sentiment classification. ModernBERT achieves the best classification performance and is used to label the full dataset. Using the resulting sentiment labels, Latent Dirichlet Allocation (LDA) is applied within each sentiment class to uncover latent topics in public opinion. The analysis identifies 20 distinct topics and traces their temporal evolution from 2008 to 2025. A cross-sentiment topic analysis further reveals six major clusters of public concern: workforce and skill development (25.29% of the dataset), regulation and compliance (24.64%), technical performance of drones (20.99%), military, geopolitics, and defense (14.58%), safety and operational risks (8.51%), and noise and disturbance (5.98%). Based on these findings, this study provides actionable strategies to address these concerns, thereby improving public acceptance and supporting AAM deployment.
Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication to control the balance of the data distribution across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction is the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.
As individuals turn to the Internet to find answers to questions they may have, several Question Answering (QA) forums have evolved, where users knowledgeable in certain topics can contribute their expertise to answering these requests for information. While these are currently volunteer based, we consider a future version employing knowledge workers who are experts in certain topics. In such a system, the request-answer processes forming the queuing system may utilize schedulers that assign requests in different topics to the experts in the forum, who may be able to answer them according to their expertise levels in different topics. With this model, we calculate the capacity of the system for handling the requests while keeping the system stable, and design schedulers that achieve capacity. We also investigate how collaboration between experts in answering requests can potentially increase capacity.
Generating high-utility synthetic data for intent classification typically requires human-annotated seed data, which is often unavailable in fast-paced industrial settings. In this paper, we propose a framework for synthetic dialogue generation that works entirely without human-annotated data, relying solely on intent definitions. The framework uses two types of attributes, topic and style, to improve data diversity. We also propose two novel post-hoc stylization models, called Univ and Exam, that transform synthetic LLM-generated utterances into more varied, human-like linguistic styles. To enhance data quality, we apply an LLM-as-a-judge filtering process. Experimental results on both industrial and public datasets demonstrate that the proposed approach achieves up to 93.3% of the performance obtained using human-annotated training data. Crucially, the findings reveal that style diversity is more critical than topic diversity for synthetic data utility, as it prevents models from learning spurious stylistic correlations. Furthermore, the study shows that incorporating style attributes during the generation process is more effective than post-hoc style adaptation.
Vector hubness, where a few points become nearest neighbors of many queries, creates a poisoning risk in retrieval-augmented generation (RAG): one injected document can influence unrelated requests. Existing defenses use periodic reverse-kNN scans, leaving an exposure window and repeated corpus-wide work. We study admission-time control, scoring each candidate against sentinel queries and quarantining hub-like documents before insertion. Across two 100,000-document corpora, five encoders, and disjoint attacker and defender query sets, a global gate achieves recall 1.0 at the decisive embedding-space point (>=0.92 across the effective range) and 0.91 +/- 0.07 on HotFlip attacks, with 1% false positives on general documents. A per-topic gate provides no reliable benefit, consistent with anisotropy coupling local and global visibility. Thresholds are maintained incrementally, with corpus-size-independent insertion cost and amortized deletion cost. On HNSW, admission adds about 3.1% to ingestion latency, scoring remains flat to 10^6 vectors, and 1.2% of decisions flip under approximate indexing, none involving attacks. Provenance complements the gate for natural or tight-domain hubs.
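One way to picture admission-time control is the sketch below. It is an illustration under assumptions, not the paper's scoring rule: each sentinel query is assumed to carry the similarity of its current k-th nearest neighbor, a candidate "hits" a sentinel when it would enter that top-k, and a candidate is quarantined when its hit fraction exceeds a cap. The names `hub_score`, `admit`, and `max_hub_score`, the cap of 0.5, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hub_score(doc_vec, sentinels, kth_sims):
    """Fraction of sentinel queries whose current top-k the candidate would enter."""
    hits = sum(1 for q, t in zip(sentinels, kth_sims) if cosine(doc_vec, q) > t)
    return hits / len(sentinels)


def admit(doc_vec, sentinels, kth_sims, max_hub_score=0.5):
    """Admit the document unless it looks hub-like (assumed cap on the hit fraction)."""
    return hub_score(doc_vec, sentinels, kth_sims) <= max_hub_score


# Three toy sentinel queries, each with the similarity of its current k-th neighbor.
sentinels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kth_sims = [0.5, 0.5, 0.5]

ordinary_doc = [1.0, 0.1, 0.0]  # close to one sentinel only
hub_like_doc = [1.0, 1.0, 1.0]  # moderately close to every sentinel

print(admit(ordinary_doc, sentinels, kth_sims))  # True
print(admit(hub_like_doc, sentinels, kth_sims))  # False
```

In this sketch the per-document cost depends only on the number of sentinel queries, not on corpus size, which matches the spirit of checking a document before insertion rather than rescanning the corpus afterwards.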