Abstract:Accurate prediction of evacuation behavior is critical for disaster preparedness, yet models trained in one region often fail elsewhere. Using a multi-state hurricane evacuation survey, we show this failure goes beyond feature distribution shift: households with similar characteristics follow systematically different decision patterns across states. As a result, single global models overfit dominant responses, misrepresent vulnerable subpopulations, and generalize poorly across locations. We propose Population-Adaptive Symbolic Mixture-of-Experts (PASM), which pairs large language model guided symbolic regression with a mixture-of-experts architecture. PASM discovers human-readable closed-form decision rules, specializes them to data-driven subpopulations, and routes each input to the appropriate expert at inference time. On Hurricanes Harvey and Irma data, transferring from Florida and Texas to Georgia with 100 calibration samples, PASM achieves a Matthews correlation coefficient of 0.607, compared to XGBoost (0.404), TabPFN (0.333), GPT-5-mini (0.434), and meta-learning baselines MAML and Prototypical Networks (MCC $\leq$ 0.346). The routing mechanism assigns distinct formula archetypes to subpopulations, so the resulting behavioral profiles are directly interpretable. A fairness audit across four demographic axes finds no statistically significant disparities after Bonferroni correction. PASM closes more than half the cross-location generalization gap while keeping decision rules transparent enough for real-world emergency planning.
Abstract:Accurate greenhouse gas (GHG) emission reporting is critical for governments, businesses, and investors. However, adoption remains limited particularly among small and medium enterprises due to high implementation costs, fragmented emission factor databases, and a lack of robust sector classification methods. To address these challenges, we introduce Group Reasoning Emission Estimation Networks (GREEN), an AI-driven carbon accounting framework that standardizes enterprise-level emission estimation, constructs a large-scale benchmark dataset, and leverages a novel reasoning approach with large language models (LLMs). Specifically, we compile textual descriptions for 20,850 companies with validated North American Industry Classification System (NAICS) labels and align these with an economic model of carbon intensity factors. By reframing sector classification as an information retrieval task, we fine-tune Sentence-BERT models using a contrastive learning loss. To overcome the limitations of single-stage models in handling thousands of hierarchical categories, we propose a Group Reasoning method that ensembles LLM classifiers based on the natural NAICS ontology, decomposing the task into multiple sub-classification steps. We theoretically prove that this approach reduces classification uncertainty and computational complexity. Experiments on 1,114 NAICS categories yield state-of-the-art performance (83.68% Top-1, 91.47% Top-10 accuracy), and case studies on 20 companies report a mean absolute percentage error (MAPE) of 45.88%. The project is available at: https://huggingface.co/datasets/Yvnminc/ExioNAICS.
Abstract:Household and individual-level sociodemographic data are essential for understanding human-infrastructure interaction and policymaking. However, the Public Use Microdata Sample (PUMS) offers only a sample at the state level, while census tract data only provides the marginal distributions of variables without correlations. Therefore, we need an accurate synthetic population dataset that maintains consistent variable correlations observed in microdata, preserves household-individual and individual-individual relationships, adheres to state-level statistics, and accurately represents the geographic distribution of the population. We propose a deep generative framework leveraging the variational autoencoder (VAE) to generate a synthetic population with the aforementioned features. The methodological contributions include (1) a new data structure for capturing household-individual and individual-individual relationships, (2) a transfer learning process with pre-training and fine-tuning steps to generate households and individuals whose aggregated distributions align with the census tract marginal distribution, and (3) decoupled binary cross-entropy (D-BCE) loss function enabling distribution shift and out-of-sample records generation. Model results for an application in Delaware, USA demonstrate the ability to ensure the realism of generated household-individual records and accurately describe population statistics at the census tract level compared to existing methods. Furthermore, testing in North Carolina, USA yielded promising results, supporting the transferability of our method.