Abstract:Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78%). Our best-performing model, a three-component ensemble (ENS_avg, ENS_ISO, XGB) selected by validation F0.5, achieves F0.5 = 0.097 and AP = 0.037 (95% CI: 0.024-0.072; 4.7x lift over random) on the private held-out test set (103 positives). A paired bootstrap confirms a statistically credible advantage over the logistic regression baseline (AP delta: +0.013, 95% CI: [0.004, 0.039], p < 0.001; F0.5 delta: +0.056, 95% CI: [0.006, 0.122], p = 0.016). Validation-set metrics (F0.5 = 0.284, AP = 0.126) reflect best-of-144 selection bias on 53 positives and are reported for benchmark reproducibility only. We further evaluate three zero-shot Gemini models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro) in an anonymized numerical setting. The best LLM achieves AP = 0.034 (Gemini 3 Flash), below the LR baseline AP of 0.044. Notably, the most capable Gemini variant (Gemini 3.1 Pro, AP = 0.023) performs worst -- an unexpected pattern that warrants further investigation across providers and prompting strategies. Both ML and LLM models show the same temporal performance decay tracking the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise. PHBench provides a reproducible framework comprising public training, validation, and blind test splits; 61 engineered features; a five-metric evaluation harness; and a public leaderboard at https://phbench.com. All code, baseline models, and anonymized dataset splits are publicly available.
Abstract:Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.




Abstract:Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by a factor of 1.7x, while tier-1 firms are 2.9x better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.
Abstract:Predicting startup success requires models that are both accurate and interpretable. We present a lightweight ensemble framework that combines YES/NO questions generated by large language models (LLMs), forming a transparent decision-making system. Each question acts as a weak heuristic, and by filtering, ranking, and aggregating them through a threshold-based voting mechanism, we construct a strong ensemble predictor. On a test set where 10% of startups are classified as successful, our approach achieves a precision rate of 50%, representing a 5x improvement over random selection, while remaining fully transparent. When we incorporate expert-guided heuristics into the generation process, performance improves further to 54% precision. These results highlight the value of combining LLM reasoning with human insight and demonstrate that simple, interpretable ensembles can support high-stakes decisions in domains such as venture capital (VC).