Abstract: Fine-tuning is integral to aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage a held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and one online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all four of our strategies outperform the current MRPO weighting methods in preference accuracy on UltraFeedback and SafeRLHF. More thought-provokingly, however, we find that single-reference DPO, using any of six of the seven references, consistently outperforms all tested multiple-reference approaches, calling into question the practical appeal of multiple-reference approaches.
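To make the bandit view of reference weighting concrete, the following is a minimal sketch of Beta-Bernoulli Thompson Sampling over $K$ arms, one arm per reference model. It is not the paper's method: the abstract does not specify the reward signal or how weights are derived from the posterior. The binary reward, the class name `ReferenceBandit`, the `weights` rule (normalized posterior means), and the `simulate` helper with its success probabilities are all hypothetical choices made for illustration.

```python
import random


class ReferenceBandit:
    """Beta-Bernoulli Thompson Sampling over K arms (illustrative sketch only)."""

    def __init__(self, k, seed=0):
        # Beta(1, 1) prior (uniform) on each arm's unknown success probability.
        self.alpha = [1.0] * k
        self.beta = [1.0] * k
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample per arm from its posterior and play the largest draw.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        # ASSUMPTION: reward is binary, e.g. 1 if a validation check passed.
        if reward:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0

    def weights(self):
        # ASSUMPTION: mixture weights taken as normalized posterior means.
        means = [a / (a + b) for a, b in zip(self.alpha, self.beta)]
        total = sum(means)
        return [m / total for m in means]


def simulate(true_probs, rounds, seed=0):
    """Run the bandit against synthetic Bernoulli rewards (hypothetical data)."""
    bandit = ReferenceBandit(len(true_probs), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, env.random() < true_probs[arm])
    return bandit


if __name__ == "__main__":
    b = simulate([0.2, 0.5, 0.8], rounds=500)
    print([round(w, 3) for w in b.weights()])
```

In an actual training loop, the synthetic reward in `simulate` would be replaced by whatever feedback signal the weighting strategy observes; the sketch only shows the sample, select, and update cycle that defines Thompson Sampling.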
Abstract: Most evolutionary algorithms (EAs) used in practice employ crossover. In contrast, a runtime advantage from crossover has been proven mathematically only for a few, mostly artificial, examples. The most convincing such result shows that the $(\mu+1)$ genetic algorithm (GA) with population size $\mu=O(n)$ optimizes jump functions with gap size $k \ge 3$ in time $O(n^k / \mu + n^{k-1}\log n)$, beating the $\Theta(n^k)$ runtime of many mutation-based EAs. This result builds on a proof that the GA occasionally, and then for an expected number of $\Omega(\mu^2)$ iterations, has a population that is not dominated by a single genotype. In this work, we show that this diversity persists with high probability for a time exponential in $\mu$ (instead of quadratic). From this better understanding of the population diversity, we obtain stronger runtime guarantees, among them the statement that for all $c\ln(n)\le\mu \le n/\log n$, with $c$ a suitable constant, the runtime of the $(\mu+1)$ GA on $\mathrm{Jump}_k$, with $k \ge 3$, is $O(n^{k-1})$. Consequently, already with logarithmic population sizes, the GA gains a speed-up of order $\Omega(n)$ from crossover.
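For readers unfamiliar with the setting, here is a small sketch of the $\mathrm{Jump}_k$ benchmark and a $(\mu+1)$ GA, under stated assumptions. The fitness function follows the standard definition of jump functions. The algorithmic details are not given in the abstract, so uniform crossover, standard bit mutation with rate $1/n$, uniform parent selection, uniform tie-breaking when removing a worst individual, and the crossover probability `p_c` are assumed here and may differ from the variant analyzed in the paper.

```python
import random


def jump(x, k):
    """Jump_k fitness of a bit list x (standard definition, to be maximized)."""
    n, ones = len(x), sum(x)
    if ones <= n - k or ones == n:
        return k + ones
    return n - ones  # inside the gap: fitness decreases towards the optimum


def mu_plus_one_ga(n, k, mu, p_c=1.0, max_iters=200_000, seed=0):
    """(mu+1) GA sketch; returns (iterations used, final population)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for t in range(1, max_iters + 1):
        if rng.random() < p_c:
            # ASSUMPTION: uniform crossover of two uniformly chosen parents.
            a, b = rng.choice(pop), rng.choice(pop)
            child = [rng.choice((ai, bi)) for ai, bi in zip(a, b)]
        else:
            child = list(rng.choice(pop))
        # Standard bit mutation: flip each bit independently with prob. 1/n.
        child = [bit ^ (rng.random() < 1.0 / n) for bit in child]
        pop.append(child)
        # Remove one worst individual, breaking ties uniformly at random.
        fits = [jump(x, k) for x in pop]
        worst = min(fits)
        pop.pop(rng.choice([i for i, f in enumerate(fits) if f == worst]))
        if any(sum(x) == n for x in pop):
            return t, pop  # the all-ones optimum is in the population
    return max_iters, pop


if __name__ == "__main__":
    iters, _ = mu_plus_one_ga(n=12, k=3, mu=6)
    print("iterations used:", iters)
```

Setting `p_c=0.0` turns the sketch into a mutation-only $(\mu+1)$ EA, which gives a simple way to compare runs with and without crossover on small instances. Such runs are only illustrative and say nothing about the asymptotic bounds stated above.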