June 2, 2026

Colloquium lecture: June 23, 2026, 10:15 a.m., Self-Play Evolution: Beyond Data and Gradients

Supervisor: Sommer

Large language models (LLMs) have made rapid progress on mathematical reasoning, but that progress has depended heavily on human-curated training data: problem sets, worked solutions, and graded examples. This dependence is one of two distinct bottlenecks limiting the scalability of LLM reasoning improvement.

The first is the data bottleneck: improving the model beyond the difficulty ceiling of the
available data requires harder data, which requires further human effort. The second
is the optimisation bottleneck: the dominant parameter-update method for reasoning
training, policy-gradient reinforcement learning [1], suffers from high-variance
gradient estimates and the substantial memory overhead of storing activations for
backpropagation, both of which scale poorly to longer training horizons and larger
models.
Self-play addresses the data bottleneck by allowing the model to generate its
own training curriculum through adversarial interaction between two model instances,
with a Challenger producing problems and a Solver attempting them [2], [3].
What remains unsettled is which parameter-update method best translates the
self-generated curriculum into capability improvement, and in particular whether
the second bottleneck can be addressed simultaneously. This thesis isolates the
optimiser as the experimental variable across four paradigms trained inside an identical
R-Zero-style self-play framework: Supervised Fine-Tuning (SP-SFT), Group
Relative Policy Optimisation (SP-GRPO) [1], Self-Distillation Policy Optimisation
(SP-SDPO) [4], and Evolution Strategies (SP-ES) [5], [6]. Three of these paradigms
— SP-GRPO, SP-SDPO, and SP-ES — train on self-generated data alone with no
external examples at any stage; SP-SFT, as an imitation-only baseline, additionally
requires a small set of grade-school anchor examples for training stability, and is
interpreted accordingly. SP-ES is the only paradigm in this set that conceptually
addresses both bottlenecks at once: by estimating parameter updates from scalar
fitness evaluations of randomly perturbed LoRA copies, it removes the need for
backpropagation entirely while inheriting the data-scaling properties of the shared
self-play framework. A fifth variant, SP-GRPO-TC, extends GRPO with calculator
tool calling on a Qwen2.5-7B-Instruct base and is evaluated separately.
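
The SP-ES update described above can be illustrated with a minimal sketch of an antithetic Evolution Strategies step. It is not the thesis implementation: the flat parameter list stands in for the LoRA weights, `toy_fitness` is a hypothetical stand-in for a benchmark score, and the values of `sigma`, `lr`, and `num_pairs` are illustrative assumptions. Ten pairs per step are chosen only so that eight steps give the 80 paired fitness evaluations quoted later in this abstract.

```python
import random


def es_step(theta, fitness, rng, num_pairs=10, sigma=0.05, lr=0.1):
    """One antithetic ES update on a flat list of parameters.

    theta   : stand-in for the flattened LoRA weights (assumption)
    fitness : callable mapping a parameter list to a scalar score
    """
    grad_est = [0.0] * len(theta)
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = [t + sigma * e for t, e in zip(theta, eps)]
        minus = [t - sigma * e for t, e in zip(theta, eps)]
        # Each mirrored pair needs only two scalar fitness values,
        # so no activations are stored and nothing is backpropagated.
        diff = fitness(plus) - fitness(minus)
        for i, e in enumerate(eps):
            grad_est[i] += diff * e
    scale = 1.0 / (2.0 * sigma * num_pairs)
    # Move the parameters uphill along the estimated fitness gradient.
    return [t + lr * scale * g for t, g in zip(theta, grad_est)]


TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum of the toy problem


def toy_fitness(params):
    """Hypothetical fitness: negative squared distance to TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def run_demo(steps=8, num_pairs=10, seed=0):
    """Run a few ES steps; return final parameters and pairs evaluated."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        theta = es_step(theta, toy_fitness, rng, num_pairs=num_pairs)
    return theta, steps * num_pairs


if __name__ == "__main__":
    final_theta, pairs = run_demo()
    print(pairs, toy_fitness(final_theta))
```

Because the two members of each pair are independent of one another and of all other pairs, their fitness evaluations could in principle be distributed across devices; the sketch runs them sequentially for simplicity.
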
Under matched conditions — the same Qwen2.5-3B base, the same LoRA configuration,
the same 1500-sample generation budget per phase, and the same five
self-play iterations — the two reward-based paradigms dominate the math benchmarks:
SP-GRPO gains 11.98 percentage points on GSM8K and SP-SDPO 9.62, against
2.95 for SP-ES and 1.29 for SP-SFT. The same ordering holds on MATH500. On three
out-of-distribution multiple-choice benchmarks (ARC-Challenge, HellaSwag, MMLU-STEM),
all four variants remain within ±0.6 percentage points of base, suggesting
that the math gains do not come at the cost of general capability. SP-GRPO-TC
achieves gains of 4.44 percentage points on MultiArith, 1.74 on GSM-Hard, and
1.00 on SVAMP, with improvement concentrated on benchmarks where multi-step
arithmetic computation is the binding bottleneck.
The results point to several observations worth highlighting. Most notably, the
choice of optimisation method appears to be the dominant constraint on what self-play
can achieve: four paradigms sharing the same curriculum-generation logic, base
model, and data budget differ by roughly an order of magnitude in GSM8K gain
(11.98 versus 1.29 percentage points), depending solely on how they update parameters. The multiple-choice results
also suggest that effective self-play does not require sacrificing general capability —
the curriculum’s frontier filter acts as an implicit regulariser preserving base-model
performance outside the trained domain. The SP-ES underperformance on GSM8K,
meanwhile, is better characterised as compute-limited than as method-limited:
eight ES steps per phase on a single GPU yield only 80 paired fitness evaluations, far
fewer effective updates than the policy-gradient paradigms accumulate over the same
phase. SP-ES therefore remains the only paradigm in this set with the architectural
property of addressing both bottlenecks simultaneously, even though the present
single-GPU experiments cannot demonstrate its empirical competitiveness. Equal-compute
comparisons on multi-GPU systems, where antithetic perturbations can be
evaluated in parallel [6], [7], are the natural next test of whether the conceptual
contribution translates into practical performance.

Room 04.137, Martensstr. 3, Erlangen

or

Join Zoom meeting:
https://fau.zoom-x.de/j/68350702053?pwd=UkF3aXY0QUdjeSsyR0tyRWtLQ0hYUT09

Meeting-ID: 683 5070 2053
Passcode: 647333

Last update: 2026-06-02 - 16:26