Colloquium lecture: June 2, 2026, 11:00 a.m., One-Shot RLVR Generalisation: A Compact Comparison of GRPO, Evolution Strategies, and SDPO


Tutor: Sommer


Wang et al. demonstrated that a language model can improve on unseen mathematical
problems when trained on a single repeated example using Group Relative
Policy Optimisation (GRPO).

This thesis reproduces that result and extends the
one-shot Reinforcement Learning with Verifiable Rewards (RLVR) setting to two
alternative post-training paradigms: Evolution Strategies and Self-Distillation Policy
Optimisation (SDPO). The goal is to examine whether the one-shot effect is specific
to GRPO or also transfers to optimisation methods that turn the same binary verifier
reward into a learning signal through a different mechanism.


Room 04.137, Martensstr. 3, Erlangen

or

Zoom:
https://fau.zoom-x.de/j/68350702053?pwd=UkF3aXY0QUdjeSsyR0tyRWtLQ0hYUT09

Meeting ID: 683 5070 2053
Passcode: 647333