Tutor: Sommer

Wang et al. demonstrated that a language model can improve on unseen mathematical
problems when trained on a single repeated example using Group Relative
Policy Optimisation (GRPO). This thesis reproduces that result and extends the
one-shot Reinforcement Learning with Verifiable Rewards (RLVR) setting to two
alternative post-training paradigms: Evolution Strategies and Self-Distillation Policy
Optimisation (SDPO). The goal is to examine whether the one-shot effect is specific
to GRPO or transfers to optimisation methods that turn the same binary verifier
reward into a training signal through a different mechanism.
Room 04.137, Martensstr. 3, Erlangen
or
Zoom:
https://fau.zoom-x.de/j/68350702053?pwd=UkF3aXY0QUdjeSsyR0tyRWtLQ0hYUT09
Meeting ID: 683 5070 2053
Passcode: 647333