Colloquium lecture: Training Paradigms for Tool-Calling Large Language Models, April 21, 2026, 11 a.m. (Tutor: Sommer)

As AI systems shift from pure text generation toward tool-augmented reasoning, getting smaller language models to reliably call external tools has become a key challenge. This thesis investigates how compact, affordable models can be trained for structured tool-calling in mathematical problem solving, using Qwen models at 0.5B and 1.7B parameters. Three training paradigms are compared: Group Relative Policy Optimization (GRPO), a policy-gradient method; Evolution Strategies (ES), a gradient-free approach that perturbs model weights directly; and Self-Distillation Policy Optimization (SDPO), a self-distillation technique in which a feedback-informed copy of the model acts as its own teacher, guiding it toward better tool calls based on prior mistakes. All methods share a reward function that evaluates output format, successful tool execution, and numerical accuracy. ES converges fastest and is the most memory-efficient, since it avoids backpropagation entirely. GRPO yields meaningful gains across both model sizes, though it requires a longer and less stable training phase when starting from a weaker base model. SDPO consistently achieves perfect tool-calling accuracy but plateaus below full mathematical correctness across all configurations, a limitation tied to its reliance on successful rollouts as the teacher signal, which become too sparse on harder problem types to drive complete learning. Low-Rank Adaptation (LoRA) fine-tuning is also explored across all methods to reduce memory overhead while retaining performance. Overall, the work shows that choosing the right training approach matters as much as model size when the goal is efficient tool-calling under real-world hardware constraints.
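The abstract does not specify how the shared reward is computed; a minimal sketch of a reward over format, tool execution, and numerical accuracy might look like the following (the tag pattern, tolerance, and equal weighting are illustrative assumptions, not the thesis's actual implementation):

```python
import re

def reward(completion: str, tool_succeeded: bool, tool_result: float | None,
           reference_answer: float) -> float:
    """Score one rollout on output format, tool execution, and accuracy."""
    score = 0.0

    # 1) Output format: did the model emit a parsable tool-call block?
    #    (tag names are illustrative, not the thesis's format)
    if re.search(r"<tool_call>.*</tool_call>", completion, flags=re.DOTALL):
        score += 1.0

    # 2) Tool execution: did the called tool run without error?
    if tool_succeeded:
        score += 1.0

    # 3) Numerical accuracy: does the returned result match the reference?
    if tool_result is not None and abs(tool_result - reference_answer) < 1e-6:
        score += 1.0

    return score / 3.0  # normalized to [0, 1]
```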

Room: 04.137, Martensstr. 3, Erlangen

Zoom meeting:
https://fau.zoom-x.de/j/68350702053?pwd=UkF3aXY0QUdjeSsyR0tyRWtLQ0hYUT09

Meeting-ID: 683 5070 2053
Passcode: 647333