About Us
The G1Bbon benchmark was developed by Gonçalo Guiomar, Elia Torre, Mario Giulianelli and Victoria Shavina at the University of Zurich and ETH Zurich, under the supervision of Prof. Valerio Mante.
The Temporal Reasoning Task is a research platform designed to investigate how different agents, both human and artificial, solve interactive tasks. Our aim is to understand the underlying mechanisms of temporal reasoning and pattern recognition, and to develop benchmarks for evaluating the performance of language models on these tasks.

Current student members of the G1Bbon team:
- Aydin Javadov, Alice Potter, Arthur Bigot, Daniel Lozano, Ekaterina Kozachenko, Nandor Kofarago, Sarah Verreault
Semester, Master's, and PhD projects at ETH and UZH
Your thesis or small project can focus on one (or more) of the topics below:
- Expand Task Space: Develop new Temporal Reasoning Tasks (TRTs) inspired by human and primate cognitive experiments to robustly probe model reasoning capabilities.
- Adapt and Fine-tune Models: Fine-tune selected 1B-scale LLMs on human-derived behavioral data using GRPO to improve their performance.
- Behavioral Comparison: Analyze and systematically compare model performance with human and primate behavioral data to uncover similarities and differences in cognitive processing.
- Develop Interpretability Methods: Implement novel mechanistic interpretability techniques (e.g., sparse probing, layer-wise activation dynamics, chain-of-thought analysis) to provide deep insights into the evolving representations within the models.
- Optimize Model Size and Performance: Develop new techniques to train the smallest language model that achieves the highest possible performance.
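To make the interpretability direction above concrete, here is a minimal sketch of a linear probe, one of the simplest techniques in that family: fit a linear readout on a model's hidden activations to test whether a task variable is linearly decodable from them. The activations and labels below are synthetic stand-ins, not data from the G1Bbon tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 16                  # samples and hidden size (illustrative values)
labels = rng.integers(0, 2, n)  # binary task variable, e.g. pattern A vs. B

# Synthetic "activations": the label signal lives along one direction,
# buried in unit-variance noise (stand-in for a real model's hidden states).
signal = np.zeros(d)
signal[0] = 1.0
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, signal) * 2.0

# Least-squares linear probe: w = argmin_w ||acts @ w - y||^2
y = 2 * labels - 1.0
w, *_ = np.linalg.lstsq(acts, y, rcond=None)

preds = (acts @ w > 0).astype(int)
accuracy = (preds == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

In a real analysis the probe would be fit per layer (with a held-out split, and a sparsity penalty for sparse probing), and the layer-wise accuracy profile indicates where in the network the task variable becomes decodable.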
This project will allow you to explore the cognitive potential and limitations of language models, and your work will lead directly to publishable insights into temporal reasoning, model interpretability, and cross-species cognitive comparisons. You will gain hands-on experience with fine-tuning language models, deploying experiments on GPU clusters, designing sophisticated cognitive tasks, and developing state-of-the-art mechanistic interpretability methods. Moreover, your efforts will result in co-authorship on relevant publications and substantial contributions to the broader AI and cognitive science communities.
Related works / Preliminary readings:
- Binz & Schulz (2022). Using cognitive psychology to understand GPT-3
- Goldstein et al. (2025). A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations
- Barbero et al. (2024). Transformers need glasses! Information over-squashing in language tasks
- Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models