Language models in the wild

The G1Bbon benchmark is built on a set of temporal reasoning tasks designed to test decision-making across human players, optimal agents, and language models. Its ultimate goal is to advance understanding of the parallels between biological and artificial intelligence by uncovering interpretable mechanisms that underlie temporal reasoning and pattern recognition. It also provides a platform for evaluating language models and improving their agentic capabilities in stochastic environments.