LLM Game Bench

Benchmark AI against classic video games

The first open-source framework for evaluating AI visual understanding, reasoning, and decision-making through classic game environments.

HOW IT WORKS

Three components work together to let an AI play Pokémon Red:

Emulator with Lua Script

The mGBA emulator runs Pokémon Red and uses a Lua script to capture screenshots, send them to the controller, receive button press commands, and execute them in the game.

Python Controller

Acts as a bridge between the emulator and the LLM. It manages the screenshots and the notepad (the game memory), and sends commands back to the emulator.
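
As a rough illustration of the bridge role described above, here is a minimal sketch of one controller step in Python. It is not the project's actual code: the Controller class, the three callables, the button names, and the reply dictionary with "button" and "note" keys are all assumptions made for this example, and the transport to the emulator is left abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical button names for illustration; the real command format may differ.
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"}


@dataclass
class Controller:
    """Bridges an emulator and an LLM (all three callables are assumed interfaces)."""

    get_screenshot: Callable[[], bytes]      # asks the emulator for the current frame
    ask_llm: Callable[[bytes, str], dict]    # sends frame + notepad text, gets a reply
    press_button: Callable[[str], None]      # forwards a button press to the emulator
    notepad: List[str] = field(default_factory=list)  # the game memory

    def step(self) -> str:
        """Run one screenshot -> decision -> button press cycle."""
        screenshot = self.get_screenshot()
        reply = self.ask_llm(screenshot, "\n".join(self.notepad))

        # Normalize and validate the button before it reaches the emulator.
        button = str(reply.get("button", "")).upper()
        if button not in VALID_BUTTONS:
            raise ValueError(f"unsupported button: {button!r}")

        # Keep any note the model wants to remember for later steps.
        note = reply.get("note")
        if note:
            self.notepad.append(str(note))

        self.press_button(button)
        return button


if __name__ == "__main__":
    # Stub emulator and LLM so the sketch runs without mGBA or an API key.
    pressed: List[str] = []
    controller = Controller(
        get_screenshot=lambda: b"fake-png-bytes",
        ask_llm=lambda image, notes: {"button": "a", "note": "At the title screen."},
        press_button=pressed.append,
    )
    controller.step()
    print(pressed, controller.notepad)
```

Passing the emulator and LLM in as plain callables keeps the loop testable with stubs, which is one way such a controller could be exercised without a running emulator.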

LLM Provider

The "brain" of the system: it analyzes screenshots, decides what to do next, and tracks its progress and goals in the notepad.
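
To make the decision step above more concrete, the sketch below shows one way a text prompt could be built from the notepad and a model reply turned into a button press plus a notepad update. The JSON reply shape with "button" and "notepad" keys, the function names build_prompt and parse_reply, and the button list are hypothetical choices for this example, not the project's documented format. Sending the screenshot itself is provider-specific and is omitted here.

```python
import json
from typing import Optional, Tuple

# Hypothetical button names for illustration only.
ALLOWED_BUTTONS = ("A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT")


def build_prompt(notepad: str) -> str:
    """Build the text part of the request; the screenshot would be attached separately."""
    return (
        "You are playing Pokémon Red. Look at the screenshot and choose one button.\n"
        f"Allowed buttons: {', '.join(ALLOWED_BUTTONS)}\n"
        f"Your notepad so far:\n{notepad}\n"
        'Reply with JSON like {"button": "A", "notepad": "what to remember"}.'
    )


def parse_reply(reply_text: str) -> Tuple[str, Optional[str]]:
    """Extract (button, notepad_update) from a model reply.

    Models often wrap JSON in extra prose, so only the outermost
    {...} span is parsed. Raises ValueError on anything unusable.
    """
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")

    # json.JSONDecodeError is a subclass of ValueError, so malformed JSON
    # surfaces as the same exception type as the checks in this function.
    data = json.loads(reply_text[start:end + 1])

    button = str(data.get("button", "")).strip().upper()
    if button not in ALLOWED_BUTTONS:
        raise ValueError(f"unsupported button: {button!r}")

    note = data.get("notepad")
    return button, (str(note) if note else None)


if __name__ == "__main__":
    reply = 'I will head north. {"button": "up", "notepad": "Heading north to the exit."}'
    print(parse_reply(reply))
```

Validating the reply before acting on it means a malformed or off-list answer can be retried or logged instead of being sent to the emulator.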

All benchmarks are reproducible and open-source. Run them yourself or contribute to the project!

VIEW CODE ON GITHUB

LATEST UPDATES

March 10, 2025

Project Launch: LLM Game Bench

We're excited to announce the launch of LLM Game Bench, starting with our Pokémon Red benchmark. Learn how we're testing AI visual understanding and decision-making.