LLM Game Bench
Benchmark AI against classic video games
The first open-source framework for evaluating AI visual understanding, reasoning, and decision-making through classic game environments.
HOW IT WORKS
Three components work together to let an AI play Pokémon Red:
Emulator with Lua Script
The mGBA emulator runs Pokémon Red and uses a Lua script to capture screenshots, send them to the controller, receive button press commands, and execute them in the game.
Python Controller
Acts as a bridge between the emulator and the LLM. It manages the screenshots and the notepad (the model's game memory) and sends button commands back to the emulator.
LLM Provider
The "brain" of the system that analyzes screenshots, makes decisions about what to do next, and keeps track of its progress and goals in the notepad.
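To make the flow between the three components concrete, here is a minimal Python sketch of one controller step: fetch a screenshot, ask the model for a decision, update the notepad, and forward the button press. This is not the project's actual code. The names `Controller`, `Decision`, `get_screenshot`, `send_button`, and `ask_model` are hypothetical, and the emulator and LLM are replaced with in-process stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Game Boy buttons; used to validate whatever the model returns.
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"}


@dataclass
class Decision:
    """One model response (hypothetical shape): a button plus an optional note."""
    button: str
    note: str = ""


@dataclass
class Controller:
    """Bridges a screenshot source, a model, and a button sink."""
    get_screenshot: Callable[[], bytes]                 # stand-in for the emulator side
    send_button: Callable[[str], None]                  # stand-in for the emulator side
    ask_model: Callable[[bytes, List[str]], Decision]   # stand-in for the LLM provider
    notepad: List[str] = field(default_factory=list)    # the model's game memory

    def step(self) -> str:
        """Run one screenshot -> decision -> button press cycle."""
        screenshot = self.get_screenshot()
        # Pass a copy so the model callback cannot mutate the notepad directly.
        decision = self.ask_model(screenshot, list(self.notepad))
        button = decision.button.strip().upper()
        if button not in VALID_BUTTONS:
            raise ValueError(f"model returned unknown button: {decision.button!r}")
        if decision.note:
            self.notepad.append(decision.note)
        self.send_button(button)
        return button


def demo() -> Tuple[List[str], List[str]]:
    """Drive three steps with a fake emulator and a fake model."""
    pressed: List[str] = []

    def fake_model(screenshot: bytes, notes: List[str]) -> Decision:
        # Press A first, then UP once the notepad has entries.
        button = "a" if not notes else "up"
        return Decision(button=button, note=f"step {len(notes) + 1}")

    controller = Controller(
        get_screenshot=lambda: b"fake-png-bytes",
        send_button=pressed.append,
        ask_model=fake_model,
    )
    for _ in range(3):
        controller.step()
    return pressed, controller.notepad
```

In a real setup the two emulator callbacks would talk to the Lua script running in mGBA, and `ask_model` would call the chosen LLM provider. Validating the button before sending it keeps a malformed model reply from reaching the game.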
All benchmarks are reproducible and open-source. Run them yourself or contribute to the project!
LATEST UPDATES
Project Launch: LLM Game Bench
We're excited to announce the launch of LLM Game Bench, starting with our Pokémon Red benchmark. Learn how we're testing AI visual understanding and decision-making.