Project Overview: LLM-Colosseum
The LLM-Colosseum project takes an innovative approach to evaluating Large Language Models (LLMs) by testing their capabilities in real-time battles within the classic video game, Street Fighter III. This project offers a novel benchmark for assessing LLM performance by simulating a competitive gaming environment where these models face off against each other.
Objectives and Criteria
The primary goal of the LLM-Colosseum is to discover which LLM can claim the title of the best virtual fighter. To achieve this, several key criteria are used to evaluate the performance of each LLM:
- Speed: Quick decision-making is crucial in real-time gaming environments.
- Intelligence: Effective models plan multiple steps ahead in their strategy.
- Creativity: Utilizing unconventional tactics can provide an edge over competitors.
- Adaptability: Successful models learn from past mistakes and adjust their strategies accordingly.
- Resilience: Maintaining high performance consistency throughout the game is rewarded.
Battle Scenarios
The LLM-Colosseum sets up different battle scenarios to test these capabilities. A typical setup is a one-on-one match, such as Mistral 7B vs. Mistral 7B, repeated over many iterations (e.g., six matches running simultaneously) to gather comprehensive data about the models' performance.
Benchmarking LLMs
Street Fighter III serves as the testing ground, challenging LLMs to interpret the game context and make informed decisions. Unlike reinforcement learning agents, which can act blindly on reward signals, LLMs are expected to show a nuanced understanding of the game environment, adjusting their actions as the match develops.
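To make this concrete, here is a minimal sketch of how a game state might be rendered into the text an LLM is asked to act on. The field names, health values, and prompt wording are illustrative assumptions, not the project's actual schema:

```python
def describe_state(state):
    # Render a game-state dict as a text prompt an LLM could act on.
    # All field names and phrasing are illustrative, not the project's
    # actual schema.
    own = ", ".join(state["own_moves"]) or "none"
    opp = ", ".join(state["opponent_moves"]) or "none"
    return (
        f"You are {state['own_character']} fighting {state['opponent_character']}.\n"
        f"Your health: {state['own_health']}. Opponent health: {state['opponent_health']}.\n"
        f"Your last moves: {own}. Opponent's last moves: {opp}.\n"
        "Reply with your next moves, e.g. 'Move Closer, Medium Punch'."
    )
```

The point is that the model receives a readable situation summary rather than raw pixels or a reward scalar, so its reply can reflect genuine reasoning about the match.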
Experimentation and Results
The LLM-Colosseum has conducted 342 matches thus far, resulting in a detailed leaderboard ranked by ELO scores. Notably, the leaderboard is topped by openai:gpt-3.5-turbo-0125, indicating superior performance in this synthetic environment:
- 🥇 openai:gpt-3.5-turbo-0125: ELO 1776.11
- 🥈 mistral:mistral-small-latest: ELO 1586.16
- 🥉 openai:gpt-4-1106-preview: ELO 1584.78
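Rankings like these follow the standard Elo scheme: after each match, a rating moves toward or away from its expected result. A minimal sketch of the update rule, assuming a K-factor of 32 (a common default, not necessarily the project's exact choice):

```python
def expected_score(r_a, r_b):
    # Probability that player A beats player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    # score_a: 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

For example, when two 1500-rated models meet, the winner gains 16 points and the loser drops 16; upset wins against higher-rated opponents move ratings further.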
Technical Aspects
Each model controls a player in the game through a text-based description of the game screen. Based on the input, the LLMs decide the next set of actions for their characters, factoring in both their prior moves and the opponents' actions, along with in-game statistics like health and power bars.
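The last step of that loop is parsing the model's free-text reply into a queue of game actions. A sketch, assuming an illustrative move vocabulary (the project defines its own action set):

```python
# Illustrative move vocabulary; an assumption, not the project's real action set.
VALID_MOVES = {
    "Move Closer", "Move Away", "Low Punch", "Medium Punch", "High Punch",
    "Low Kick", "Medium Kick", "High Kick", "Jump",
}

def parse_moves(reply):
    # Split the LLM's reply on commas/newlines, normalize case,
    # and keep only recognized moves, preserving order.
    moves = []
    for token in reply.replace("\n", ",").split(","):
        token = token.strip().title()
        if token in VALID_MOVES:
            moves.append(token)
    return moves
```

Filtering against a fixed vocabulary matters because LLM replies are free-form text: an unrecognized token is simply dropped rather than crashing the game loop.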
The project utilizes advanced techniques such as agent-based control, multithreading, and real-time processing to ensure an authentic gaming experience.
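As one illustration of the multithreading aspect, several matches can be driven concurrently with a thread pool. The `run_match` body below is a placeholder for the real emulator loop, which the sketch does not reproduce:

```python
from concurrent.futures import ThreadPoolExecutor

def run_match(match_id):
    # Placeholder for one full game loop; the real project drives the
    # game emulator here. We just return a fake result record.
    return {"match": match_id, "winner": match_id % 2}

def run_tournament(n_matches=6, workers=6):
    # Run several matches in parallel, as in e.g. six simultaneous
    # Mistral 7B mirror matches; map preserves result order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_match, range(n_matches)))
```

Threads fit here because each match spends most of its time waiting on LLM API responses and emulator frames, so the pool keeps all matches progressing at once.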
Installation and Usage
For those interested in running the LLM-Colosseum project, installation and execution instructions are provided. The setup involves a series of steps such as downloading necessary ROMs, establishing a Python virtual environment, and using Docker for containerized application deployment.
Options for customizing the model or prompts are available, allowing users to modify the robot's decision-making process by interacting with the Robot.call_llm() function. Advanced users can also submit new variations of models for consideration through a pull request.
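One way to use that customization point is to subclass a Robot-like class and override its LLM call. The class below is a heavily simplified stand-in for illustration, not the project's real implementation:

```python
class Robot:
    # Simplified stand-in for the project's Robot class; only the
    # customization hook is sketched here.
    def __init__(self, model="mistral:mistral-small-latest"):
        self.model = model

    def call_llm(self, prompt):
        # Default behavior would send the prompt to the configured model.
        raise NotImplementedError

class AggressiveRobot(Robot):
    def call_llm(self, prompt):
        # Custom decision-making: always press the attack. A real
        # override would call a model API with a modified prompt.
        return "Move Closer, High Punch"
```

Swapping in a subclass like this changes the fighter's strategy without touching the rest of the game loop.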
Conclusion
The LLM-Colosseum project represents a groundbreaking effort to evaluate LLMs in a dynamic and competitive setting. It not only highlights the strengths and weaknesses of different models but also offers insights into how AI technology can parallel human decision-making in complex scenarios. Developed by the OpenGenerativeAI team and collaborators, the project exemplifies a creative fusion of gaming and AI evaluation.