Can AI Code?
Can AI Code? is a project that explores the coding capabilities of Artificial Intelligence (AI) using human-written interview questions. The tests run inside a purpose-built sandbox environment so that model-generated code can be evaluated safely and consistently.
Key Ideas
The main features of Can AI Code? include:
- Human-Crafted Interview Questions: The project uses sets of interview questions designed by humans to test AI's coding abilities.
- Inference Scripts: Scripts that work with all common API providers and CUDA-enabled quantization runtimes.
- Sandbox Environment: A Docker-based sandbox for safely executing untrusted Python and NodeJS code (see the sketch after this list).
- Evaluation of Prompting and Sampling: It explores the effects of different prompting techniques and sampling parameters on the performance of Large Language Models (LLMs) in coding tasks.
- Assessment of Quantization Impact: It also examines how coding performance might degrade due to quantization.
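To make the sandbox idea concrete, here is a minimal sketch of running untrusted Python inside a throwaway Docker container. The container image, resource limits, and helper name are illustrative assumptions, not the project's actual sandbox implementation.

```python
# Illustrative sketch only: run a snippet of untrusted Python inside a
# disposable Docker container with no network access and a memory cap.
# The image name, limits, and function name are assumptions, not the
# project's actual sandbox code.
import subprocess

def run_untrusted_python(code: str, timeout_s: int = 10) -> str:
    """Execute `code` in an isolated container and return its stdout."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",   # no network access
            "--memory", "256m",    # cap memory usage
            "python:3.11-slim",
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout

if __name__ == "__main__":
    print(run_untrusted_python("print(sum(range(10)))"))  # expect: 45
```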
Recent Updates
- October 26: Evaluation of the Qwen2.5 model family and updated evaluations of the OpenAI, Mistral, and Anthropic models.
- October 25: Evaluation of the IBM-Granite/Granite-3.0 family, marking a return to evaluation activity after a brief hiatus.
- September 12: Correction of a serialization bug in the evaluation process.
- September 11: Evaluation of the powerful Yi-Coder models.
Test Suites
- Junior-v2: A multi-language (Python and JavaScript) suite of 12 tests designed to gauge small-LLM coding performance.
- Humaneval: A Python-only suite of 164 tests originally created by OpenAI, with templates and evaluation scripts provided for convenience.
View the Leaderboard | View Comparisons
Results and Data
The repository contains all model answers and evaluation results. Streamlit applications are provided for exploring the results and can also be run locally as web apps.
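As a rough illustration of what a local results explorer might look like, here is a minimal Streamlit sketch; the results file path and its columns are assumptions, not the repository's actual app.

```python
# Hypothetical Streamlit results explorer; the file path and column names are assumptions.
import pandas as pd
import streamlit as st

st.title("Can AI Code? - Evaluation Results")

# Load a flattened results file; the real apps read the repository's own data files.
df = pd.read_json("results/evaluations.json")  # hypothetical path

model = st.selectbox("Model", sorted(df["model"].unique()))
st.dataframe(df[df["model"] == model])
```

Such a sketch would be launched with `streamlit run` against the script file.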
Repository Structure
- Interviews: Interview questions for the junior and senior coding levels, stored as .yaml files.
- Prepare: LLM prompt templates and scripts that convert the questions into model-specific prompts.
- Interview and Evaluate: Scripts and applications to conduct the interviews and evaluate the generated code.
- Comparison Tools: Scripts to compare evaluation results, with optional LLM calls for deeper analysis; a sketch of the overall pipeline follows this list.
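To make the flow through these directories concrete, here is a hedged sketch of a prepare, interview, and evaluate pipeline driven from Python. The script names and command-line flags are assumptions based on the structure described above, not the project's exact command lines.

```python
# Hypothetical end-to-end pipeline runner; script names and flags are assumptions.
import subprocess

def run(step: list[str]) -> None:
    print(">>", " ".join(step))
    subprocess.run(step, check=True)

# 1. Turn the .yaml interview questions into model-specific prompts.
run(["python", "prepare.py", "--interview", "junior-v2", "--template", "chat"])

# 2. Send the prepared prompts to a model via one of the interview scripts.
run(["python", "interview_cuda.py", "--model", "some/model-name"])

# 3. Evaluate the generated code inside the sandbox.
run(["python", "evaluate.py", "--interview", "junior-v2"])
```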
Interview Methods
The project supports a variety of API and CUDA runtime configurations for conducting interviews, with scripts tailored to platforms such as OpenAI, KoboldCpp, Huggingface, and others.
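For API-backed interviews, the general pattern is to send each prepared prompt to a chat-completions endpoint and record the reply. Below is a minimal sketch using the OpenAI Python client; the model name and prompt text are placeholders, and the project's own interview scripts handle this far more completely.

```python
# Minimal sketch of a single API-backed interview turn (model and prompt are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Write a Python function that returns the n-th Fibonacci number."  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
)

answer = response.choices[0].message.content
print(answer)
```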
Running on Modal
The recommended setup uses a CUDA 11.8-based container; an alternative CUDA 12 script is available but has more limited compatibility. Users can configure the script for specific models and runtimes.
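A rough sketch of pinning a Modal function to a CUDA 11.8 base image follows. The image tag, GPU type, installed packages, and function body are assumptions; the repository's Modal interview scripts are the authoritative reference.

```python
# Hypothetical Modal sketch: a GPU function built on a CUDA 11.8 image.
# Image tag, GPU type, packages, and function contents are assumptions.
import modal

image = modal.Image.from_registry(
    "nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10"
).pip_install("transformers", "torch")

app = modal.App("can-ai-code-sketch", image=image)

@app.function(gpu="A10G", timeout=600)
def run_interview(prompt: str) -> str:
    # A real implementation would load a model and generate an answer here.
    return f"(model output for: {prompt})"

@app.local_entrypoint()
def main():
    print(run_interview.remote("Write a hello-world function in Python."))
```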
Question Formats
Questions are stored in .yaml files, each containing the fields needed for prompt preparation and evaluation. These fields keep every test well-defined and easy to understand.
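As an illustration of working with the question files, here is a small sketch that loads one .yaml interview file with PyYAML and lists each entry's fields. The file path, the assumption that the file holds a list of question mappings, and the field names are hypothetical; the exact schema lives in the repository's interview files.

```python
# Hypothetical loader for an interview .yaml file; the path, layout, and
# field names are assumptions, not the project's documented schema.
import yaml

with open("junior-v2/interview.yaml", "r", encoding="utf-8") as fh:  # hypothetical path
    questions = yaml.safe_load(fh)  # this sketch assumes a list of question mappings

for q in questions:
    # Each entry is expected to carry whatever preparation and evaluation need,
    # e.g. a name, a prompt, and check definitions (names here are illustrative).
    print(q.get("name", "<unnamed>"), "->", sorted(q.keys()))
```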
Future Work
Plans include developing a test suite for senior-level coding and evaluating additional requested models, reflecting the project's commitment to ongoing exploration of AI coding capabilities.
This overview highlights the project's structured, evaluation-driven approach to combining AI with coding, illuminating both the challenges and the potential of AI across a range of software development tasks.