Can AI Code?
Can AI Code? is a project that explores the coding capabilities of Artificial Intelligence (AI) using human-written interview questions. The tests run inside a purpose-built sandbox environment so that model-generated code can be evaluated safely and consistently.
Key Ideas
The main features of Can AI Code? include:
- Human-Crafted Interview Questions: The project uses sets of interview questions designed by humans to test AI's coding abilities.
- Inference Scripts: Scripts that work with all common API providers and CUDA-enabled quantization runtimes.
- Sandbox Environment: A Docker-based sandbox for safely executing untrusted Python and NodeJS code (see the sketch after this list).
- Evaluation of Prompting and Sampling: It explores the effects of different prompting techniques and sampling parameters on the performance of Large Language Models (LLMs) in coding tasks.
- Assessment of Quantization Impact: It also examines how coding performance might degrade due to quantization.
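To make the sandbox idea concrete, here is a minimal sketch of running untrusted Python inside a throwaway Docker container. The container image, resource limits, and helper name are illustrative assumptions, not the project's actual sandbox implementation.

```python
# Illustrative sketch only: run a snippet of untrusted Python inside a
# disposable Docker container with no network access and a memory cap.
# The image name, limits, and function name are assumptions, not the
# project's actual sandbox code.
import subprocess

def run_untrusted_python(code: str, timeout_s: int = 10) -> str:
    """Execute `code` in an isolated container and return its stdout."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",   # no network access
            "--memory", "256m",    # cap memory usage
            "python:3.11-slim",
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout

if __name__ == "__main__":
    print(run_untrusted_python("print(sum(range(10)))"))  # expect: 45
```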
Recent Updates
- October 26: Evaluation of the Qwen2.5 model family and updated evaluations of the OpenAI, Mistral, and Anthropic models.
- October 25: Evaluation of the IBM-Granite/Granite-3.0 family, marking a return to evaluation activity after a brief hiatus.
- September 12: Correction of a serialization bug in the evaluation process.
- September 11: Evaluation of the powerful Yi-Coder models.
Test Suites
- Junior-v2: A multi-language (Python and JavaScript) suite of 12 tests designed to gauge small-LLM coding performance.
- Humaneval: A Python-only suite of 164 tests originally created by OpenAI, with templates and evaluation scripts provided for convenience.
View the Leaderboard | View Comparisons
Results and Data
The repository contains all model answers and evaluation results. Streamlit applications are provided for exploring the results and can also be run locally as web apps.
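As a rough illustration of what a local results explorer might look like, here is a minimal Streamlit sketch; the results file path and its columns are assumptions, not the repository's actual app.

```python
# Hypothetical Streamlit results explorer; the file path and column names are assumptions.
import pandas as pd
import streamlit as st

st.title("Can AI Code? - Evaluation Results")

# Load a flattened results file; the real apps read the repository's own data files.
df = pd.read_json("results/evaluations.json")  # hypothetical path

model = st.selectbox("Model", sorted(df["model"].unique()))
st.dataframe(df[df["model"] == model])
```

Such a sketch would be launched with `streamlit run` against the script file.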
Repository Structure
- Interviews: Interview questions for the junior and senior coding levels, stored as .yaml files.
- Prepare: LLM prompt templates and scripts that convert the questions into model-specific prompts.
- Interview and Evaluate: Scripts and applications to conduct the interviews and evaluate the generated code.
- Comparison Tools: Scripts to compare evaluation results, with optional LLM calls for deeper analysis; a sketch of the overall pipeline follows this list.
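To make the flow through these directories concrete, here is a hedged sketch of a prepare, interview, and evaluate pipeline driven from Python. The script names and command-line flags are assumptions based on the structure described above, not the project's exact command lines.

```python
# Hypothetical end-to-end pipeline runner; script names and flags are assumptions.
import subprocess

def run(step: list[str]) -> None:
    print(">>", " ".join(step))
    subprocess.run(step, check=True)

# 1. Turn the .yaml interview questions into model-specific prompts.
run(["python", "prepare.py", "--interview", "junior-v2", "--template", "chat"])

# 2. Send the prepared prompts to a model via one of the interview scripts.
run(["python", "interview_cuda.py", "--model", "some/model-name"])

# 3. Evaluate the generated code inside the sandbox.
run(["python", "evaluate.py", "--interview", "junior-v2"])
```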
Interview Methods
The project supports a variety of API and CUDA runtime configurations for conducting interviews, with scripts tailored to platforms such as OpenAI, KoboldCpp, Huggingface, and others.
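For API-backed interviews, the general pattern is to send each prepared prompt to a chat-completions endpoint and record the reply. Below is a minimal sketch using the OpenAI Python client; the model name and prompt text are placeholders, and the project's own interview scripts handle this far more completely.

```python
# Minimal sketch of a single API-backed interview turn (model and prompt are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Write a Python function that returns the n-th Fibonacci number."  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
)

answer = response.choices[0].message.content
print(answer)
```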
Running on Modal
The recommended setup uses a CUDA 11.8-based container; an alternative CUDA 12 script is available but has more limited compatibility. Users can configure the script for specific models and runtimes.
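A rough sketch of pinning a Modal function to a CUDA 11.8 base image follows. The image tag, GPU type, installed packages, and function body are assumptions; the repository's Modal interview scripts are the authoritative reference.

```python
# Hypothetical Modal sketch: a GPU function built on a CUDA 11.8 image.
# Image tag, GPU type, packages, and function contents are assumptions.
import modal

image = modal.Image.from_registry(
    "nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10"
).pip_install("transformers", "torch")

app = modal.App("can-ai-code-sketch", image=image)

@app.function(gpu="A10G", timeout=600)
def run_interview(prompt: str) -> str:
    # A real implementation would load a model and generate an answer here.
    return f"(model output for: {prompt})"

@app.local_entrypoint()
def main():
    print(run_interview.remote("Write a hello-world function in Python."))
```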
Question Formats
Questions are stored in .yaml files, each containing the fields needed for prompt preparation and evaluation. These fields keep every test well-defined and easy to understand.
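As an illustration of working with the question files, here is a small sketch that loads one .yaml interview file with PyYAML and lists each entry's fields. The file path, the assumption that the file holds a list of question mappings, and the field names are hypothetical; the exact schema lives in the repository's interview files.

```python
# Hypothetical loader for an interview .yaml file; the path, layout, and
# field names are assumptions, not the project's documented schema.
import yaml

with open("junior-v2/interview.yaml", "r", encoding="utf-8") as fh:  # hypothetical path
    questions = yaml.safe_load(fh)  # this sketch assumes a list of question mappings

for q in questions:
    # Each entry is expected to carry whatever preparation and evaluation need,
    # e.g. a name, a prompt, and check definitions (names here are illustrative).
    print(q.get("name", "<unnamed>"), "->", sorted(q.keys()))
```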
Future Work
Plans include developing a test suite for senior-level coding and evaluating additional requested models, reflecting the project's commitment to ongoing exploration of AI coding capabilities.
This overview highlights the project's structured, evaluation-driven approach to combining AI with coding, illuminating both the challenges and the potential of AI across a range of software development tasks.