simple-evals
This repository provides a lightweight library for transparent evaluation of language models, with an emphasis on zero-shot, chain-of-thought prompting. It includes benchmark results for models such as GPT-4 on evals like MMLU and HumanEval. The evals use simple, realistic instructions rather than heavily engineered prompts, so scores more closely reflect real-world performance. The repository is not actively maintained beyond occasional updates such as bug fixes or the addition of new models and evals. Sampler interfaces are provided for both the OpenAI and Anthropic APIs, so the same eval can be run against models from either provider.
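As a rough sketch of how a sampler and an eval fit together, the snippet below wires an OpenAI chat sampler to an MMLU eval. Class names, import paths, and parameters follow the layout of the upstream openai/simple-evals code but should be treated as assumptions here, not documented API; the repository's own entry point is its command-line script.

```python
# Illustrative sketch only: class names and paths are assumed from the
# upstream openai/simple-evals layout and may differ in practice.
# Requires OPENAI_API_KEY to be set in the environment.
from sampler.chat_completion_sampler import ChatCompletionSampler  # assumed import path
from mmlu_eval import MMLUEval  # assumed import path

# A sampler wraps one model behind a common completion interface.
sampler = ChatCompletionSampler(model="gpt-4-turbo")

# An eval takes a sampler and returns aggregate results.
mmlu = MMLUEval(num_examples=50)  # small subset for a quick, cheaper run
result = mmlu(sampler)

# The result object is assumed to expose an overall score plus per-metric values.
print(result.score, result.metrics)
```

Swapping in an Anthropic-backed sampler in place of the OpenAI one (with ANTHROPIC_API_KEY set) would leave the rest of the flow unchanged, which is the point of the shared sampler interface.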