# vLLM
## vllm
vLLM is a library for fast LLM inference and serving, achieving high throughput by managing attention key/value cache memory with PagedAttention. It integrates with popular Hugging Face models and supports a range of hardware platforms and decoding algorithms, enabling flexible, high-performance deployments. Recent updates include Llama 3.1 support and expanded quantization options. As a community-driven project, vLLM is backed by industry sponsorships and improves continually through collaboration and feedback.
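As a minimal sketch of offline inference with vLLM's Python API (the model ID below is an illustrative assumption; any supported Hugging Face model works):

```python
from vllm import LLM, SamplingParams

# Load a model; vLLM allocates KV-cache memory in pages via PagedAttention.
# The model ID is an example placeholder, not a requirement.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Batched generation: vLLM schedules requests internally for throughput.
outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```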
## lm-evaluation-harness
The framework provides a unified testing ground for generative language models, integrating over 60 standardized benchmarks. Recent enhancements include the Open LLM Leaderboard tasks and support for multimodal inputs and API-based models, improving customization and efficiency. It works with a variety of model backends, including GPT-NeoX and Megatron-DeepSpeed, and supports fast inference through vLLM. The harness is widely used in research and at organizations such as NVIDIA and Cohere.
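A short sketch of running an evaluation through the harness's Python entry point, here using the vLLM backend mentioned above (the model ID and task choice are illustrative assumptions):

```python
import lm_eval

# Evaluate a model on one benchmark task; simple_evaluate wraps task
# loading, inference, and metric computation in a single call.
results = lm_eval.simple_evaluate(
    model="vllm",  # or "hf" for the plain Hugging Face backend
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct",
    tasks=["hellaswag"],  # example task; any registered task name works
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```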
Feedback Email: [email protected]