punica

Efficient Serving of Multiple Finetuned Models with Minimal Overhead

Product Description

Punica serves multiple LoRA-finetuned models on top of a shared base model, with each LoRA model adding only about 1% memory overhead. A custom CUDA kernel, Segmented Gather Matrix-Vector multiplication (SGMV), batches requests for different LoRA models efficiently and delivers up to 12x the throughput of leading serving systems. Punica can be installed from prebuilt binaries or built from source to match your configuration, and it comes with examples and benchmarks.
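The segmented-gather idea mentioned above can be illustrated with a minimal pure-Python sketch. This is not Punica's CUDA kernel or its API. The function names (`matvec`, `segmented_lora_delta`, `lora_overhead`) and the example sizes are assumptions made up for illustration. The sketch assumes the requests in a batch are already grouped by LoRA model, so each contiguous segment of inputs is multiplied by that model's two low-rank matrices.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Sequence[float], w: Matrix) -> Vector:
    """Multiply a row vector x (length len(w)) by matrix w (len(w) x len(w[0]))."""
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]


def segmented_lora_delta(
    xs: List[Vector],
    seg_starts: List[int],
    lora_a: List[Matrix],
    lora_b: List[Matrix],
) -> List[Vector]:
    """Compute the LoRA term x @ A_k @ B_k for every input in a mixed batch.

    Hypothetical layout: inputs seg_starts[k] .. seg_starts[k + 1] - 1 all
    belong to LoRA model k, so the weights for model k are looked up once
    per segment rather than once per request.
    """
    out: List[Vector] = []
    for k in range(len(seg_starts) - 1):
        a, b = lora_a[k], lora_b[k]  # gather this segment's adapter weights
        for i in range(seg_starts[k], seg_starts[k + 1]):
            out.append(matvec(matvec(xs[i], a), b))
    return out


def lora_overhead(h1: int, h2: int, rank: int) -> float:
    """Ratio of LoRA parameters (A: h1 x rank, B: rank x h2) to one h1 x h2 weight."""
    return rank * (h1 + h2) / (h1 * h2)


if __name__ == "__main__":
    # Three requests: the first two use adapter 0, the last uses adapter 1.
    xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    seg_starts = [0, 2, 3]
    lora_a = [[[1.0], [0.0]], [[0.0], [1.0]]]  # each A is 2 x 1 (rank 1)
    lora_b = [[[1.0, 1.0]], [[2.0, 0.0]]]      # each B is 1 x 2
    print(segmented_lora_delta(xs, seg_starts, lora_a, lora_b))
    # Illustrative sizes only: a 4096 x 4096 weight with a rank-16 adapter.
    print(lora_overhead(4096, 4096, 16))
```

With the illustrative sizes in the last line, the adapter holds under 1% of the parameters of the weight it modifies, which is in line with the overhead figure quoted in the description.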
Project Details