text-generation-inference
Text Generation Inference (TGI) is a toolkit for deploying and serving large language models such as Llama and GPT-NeoX. It speeds up inference with features such as tensor parallelism across multiple GPUs and token streaming, and it supports hardware ranging from Nvidia GPUs to Google TPUs. Key optimizations include Flash Attention and quantization. It also offers customization options and distributed tracing for production deployments.
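
To make token streaming concrete, here is a minimal client-side sketch in Python using only the standard library. It assumes a server is already running at http://127.0.0.1:8080 (a placeholder address) and that the streaming endpoint returns server-sent events whose `data:` lines carry a JSON object with a `token` field. The helper names `build_stream_request`, `iter_token_texts`, and `stream` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: a server is reachable at this address; adjust to your deployment.
BASE_URL = "http://127.0.0.1:8080"


def build_stream_request(prompt, max_new_tokens=20, base_url=BASE_URL):
    """Build a POST request for the streaming generation endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate_stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_token_texts(lines):
    """Yield token text from server-sent-event lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        # Special tokens (for example end-of-sequence) are not shown to the user.
        if not token.get("special", False):
            yield token.get("text", "")


def stream(prompt):
    """Print tokens as they arrive. Requires a running server."""
    with urllib.request.urlopen(build_stream_request(prompt)) as resp:
        for text in iter_token_texts(resp):
            print(text, end="", flush=True)
```

Calling `stream("What is deep learning?")` against a live server would print the reply piece by piece instead of waiting for the full completion. The parsing is kept in its own function so it can be checked against recorded event lines without a server.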