What’s the latency for small, medium, and large LLMs?
Model latency and performance depend on many factors, including the following (a rough sketch of how they combine follows the list):
- Input/output prompt lengths
- Model quantization
- Model sharding
- Disaggregated prefill processes
- Hardware configuration
- Multiple layers of caching
- Fire optimizations
- LoRA (Low-Rank Adaptation) adapters
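
As a rough illustration of how some of these factors interact, here is a minimal back-of-the-envelope sketch: end-to-end latency is roughly time-to-first-token (dominated by prefill over the input prompt) plus output length divided by decode throughput, and prompt caching cuts the prefill portion. All constants and the `estimate_latency_s` helper below are hypothetical assumptions for illustration, not measured numbers or part of any API.

```python
# Back-of-the-envelope latency model (illustrative only; all numbers
# below are hypothetical assumptions, not benchmarks).
#
#   total_latency ≈ TTFT + output_tokens / decode_tps
#   TTFT grows with input prompt length (prefill work);
#   decode_tps depends on quantization, sharding, and hardware.

def estimate_latency_s(
    input_tokens: int,
    output_tokens: int,
    prefill_tps: float = 10_000.0,   # assumed prefill throughput (tokens/s)
    decode_tps: float = 50.0,        # assumed per-request decode speed (tokens/s)
    cache_hit_fraction: float = 0.0, # fraction of the prompt served from cache
) -> float:
    """Estimate end-to-end latency for one request, in seconds."""
    # Prompt caching skips prefill work for the cached prefix.
    effective_prefill_tokens = input_tokens * (1.0 - cache_hit_fraction)
    ttft = effective_prefill_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    return ttft + decode_time

if __name__ == "__main__":
    # Same request, with a cold cache vs. a mostly warm prompt cache.
    cold = estimate_latency_s(input_tokens=8_000, output_tokens=500)
    warm = estimate_latency_s(input_tokens=8_000, output_tokens=500,
                              cache_hit_fraction=0.9)
    print(f"cold: {cold:.2f}s, warm cache: {warm:.2f}s")
```

Under these assumed numbers, a long prompt with a warm cache shaves most of the time-to-first-token, which is why caching and traffic patterns matter as much as raw model size.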
Our team specializes in tuning model performance for your workload. We work with you to understand your traffic patterns and build customized deployment templates that maximize performance for your use case.