Vision-language models (VLMs) process both text and images in a single request, enabling image captioning, visual question answering, document analysis, chart interpretation, OCR, and content moderation. Use VLMs via serverless inference or dedicated deployments.

Browse available vision models →
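For example, a single request can pair an image with a text question. The sketch below is a minimal illustration that assumes the OpenAI-compatible Python SDK pointed at Fireworks' endpoint; the API key, model id, and image URL are placeholders, so substitute the vision model and image you actually want to use.

```python
from openai import OpenAI

# Assumes the OpenAI-compatible endpoint; swap in your own API key,
# vision model id, and image URL.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",  # placeholder vision model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```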
Vision-language models support prompt caching to improve performance for requests with repeated content. Both the text and image portions of a request can benefit from caching, reducing time to first token by up to 80%.

Tips for optimal performance:
- Use URLs for long conversations – Reduces latency compared to base64 encoding
- Downsize images – Smaller images use fewer tokens and process faster
- Structure prompts for caching – Place static instructions at the beginning, variable content at the end (see the sketch after this list)
- Include metadata in prompts – Add context about the image directly in your text prompt
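To make the caching tip concrete, the sketch below keeps the static instructions in a system message that is identical across requests and appends the variable image URL and question at the end, so repeated calls share a cacheable prefix. It reuses the same assumed OpenAI-compatible client setup and placeholder model id and URLs as the earlier sketch.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="<FIREWORKS_API_KEY>")

# Static instructions stay identical across requests so the cached prefix is reused;
# only the image URL and question at the end of the prompt change.
STATIC_INSTRUCTIONS = (
    "You are a chart-analysis assistant. Answer questions about the supplied chart "
    "image concisely and cite the axis labels you relied on."
)

def ask(image_url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="accounts/fireworks/models/kimi-k2p5",  # placeholder vision model id
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    )
    return response.choices[0].message.content

print(ask("https://example.com/q3-revenue.png", "Which quarter had the highest revenue?"))
```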
VLMs do not natively accept PDF files as input. To analyze PDF documents, convert each page to an image and pass the images to the model using base64 encoding.
Remember the 30-image limit per request. For long documents, process pages in batches or select only the relevant pages.
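A minimal sketch of that workflow is shown below. It assumes PyMuPDF for rasterization (any PDF-to-image library works) and the same assumed OpenAI-compatible client setup and placeholder model id as the earlier sketches; each rendered page is sent as a base64 data URL, capped at 30 pages per request.

```python
import base64

import fitz  # PyMuPDF — an assumption; any PDF rasterizer works
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="<FIREWORKS_API_KEY>")

def pdf_pages_as_data_urls(path: str, dpi: int = 150, max_pages: int = 30) -> list[str]:
    """Render each page to PNG and return base64 data URLs,
    capped at the 30-image-per-request limit."""
    doc = fitz.open(path)
    urls = []
    for page in doc:
        if len(urls) >= max_pages:
            break
        png_bytes = page.get_pixmap(dpi=dpi).tobytes("png")
        urls.append("data:image/png;base64," + base64.b64encode(png_bytes).decode("utf-8"))
    return urls

# One user message carrying every page image, followed by the question.
content = [{"type": "image_url", "image_url": {"url": u}} for u in pdf_pages_as_data_urls("report.pdf")]
content.append({"type": "text", "text": "Summarize this document."})

response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",  # placeholder vision model id
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```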
For the Completions API, manually insert the image token `<image>` in your prompt and supply the images as an ordered list:
```python
from openai import OpenAI

# Assumes the OpenAI-compatible endpoint; the URLs in extra_body are matched,
# in order, to the <image> tokens in the prompt.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",
    prompt="SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
    extra_body={
        "images": [
            "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
        ]
    },
)
print(response.choices[0].text)
```