Vision-language models (VLMs) process both text and images in a single request, enabling image captioning, visual question answering, document analysis, chart interpretation, OCR, and content moderation. Use VLMs via serverless inference or dedicated deployments.

Browse available vision models →
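For example, a single request can pair an image with a text question. The sketch below is a minimal illustration that assumes the OpenAI-compatible Python SDK pointed at Fireworks' endpoint; the API key, model id, and image URL are placeholders, so substitute the vision model and image you actually want to use.

```python
from openai import OpenAI

# Assumes the OpenAI-compatible endpoint; swap in your own API key,
# vision model id, and image URL.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",  # placeholder vision model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```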
Vision-language models support prompt caching to improve performance for requests with repeated content. Both the text and image portions of a request can benefit from caching, reducing time to first token by up to 80%.

Tips for optimal performance:
- Use URLs for long conversations – Reduces latency compared to base64 encoding
- Downsize images – Smaller images use fewer tokens and process faster
- Structure prompts for caching – Place static instructions at the beginning, variable content at the end (see the sketch after this list)
- Include metadata in prompts – Add context about the image directly in your text prompt
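To make the caching tip concrete, the sketch below keeps the static instructions in a system message that is identical across requests and appends the variable image URL and question at the end, so repeated calls share a cacheable prefix. It reuses the same assumed OpenAI-compatible client setup and placeholder model id and URLs as the earlier sketch.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="<FIREWORKS_API_KEY>")

# Static instructions stay identical across requests so the cached prefix is reused;
# only the image URL and question at the end of the prompt change.
STATIC_INSTRUCTIONS = (
    "You are a chart-analysis assistant. Answer questions about the supplied chart "
    "image concisely and cite the axis labels you relied on."
)

def ask(image_url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="accounts/fireworks/models/kimi-k2p5",  # placeholder vision model id
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    )
    return response.choices[0].message.content

print(ask("https://example.com/q3-revenue.png", "Which quarter had the highest revenue?"))
```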
VLMs do not natively accept PDF files as input. To analyze PDF documents, convert each page to an image and pass the images to the model using base64 encoding.
Remember the 30-image limit per request. For long documents, process pages in batches or select only the relevant pages.
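A minimal sketch of that workflow is shown below. It assumes PyMuPDF for rasterization (any PDF-to-image library works) and the same assumed OpenAI-compatible client setup and placeholder model id as the earlier sketches; each rendered page is sent as a base64 data URL, capped at 30 pages per request.

```python
import base64

import fitz  # PyMuPDF — an assumption; any PDF rasterizer works
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="<FIREWORKS_API_KEY>")

def pdf_pages_as_data_urls(path: str, dpi: int = 150, max_pages: int = 30) -> list[str]:
    """Render each page to PNG and return base64 data URLs,
    capped at the 30-image-per-request limit."""
    doc = fitz.open(path)
    urls = []
    for page in doc:
        if len(urls) >= max_pages:
            break
        png_bytes = page.get_pixmap(dpi=dpi).tobytes("png")
        urls.append("data:image/png;base64," + base64.b64encode(png_bytes).decode("utf-8"))
    return urls

# One user message carrying every page image, followed by the question.
content = [{"type": "image_url", "image_url": {"url": u}} for u in pdf_pages_as_data_urls("report.pdf")]
content.append({"type": "text", "text": "Summarize this document."})

response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",  # placeholder vision model id
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```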
For the Completions API, manually insert the image token `<image>` in your prompt and supply the images as an ordered list:
```python
from openai import OpenAI

# Assumes the OpenAI-compatible endpoint; the URLs in extra_body are matched,
# in order, to the <image> tokens in the prompt.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",
    prompt="SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
    extra_body={
        "images": [
            "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
        ]
    },
)
print(response.choices[0].text)
```