VLM fine-tuning is currently supported for Qwen 2.5 VL models only.
This guide covers fine-tuning for Vision-Language Models (VLMs) that process both images and text. For fine-tuning text-only models, see our Supervised fine-tuning for text guide.
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. This guide shows you how to fine-tune VLMs on Fireworks AI using LoRA (Low-Rank Adaptation) with datasets containing both images and text.

Supported Models

Currently, VLM fine-tuning supports the Qwen 2.5 VL family of models (for example, accounts/fireworks/models/qwen2p5-vl-32b-instruct, which is used in the examples below).

Understanding LoRA for VLMs

LoRA significantly reduces the computational and memory requirements for fine-tuning large vision-language models. Instead of updating billions of parameters directly, LoRA learns small “adapter” layers that capture the changes needed for your specific task.

Key benefits of LoRA for VLMs:
  • Efficiency: Requires significantly less memory and compute than full fine-tuning
  • Speed: Faster training times while maintaining high-quality results
  • Flexibility: Up to 100 LoRA adapters can run simultaneously on a dedicated deployment
  • Cost-effective: Lower training costs compared to full parameter fine-tuning

Fine-tuning Pricing

VLM fine-tuning pricing is based on actual usage:
  • Training: You pay per token of training data used during the VLM fine-tuning process; see the pricing page for the cost per token.
  • Image processing: Images are tokenized based on resolution and model - typically 1,000-2,500 tokens per image (see this FAQ for more details)
VLM fine-tuning may cost more than text-only fine-tuning due to the additional tokens for processing images alongside text.
Optimize image sizes: Smaller images use fewer tokens during training and inference, so they are faster and cheaper to process. Use the smallest image size that still provides enough detail for your task.
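For example, here is a minimal downscaling sketch using Pillow; the 1024-pixel cap, the JPEG quality setting, and the file names are illustrative, not Fireworks requirements:

from PIL import Image

def downscale(path, out_path, max_side=1024):
    # Shrink so the longest side is at most max_side, preserving aspect ratio
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    img.save(out_path, quality=85)

downscale("receipt_full_res.jpg", "receipt_small.jpg")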

Fine-tuning a VLM using LoRA

1. Prepare your vision dataset

Vision datasets must be in JSONL format using the OpenAI-compatible chat format, with each line representing a complete training example.

Dataset Requirements:
  • Format: .jsonl file
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • Images: Must be base64 encoded with proper MIME type prefixes
  • Supported image formats: PNG, JPG, JPEG
Message Schema: Each training example must include a messages array where each message has:
  • role: one of system, user, or assistant
  • content: either a plain text string or an array of text and image_url objects

Basic VLM Dataset Example

{"messages": [{"role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them."}, {"role": "user", "content": [{"type": "text", "text": "What objects do you see in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}]}, {"role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image."}]}

If your dataset contains image URLs

Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you will need to download the images and encode them as base64.
❌ Incorrect Format - This will NOT work:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Raw HTTP/HTTPS URLs are not supported. Images must be base64 encoded.
✅ Correct Format - Use this instead:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Notice the data:image/jpeg;base64, prefix followed by the base64 encoded image data.
You can use a script along the lines of the sketch below to convert such a dataset to the correct format.
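This sketch assumes each line follows the message schema above and that image_url entries hold HTTP(S) URLs; the input and output file names are illustrative:

import base64
import json
import mimetypes
import urllib.request

def url_to_data_url(url):
    # Download the image and re-encode it as a base64 data URL
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
        mime = resp.headers.get_content_type() or mimetypes.guess_type(url)[0] or "image/jpeg"
    return f"data:{mime};base64,{base64.b64encode(data).decode('utf-8')}"

def convert_example(example):
    for message in example["messages"]:
        content = message["content"]
        if not isinstance(content, list):
            continue
        for part in content:
            if part.get("type") != "image_url":
                continue
            url = part["image_url"]["url"]
            if url.startswith(("http://", "https://")):
                part["image_url"]["url"] = url_to_data_url(url)
    return example

with open("dataset_with_urls.jsonl") as src, open("dataset_base64.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(convert_example(json.loads(line))) + "\n")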

Advanced Dataset Examples

Multi-image Conversation

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images and tell me the differences"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url", 
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
    }
  ]
}

Multi-turn Conversation

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
    },
    {
      "role": "user", 
      "content": [
        {
          "type": "text",
          "text": "Can you describe this kitchen layout for me?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
    },
    {
      "role": "user",
      "content": "Now look at this living room. Do you think the styles would work well together?"
    },
    {
      "role": "assistant", 
      "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Sorry about that! Here's the living room:"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
    }
  ]
}

Try with an Example Dataset

To get a feel for how VLM fine-tuning works, you can use an example vision dataset:
# Download the example dataset
curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
This is a food classification dataset whose assistant responses include <think></think> tags for reasoning.
Once downloaded, you can upload this dataset using the instructions in the next step.
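If you want a quick look at the data before uploading, a sanity check along these lines works (it assumes the file was saved as food_reasoning.jsonl, as in the curl command above):

import json

with open("food_reasoning.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} examples")
# Print the roles and content part types of the first example
for message in examples[0]["messages"]:
    content = message["content"]
    kinds = [part["type"] for part in content] if isinstance(content, list) else "text"
    print(message["role"], kinds)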
2. Upload your VLM dataset

Upload your prepared JSONL dataset to Fireworks for training:
firectl create dataset my-vlm-dataset /path/to/vlm_training_data.jsonl
For larger datasets (>500MB), use firectl as it handles large uploads more reliably than the web interface.
3. Launch VLM fine-tuning job

Create a supervised fine-tuning job for your VLM:
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-custom-vlm \
  --epochs 3
For additional parameters like learning rates, evaluation datasets, and batch sizes, see Additional SFT job settings.
VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.
4. Monitor training progress

Track your VLM fine-tuning job in the Fireworks console.
Monitor key metrics:
  • Training loss: Should generally decrease over time
  • Evaluation loss: Monitor for overfitting if you provided an evaluation dataset
  • Training progress: Epochs completed and estimated time remaining
Your VLM fine-tuning job is complete when the status shows COMPLETED and your custom model is ready for deployment.
5. Deploy your fine-tuned VLM

Once training is complete, deploy your custom VLM:
# Create a deployment for your fine-tuned VLM
firectl create deployment my-custom-vlm

# Check deployment status
firectl get deployment accounts/your-account/deployment/deployment-id

Advanced Configuration

For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the Additional SFT job settings section in our comprehensive fine-tuning guide.

Testing Your Fine-tuned VLM

After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/your-account/models/my-custom-vlm",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg"
            },
        },{
            "type": "text",
            "text": "What's in this image?",
        }],
    }]
)
print(response.choices[0].message.content)
If you fine-tuned using the example dataset, your model should include <think></think> tags in its response.
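Because the API is OpenAI-compatible and also accepts base64 data URLs at inference time, you can test with a local image in the same way. This sketch reuses the client from the example above; the file name is illustrative:

import base64

# Encode a local image as a data URL (same format used in the training data)
with open("local_test_image.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="accounts/your-account/models/my-custom-vlm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
print(response.choices[0].message.content)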