VLM fine-tuning is currently supported for Qwen 2.5 VL models only.
This guide covers fine-tuning for Vision-Language Models (VLMs) that process both images and text. For fine-tuning text-only models, see our Supervised fine-tuning for text guide.
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. This guide shows you how to fine-tune VLMs on Fireworks AI using LoRA (Low-Rank Adaptation) with datasets containing both images and text.

Supported Models

Currently, VLM fine-tuning supports the Qwen 2.5 VL family of models (for example, accounts/fireworks/models/qwen2p5-vl-32b-instruct, which is used in the examples below).

Understanding LoRA for VLMs

LoRA significantly reduces the computational and memory requirements for fine-tuning large vision-language models. Instead of updating billions of parameters directly, LoRA learns small “adapter” layers that capture the changes needed for your specific task.

Key benefits of LoRA for VLMs:
  • Efficiency: Requires significantly less memory and compute than full fine-tuning
  • Speed: Faster training times while maintaining high-quality results
  • Flexibility: Up to 100 LoRA adapters can run simultaneously on a dedicated deployment
  • Cost-effective: Lower training costs compared to full parameter fine-tuning

Fine-tuning Pricing

VLM fine-tuning pricing is based on actual usage:
  • Training: You pay per token of training data used during the VLM fine-tuning process; see the pricing page for the cost per token.
  • Image processing: Images are tokenized based on resolution and model - typically 1,000-2,500 tokens per image (see this FAQ for more details)
VLM fine-tuning may cost more than text-only fine-tuning due to the additional tokens for processing images alongside text.
Optimize image sizes: Smaller images use fewer tokens during training and inference, so they are faster and cheaper to process. Use the smallest image size that still provides enough detail for your task.
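For example, here is a minimal downscaling sketch using Pillow; the 1024-pixel cap, the JPEG quality setting, and the file names are illustrative, not Fireworks requirements:

from PIL import Image

def downscale(path, out_path, max_side=1024):
    # Shrink so the longest side is at most max_side, preserving aspect ratio
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    img.save(out_path, quality=85)

downscale("receipt_full_res.jpg", "receipt_small.jpg")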

Fine-tuning a VLM using LoRA

1. Prepare your vision dataset

Vision datasets must be in JSONL format using the OpenAI-compatible chat format, with each line representing a complete training example.

Dataset Requirements:
  • Format: .jsonl file
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • Images: Must be base64 encoded with proper MIME type prefixes
  • Supported image formats: PNG, JPG, JPEG
Message Schema: Each training example must include a messages array where each message has:
  • role: one of system, user, or assistant
  • content: either a plain text string or an array of text and image_url objects

Basic VLM Dataset Example

{"messages": [{"role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them."}, {"role": "user", "content": [{"type": "text", "text": "What objects do you see in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}]}, {"role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image."}]}

If your dataset contains image URLs

Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you will need to download the images and encode them as base64.
❌ Incorrect Format - This will NOT work:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Raw HTTP/HTTPS URLs are not supported. Images must be base64 encoded.
✅ Correct Format - Use this instead:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Notice the data:image/jpeg;base64, prefix followed by the base64 encoded image data.
You can use a script along the lines of the sketch below to convert such a dataset to the correct format.
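This sketch assumes each line follows the message schema above and that image_url entries hold HTTP(S) URLs; the input and output file names are illustrative:

import base64
import json
import mimetypes
import urllib.request

def url_to_data_url(url):
    # Download the image and re-encode it as a base64 data URL
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
        mime = resp.headers.get_content_type() or mimetypes.guess_type(url)[0] or "image/jpeg"
    return f"data:{mime};base64,{base64.b64encode(data).decode('utf-8')}"

def convert_example(example):
    for message in example["messages"]:
        content = message["content"]
        if not isinstance(content, list):
            continue
        for part in content:
            if part.get("type") != "image_url":
                continue
            url = part["image_url"]["url"]
            if url.startswith(("http://", "https://")):
                part["image_url"]["url"] = url_to_data_url(url)
    return example

with open("dataset_with_urls.jsonl") as src, open("dataset_base64.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(convert_example(json.loads(line))) + "\n")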

Advanced Dataset Examples

Multi-image Conversation

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images and tell me the differences"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url", 
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
    }
  ]
}

Multi-turn Conversation

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
    },
    {
      "role": "user", 
      "content": [
        {
          "type": "text",
          "text": "Can you describe this kitchen layout for me?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
    },
    {
      "role": "user",
      "content": "Now look at this living room. Do you think the styles would work well together?"
    },
    {
      "role": "assistant", 
      "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Sorry about that! Here's the living room:"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
    }
  ]
}

Try with an Example Dataset

To get a feel for how VLM fine-tuning works, you can use an example vision dataset:
# Download the example dataset
curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
This is a food classification dataset whose assistant responses include <think></think> tags for reasoning.
Once downloaded, you can upload this dataset using the instructions in the next step.
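If you want a quick look at the data before uploading, a sanity check along these lines works (it assumes the file was saved as food_reasoning.jsonl, as in the curl command above):

import json

with open("food_reasoning.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} examples")
# Print the roles and content part types of the first example
for message in examples[0]["messages"]:
    content = message["content"]
    kinds = [part["type"] for part in content] if isinstance(content, list) else "text"
    print(message["role"], kinds)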
2. Upload your VLM dataset

Upload your prepared JSONL dataset to Fireworks for training:
firectl create dataset my-vlm-dataset /path/to/vlm_training_data.jsonl
For larger datasets (>500MB), use firectl as it handles large uploads more reliably than the web interface.
3. Launch VLM fine-tuning job

Create a supervised fine-tuning job for your VLM:
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-custom-vlm \
  --epochs 3
For additional parameters like learning rates, evaluation datasets, and batch sizes, see Additional SFT job settings.
VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.
4. Monitor training progress

Track your VLM fine-tuning job in the Fireworks console.
Monitor key metrics:
  • Training loss: Should generally decrease over time
  • Evaluation loss: Monitor for overfitting if you provided an evaluation dataset
  • Training progress: Epochs completed and estimated time remaining
Your VLM fine-tuning job is complete when the status shows COMPLETED and your custom model is ready for deployment.
5. Deploy your fine-tuned VLM

Once training is complete, deploy your custom VLM:
# Create a deployment for your fine-tuned VLM
firectl create deployment my-custom-vlm

# Check deployment status
firectl get deployment accounts/your-account/deployment/deployment-id

Advanced Configuration

For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the Additional SFT job settings section in our comprehensive fine-tuning guide.

Testing Your Fine-tuned VLM

After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/your-account/models/my-custom-vlm",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg"
            },
        },{
            "type": "text",
            "text": "What's in this image?",
        }],
    }]
)
print(response.choices[0].message.content)
If you fine-tuned using the example dataset, your model should include <think></think> tags in its response.
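Because the API is OpenAI-compatible and also accepts base64 data URLs at inference time, you can test with a local image in the same way. This sketch reuses the client from the example above; the file name is illustrative:

import base64

# Encode a local image as a data URL (same format used in the training data)
with open("local_test_image.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="accounts/your-account/models/my-custom-vlm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
print(response.choices[0].message.content)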