VLM fine-tuning is currently supported for Qwen 2.5 VL models only.
This guide covers fine-tuning for Vision-Language Models (VLMs) that process both images and text. For fine-tuning text-only models, see our Supervised fine-tuning for text guide.
Supported Models
Currently, VLM fine-tuning supports:
- Qwen 2.5 VL 3B Instruct - `accounts/fireworks/models/qwen2p5-vl-3b-instruct`
- Qwen 2.5 VL 7B Instruct - `accounts/fireworks/models/qwen2p5-vl-7b-instruct`
- Qwen 2.5 VL 32B Instruct - `accounts/fireworks/models/qwen2p5-vl-32b-instruct`
- Qwen 2.5 VL 72B Instruct - `accounts/fireworks/models/qwen2p5-vl-72b-instruct`
Understanding LoRA for VLMs
LoRA significantly reduces the computational and memory requirements for fine-tuning large vision-language models. Instead of updating billions of parameters directly, LoRA learns small “adapter” layers that capture the changes needed for your specific task.
Key benefits of LoRA for VLMs:
- Efficiency: Requires significantly less memory and compute than full fine-tuning
- Speed: Faster training times while maintaining high-quality results
- Flexibility: Up to 100 LoRA adaptations can run simultaneously on a dedicated deployment
- Cost-effective: Lower training costs compared to full parameter fine-tuning
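For intuition only, here is a minimal sketch of the LoRA idea in NumPy. This is not the Fireworks training implementation; it just shows why the adapter approach is cheap: the pretrained weight matrix stays frozen and only two small low-rank matrices are trained.

```python
import numpy as np

# Frozen pretrained weight matrix (e.g. one attention projection in the VLM).
d = 4096
W = np.random.randn(d, d)

# LoRA adapter: two small rank-r matrices are the only trainable parameters.
r, alpha = 16, 32
A = np.random.randn(r, d) * 0.01
B = np.zeros((d, r))

# Effective weight at inference: the frozen base plus the low-rank update.
W_effective = W + (alpha / r) * (B @ A)

# Trainable parameters drop from d*d (~16.8M) to 2*d*r (~131K) for this layer.
print(W.size, A.size + B.size)
```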
Fine-tuning Pricing
VLM fine-tuning pricing is based on actual usage:
- Training: You pay per token of training data used during the VLM fine-tuning process - see the pricing page for cost per token.
- Image processing: Images are tokenized based on resolution and model - typically 1,000-2,500 tokens per image (see this FAQ for more details)
VLM fine-tuning may cost more than text-only fine-tuning due to the additional tokens for processing images alongside text.
Optimize image sizes: Smaller images use fewer tokens during training and inference, and are therefore faster and cheaper. Use the smallest image size that still provides enough detail for your task.
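For example, you can downscale images before base64-encoding them. A minimal sketch using Pillow (an assumed dependency; any image library works):

```python
from PIL import Image  # assumption: Pillow is installed

# Downscale so the longest side is at most 1024 px before base64 encoding.
img = Image.open("photo.jpg")
img.thumbnail((1024, 1024))  # preserves aspect ratio, resizes in place
img.save("photo_small.jpg", quality=85)
```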
Fine-tuning a VLM using LoRA
1
Prepare your vision dataset
Vision datasets must be in JSONL format using the OpenAI-compatible chat format. Each line represents a complete training example.
Dataset Requirements:
- Format: `.jsonl` file
- Minimum examples: 3
- Maximum examples: 3 million per dataset
- Images: Must be base64 encoded with proper MIME type prefixes
- Supported image formats: PNG, JPG, JPEG
- Messages: each example contains a `messages` array where each message has:
  - `role`: one of `system`, `user`, or `assistant`
  - `content`: an array containing text and image objects, or just text
You can use a script to automatically convert your dataset to the correct format; once converted, you can upload the dataset using the instructions in the next step.
Basic VLM Dataset Example
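A minimal sketch of one training example, assuming the OpenAI-style `image_url` content type. The base64 payload is truncated and the example is pretty-printed for readability; in the actual `.jsonl` file each example must be a single line.

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that describes images."
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "The image shows a plate of spaghetti with tomato sauce."
    }
  ]
}
```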
If your dataset contains image URLs
Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you will need to download and encode them to base64.
❌ Incorrect Format - Raw HTTP/HTTPS URLs are not supported and will NOT work.
✅ Correct Format - Use the `data:image/jpeg;base64,` prefix followed by the base64-encoded image data.
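For illustration, a hedged sketch of the same image entry in both forms (base64 payload truncated):

Not supported - raw URL:

```json
{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
```

Supported - base64 data URI:

```json
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}}
```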
Python script to download and encode images to base64
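A minimal sketch of such a script, assuming the `requests` package and the message structure described above. It walks a `.jsonl` dataset, downloads any raw image URLs, and rewrites them as base64 data URIs.

```python
import base64
import json
import mimetypes

import requests  # assumption: requests is installed; any HTTP client works


def url_to_data_uri(url: str) -> str:
    """Download an image and return it as a base64 data URI."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Prefer the server-reported content type; fall back to guessing from the URL.
    mime_type = response.headers.get("Content-Type") or mimetypes.guess_type(url)[0] or "image/jpeg"
    encoded = base64.b64encode(response.content).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def convert_dataset(input_path: str, output_path: str) -> None:
    """Rewrite image_url entries that contain raw URLs into base64 data URIs."""
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            example = json.loads(line)
            for message in example["messages"]:
                content = message.get("content")
                if not isinstance(content, list):
                    continue  # text-only message
                for part in content:
                    if part.get("type") == "image_url":
                        url = part["image_url"]["url"]
                        if url.startswith(("http://", "https://")):
                            part["image_url"]["url"] = url_to_data_uri(url)
            fout.write(json.dumps(example) + "\n")


if __name__ == "__main__":
    import sys

    convert_dataset(sys.argv[1], sys.argv[2])
```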
Usage:
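For example, assuming the sketch above is saved as `convert_images_to_base64.py` (hypothetical file names; adjust to your own paths):

```bash
python convert_images_to_base64.py raw_dataset.jsonl vlm_dataset.jsonl
```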
Advanced Dataset Examples
Multi-image Conversation
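A hedged sketch of a single example with two images in one user turn (base64 payloads truncated, pretty-printed for readability):

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Compare these two dishes. Which one looks healthier?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "The first dish is deep-fried, while the second is a salad, so the second looks healthier."
    }
  ]
}
```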
Multi-turn Conversation
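A hedged sketch of a single example spanning several turns (base64 payload truncated, pretty-printed for readability):

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What dish is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "This looks like a margherita pizza."
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Roughly how many calories would one slice have?"}
      ]
    },
    {
      "role": "assistant",
      "content": "A typical slice of margherita pizza has roughly 200-250 calories."
    }
  ]
}
```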
Try with an Example Dataset
To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `<think></think>` tags for reasoning.
2
Upload your VLM dataset
Upload your prepared JSONL dataset to Fireworks for training:
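For example, with `firectl` (hypothetical dataset name and file path; you can also upload through the web console, and the exact syntax can be confirmed with `firectl create dataset --help`):

```bash
firectl create dataset my-vlm-dataset ./vlm_dataset.jsonl
```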
For larger datasets (>500MB), use `firectl`, as it handles large uploads more reliably than the web interface.
For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our External GCS Bucket Integration guide for setup details.
3
Launch VLM fine-tuning job
Create a supervised fine-tuning job for your VLM, as sketched below. For additional parameters like learning rates, evaluation datasets, and batch sizes, see Additional SFT job settings.
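A hedged sketch with `firectl` (hypothetical dataset and model names; confirm the exact flags for your version with `firectl create sftj --help`):

```bash
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-3b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-finetuned-qwen-vl
```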
VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.
4
Monitor training progress
Track your VLM fine-tuning job in the Fireworks console.
Monitor key metrics:

- Training loss: Should generally decrease over time
- Evaluation loss: Monitor for overfitting if using evaluation dataset
- Training progress: Epochs completed and estimated time remaining
Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment.
5
Deploy your fine-tuned VLM
Once training is complete, deploy your custom VLM:
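A hedged sketch with `firectl` (hypothetical account and model names; the exact deployment commands and flags for LoRA add-ons are covered in the deployment docs):

```bash
# Deploy the fine-tuned model (hypothetical account and model names).
firectl create deployment accounts/my-account/models/my-finetuned-qwen-vl

# Check that the deployment is ready.
firectl list deployments
```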
Advanced Configuration
For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the Additional SFT job settings section in our comprehensive fine-tuning guide.
Interactive Tutorial: Fine-tuning VLMs with Google Colab
For a hands-on, step-by-step walkthrough of VLM fine-tuning, we’ve created an interactive Google Colab notebook that demonstrates the complete process from dataset preparation to model deployment.
VLM Fine-tuning Tutorial
Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI
This comprehensive tutorial covers:
- Setting up your environment with Fireworks CLI
- Preparing vision datasets in the correct format
- Launching and monitoring VLM fine-tuning jobs
- Testing your fine-tuned model
- Best practices for VLM fine-tuning
The Colab notebook includes practical examples and can be run directly in your browser. It’s an excellent way to get started with VLM fine-tuning before setting up your own local environment.
Testing Your Fine-tuned VLM
After deployment, test your fine-tuned VLM using the same API patterns as base VLMs, as sketched below. If you fine-tuned using the example dataset, your model should include `<think></think>` tags in its response.
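A minimal sketch using the OpenAI-compatible Python client pointed at the Fireworks endpoint (hypothetical model ID, image path, and environment variable name):

```python
import base64
import os

from openai import OpenAI  # assumption: the openai package is installed

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # hypothetical env var holding your API key
)

# Encode a local test image as a data URI, matching the training format.
with open("test_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    # Hypothetical model ID; use the one shown for your fine-tuned model.
    model="accounts/my-account/models/my-finetuned-qwen-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What food is shown in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```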