Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets.
## Prepare your vision dataset

Your dataset must be a `.jsonl` file where each line contains a `messages` array. Each message has:

- `role`: one of `system`, `user`, or `assistant`
- `content`: an array containing text and image objects, or just text

Images are embedded inline using the `data:image/jpeg;base64,` prefix followed by the base64-encoded image data.

Python script to download and encode images to base64
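The script itself is elided on this page; below is a minimal sketch of building one JSONL record with an inline base64 image. The content-object keys (`type`, `image_url`) follow the OpenAI-style chat schema, and the question/answer strings and fake image bytes are placeholders — check the Fireworks dataset spec for the exact field names.

```python
import base64
import json

def image_to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL for the content array."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def make_record(question: str, answer: str, image_bytes: bytes) -> str:
    """Build one JSONL line: a messages array with text + image content."""
    record = {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(image_bytes)}},
            ]},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record)

# To download instead of reading locally, something like:
#   image_bytes = urllib.request.urlopen(url).read()
line = make_record("What is in this image?", "A red square.", b"\xff\xd8\xff\xe0fake")
print(line[:60])
```

Writing one such line per example produces a valid `.jsonl` training file.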
If your base model is a reasoning model, assistant messages should include `<think></think>` tags for reasoning.

## Upload your VLM dataset
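A minimal upload sketch using the `firectl` CLI — the dataset name and file path below are placeholders:

```shell
# Upload the prepared JSONL file as a named dataset.
firectl create dataset my-vlm-dataset path/to/vlm_data.jsonl

# Verify the dataset was created.
firectl get dataset my-vlm-dataset
```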
We recommend uploading with `firectl`, as it handles large uploads more reliably than the web interface.

## Launch VLM fine-tuning job
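The launch command is elided on this page; a sketch assuming `firectl`'s supervised fine-tuning job subcommand (`sftj`) — the base model ID, dataset name, and output model name are placeholders, so check `firectl create sftj --help` for the exact flags:

```shell
# Launch a supervised fine-tuning job on the uploaded dataset.
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2-vl-72b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-custom-vlm
```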
## Monitor training progress
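You can poll the job from the CLI; a sketch assuming the `firectl` job subcommand, with a placeholder job ID:

```shell
# Check the fine-tuning job; replace "my-job-id" with the ID returned at launch.
firectl get sftj my-job-id
```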
Once the job status reaches `COMPLETED`, training is finished and your custom model is ready for deployment.

## Deploy your fine-tuned VLM
If you trained on data with reasoning traces, the fine-tuned model will include `<think></think>` tags in its response.
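Once deployed, the model can be queried through Fireworks' OpenAI-compatible chat completions endpoint. The sketch below assumes the `requests` library, a hypothetical deployed model ID, and an API key in the environment; the `strip_think` helper removes the reasoning block when you only want the final answer.

```python
import re

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"

def build_payload(model: str, question: str, image_data_url: str) -> dict:
    """Chat payload with mixed text + image content (OpenAI-style schema)."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
    }

def strip_think(text: str) -> str:
    """Drop the <think>...</think> reasoning block, keeping the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

payload = build_payload(
    "accounts/my-account/models/my-custom-vlm",  # hypothetical model ID
    "What is in this image?",
    "data:image/jpeg;base64,...",  # real base64 data goes here
)

# To send the request (requires requests, os, and a valid key; not run here):
# resp = requests.post(API_URL, json=payload,
#     headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"})
# answer = strip_think(resp.json()["choices"][0]["message"]["content"])

print(strip_think("<think>reasoning...</think>The image shows a red square."))
# prints: The image shows a red square.
```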