Some Omni/multimodal models can process audio and/or video inputs directly, enabling video captioning, scene analysis, content understanding, and multimodal question answering. A good example is Qwen3 Omni (qwen3-omni-30b-a3b-instruct), which supports video, audio, and text inputs in a single request. Deploy these models using dedicated deployments for production workloads.
Available models
| Model | Input support | Notes |
| --- | --- | --- |
| Qwen3 Omni 30B A3B Instruct | Video, audio, text | Dedicated deployment required |
| Molmo2-4B | Video, text | Dedicated deployment required |
| Molmo2-8B | Video, text | Dedicated deployment required |
Qwen3 Omni supports native video and audio inputs. Molmo2 models are video-only: use the same request structure shown below but omit the audio_url part, since Molmo2 models cannot understand audio from videos.
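For Molmo2, the content array simply drops the audio part. Here is a minimal sketch; the model path is a placeholder, and video_b64 is a base64-encoded clip as shown in the sections below:

# Molmo2-style payload sketch: video and text only, no audio_url.
# <MOLMO2_MODEL> is a placeholder -- substitute the actual model slug.
payload = {
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/<MOLMO2_MODEL>#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        }
    ],
}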
Create a deployment
Video and audio models require dedicated deployments. Create one using firectl:
firectl deployment create qwen3-omni-30b-a3b-instruct \
--account-id <YOUR_ACCOUNT_ID> \
--min-replica-count 1 \
--max-replica-count 1 \
--deployment-shape qwen3-omni-30b-a3b-instruct-minimal
Make sure to use the predefined qwen3-omni-30b-a3b-instruct-minimal deployment shape for your deployment to work correctly.
Chat Completions API
Provide video and audio as base64-encoded data URLs. The model accepts video_url, audio_url, and text content types.
import os
import base64
import requests

# Load and encode your preprocessed video and audio
with open("processed_video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")
with open("audio.ogg", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# API configuration
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
}

# Request payload
payload = {
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        },
    ],
}

# Send request
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
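The endpoint is OpenAI-compatible, so the same request can also be sent with the openai Python SDK, which passes the content list through unchanged. A sketch, assuming the SDK is installed and video_b64/audio_b64 are encoded as above:

import os
from openai import OpenAI

# Point the OpenAI SDK at the Fireworks inference endpoint
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    max_tokens=1000,
    temperature=0.3,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        }
    ],
)
print(response.choices[0].message.content)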
The same request with curl:

# Encode your files (run these separately)
VIDEO_B64=$(base64 -i processed_video.mp4)
AUDIO_B64=$(base64 -i audio.ogg)

curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,'"$VIDEO_B64"'"}},
          {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,'"$AUDIO_B64"'"}},
          {"type": "text", "text": "Describe what happens in this video."}
        ]
      }
    ]
  }'
Working with videos
Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.
Preprocessing video
Extract frames at 1 FPS and downscale to 360p for efficient processing:
ffmpeg -y -i input_video.mp4 \
-t 60 \
-vf "fps=1,scale=-1:360" \
-c:v libx264 -preset fast \
-an \
processed_video.mp4
| Parameter | Description |
| --- | --- |
| -t 60 | Limit to first 60 seconds |
| fps=1 | Extract 1 frame per second |
| scale=-1:360 | Downscale to 360p height, maintain aspect ratio |
| -an | Remove audio track (extracted separately) |
Preprocessing audio
Extract audio as Opus in an Ogg container for optimal compression:
ffmpeg -y -i input_video.mp4 \
-t 60 \
-vn \
-c:a libopus \
-b:a 24k \
-ar 16000 \
-ac 1 \
audio.ogg
| Parameter | Description |
| --- | --- |
| -t 60 | Limit to first 60 seconds |
| -vn | Remove video track |
| -c:a libopus | Use Opus codec |
| -b:a 24k | 24 kbps bitrate |
| -ar 16000 | 16 kHz sample rate |
| -ac 1 | Mono audio |
Complete preprocessing example
import subprocess
import tempfile
import base64
import os

def preprocess_video(video_path: str) -> tuple[str, str]:
    """
    Preprocess video for optimal model input.

    Returns:
        Tuple of (video_base64, audio_base64)
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_video:
        processed_video_path = tmp_video.name
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp_audio:
        audio_path = tmp_audio.name

    try:
        # Process video: 1 FPS, 360p, max 60 seconds
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vf", "fps=1,scale=-1:360",
            "-c:v", "libx264", "-preset", "fast",
            "-an",
            processed_video_path,
        ], check=True, capture_output=True)

        # Extract audio: Opus/Ogg, mono, 16kHz, 24kbps
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vn",
            "-c:a", "libopus",
            "-b:a", "24k",
            "-ar", "16000",
            "-ac", "1",
            audio_path,
        ], check=True, capture_output=True)

        with open(processed_video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")
        with open(audio_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")
        return video_b64, audio_b64
    finally:
        os.unlink(processed_video_path)
        os.unlink(audio_path)
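Once preprocessed, the outputs plug directly into the earlier request. A minimal usage sketch:

# Preprocess once, then reuse the encoded strings in the payload above
video_b64, audio_b64 = preprocess_video("input_video.mp4")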
Preprocessing is highly recommended to reduce latency and ensure consistent performance.
Tips for optimal throughput:
Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
Limit video duration – Cap at 60 seconds for consistent performance
Use dedicated deployments – Scale replicas based on your throughput needs
Known limitations
Video duration: Maximum 60 seconds recommended for optimal performance
Supported formats: .mp4 for video, .ogg (Opus) for audio
Base64 size: Total encoded payload should be under 10MB
Deployment required: Video models are not available on serverless; a dedicated deployment is required
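The 10MB payload limit above can be checked before sending. A sketch, reusing video_b64/audio_b64 from earlier; the threshold simply mirrors the documented limit:

# Pre-flight check against the 10MB encoded-payload limit noted above
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024

encoded_size = len(video_b64) + len(audio_b64)
if encoded_size > MAX_PAYLOAD_BYTES:
    raise ValueError(
        f"Encoded media is {encoded_size / (1024 * 1024):.1f}MB; "
        "trim the clip or lower the resolution before sending."
    )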
Related resources:
Chat with Video using Qwen3 Omni – Interactive notebook for video and audio analysis
Vision models – Query models with image inputs
Dedicated deployments – Deploy models on dedicated GPUs