
Alibaba Cloud Model Studio: Qwen-Omni

Last Updated: May 09, 2025

Qwen-Omni supports multiple input modalities, including video, audio, image, and text. It can output audio and text.

Overview and billing

Compared to Qwen-VL, Qwen-Omni can:

  • Understand visual and audio information in video files.

  • Understand multimodal data.

  • Output audio.

Qwen-Omni also excels at visual and audio understanding.

Open source

Name: qwen2.5-omni-7b

Context (tokens): 32,768

Maximum input (tokens): 30,720

Maximum output (tokens): 2,048

Free quota: 1 million tokens (regardless of modality), valid for 180 days after activation

You cannot use the open-source Qwen-Omni models after the free quota runs out. Please stay tuned for updates.

Calculate audio, image, and video tokens

Audio

Each second of audio corresponds to 25 tokens. Audio shorter than 1 second is counted as 25 tokens.
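As a rough estimate in code (a minimal sketch; whether fractional seconds are rounded up is an assumption here):

import math

def audio_tokens(duration_seconds):
    # 25 tokens per second of audio; anything under 1 second still counts as 25 tokens.
    # Rounding partial seconds up is an assumption, not documented behavior.
    return max(25, math.ceil(duration_seconds) * 25)

print(audio_tokens(0.5))   # 25
print(audio_tokens(12.0))  # 300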

Image

Each 28 × 28 pixel block corresponds to 1 token. Each image converts to at least 4 tokens and at most 1,280 tokens. Run the following code to estimate the token count of an image.

import math
# Use the following command to install Pillow library: pip install Pillow
from PIL import Image

def token_calculate(image_path):
    # Open the specified PNG image file
    image = Image.open(image_path)
    # Get the original dimensions of the image
    height = image.height
    width = image.width
    # Adjust height to a multiple of 28
    h_bar = round(height / 28) * 28
    # Adjust width to a multiple of 28
    w_bar = round(width / 28) * 28
    # Image token lower limit: 4 tokens
    min_pixels = 28 * 28 * 4
    # Image token upper limit: 1280 tokens
    max_pixels = 1280 * 28 * 28
    # Scale the image to adjust the total number of pixels within the range [min_pixels, max_pixels]
    if h_bar * w_bar > max_pixels:
        # Calculate scaling factor beta so that the scaled image's total pixels do not exceed max_pixels
        beta = math.sqrt((height * width) / max_pixels)
        # Recalculate the adjusted height, ensuring it's a multiple of 28
        h_bar = math.floor(height / beta / 28) * 28
        # Recalculate the adjusted width, ensuring it's a multiple of 28
        w_bar = math.floor(width / beta / 28) * 28
    elif h_bar * w_bar < min_pixels:
        # Calculate scaling factor beta so that the scaled image's total pixels are not less than min_pixels
        beta = math.sqrt(min_pixels / (height * width))
        # Recalculate the adjusted height, ensuring it's a multiple of 28
        h_bar = math.ceil(height * beta / 28) * 28
        # Recalculate the adjusted width, ensuring it's a multiple of 28
        w_bar = math.ceil(width * beta / 28) * 28
    print(f"Scaled image dimensions: height {h_bar}, width {w_bar}")
    # Calculate the token count for the image: total pixels divided by 28 * 28
    token = int((h_bar * w_bar) / (28 * 28))
    # The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
    total_token = token + 2
    print(f"Image token count is {total_token}")    
    return total_token
if __name__ == "__main__":
    total_token = token_calculate(image_path="test.png")

Node.js

// Use the following command to install sharp: npm install sharp
import sharp from 'sharp';

async function tokenCalculate(imagePath) {
    // Open the specified PNG image file
    const image = sharp(imagePath);
    const metadata = await image.metadata();

    // Get the original dimensions of the image
    const height = metadata.height;
    const width = metadata.width;

    // Adjust height to a multiple of 28
    let hBar = Math.round(height / 28) * 28;
    // Adjust width to a multiple of 28
    let wBar = Math.round(width / 28) * 28;

    // Image token lower limit: 4 tokens
    const minPixels = 28 * 28 * 4;
    // Image token upper limit: 1280 tokens
    const maxPixels = 1280 * 28 * 28;

    // Scale the image to adjust the total number of pixels within the range [min_pixels, max_pixels]
    if (hBar * wBar > maxPixels) {
        // Calculate scaling factor beta so that the scaled image's total pixels do not exceed max_pixels
        const beta = Math.sqrt((height * width) / maxPixels);
        // Recalculate the adjusted height, ensuring it's a multiple of 28
        hBar = Math.floor(height / beta / 28) * 28;
        // Recalculate the adjusted width, ensuring it's a multiple of 28
        wBar = Math.floor(width / beta / 28) * 28;
    } else if (hBar * wBar < minPixels) {
        // Calculate scaling factor beta so that the scaled image's total pixels are not less than min_pixels
        const beta = Math.sqrt(minPixels / (height * width));
        // Recalculate the adjusted height, ensuring it's a multiple of 28
        hBar = Math.ceil(height * beta / 28) * 28;
        // Recalculate the adjusted width, ensuring it's a multiple of 28
        wBar = Math.ceil(width * beta / 28) * 28;
    }
    console.log(`Scaled image dimensions: height ${hBar}, width ${wBar}`);
    // Calculate the token count for the image: total pixels divided by 28 * 28
    const token = Math.floor((hBar * wBar) / (28 * 28));
    // The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
    console.log(`Total image token count is ${token + 2}`);
    const totalToken = token + 2;
    return totalToken;
}

// Replace test.png with your local image path
tokenCalculate('test.png').catch(err => {
    console.error('Error processing image:', err);
});

Video

Tokens in video files are divided into video_tokens and audio_tokens.

  • video_tokens

    The calculation is relatively complex. Sample code:

    # Before using, install: pip install opencv-python
    import math
    import os
    import logging
    import cv2
    
    logger = logging.getLogger(__name__)
    
    # Fixed parameters
    FRAME_FACTOR = 2
    IMAGE_FACTOR = 28
    # Maximum aspect ratio of video frames
    MAX_RATIO = 200
    
    # Video frame token lower limit
    VIDEO_MIN_PIXELS = 128 * 28 * 28
    # Video frame token upper limit
    VIDEO_MAX_PIXELS = 768 * 28 * 28
    
    # Qwen-Omni model FPS is 2
    FPS = 2
    # Minimum number of frames to extract
    FPS_MIN_FRAMES = 4
    # Maximum number of frames to extract
    FPS_MAX_FRAMES = 512
    
    # Maximum pixel value for video input
    VIDEO_TOTAL_PIXELS = 65536 * 28 * 28
    
    def round_by_factor(number, factor):
        return round(number / factor) * factor
    
    def ceil_by_factor(number, factor):
        return math.ceil(number / factor) * factor
    
    def floor_by_factor(number, factor):
        return math.floor(number / factor) * factor
    
    def get_video(video_path):
        cap = cv2.VideoCapture(video_path)
        frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        video_fps = cap.get(cv2.CAP_PROP_FPS)
        cap.release()
        return frame_height, frame_width, total_frames, video_fps
    
    def smart_nframes(total_frames, video_fps):
        min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
        max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
        duration = total_frames / video_fps if video_fps != 0 else 0
        if duration - int(duration) > (1 / FPS):
            total_frames = math.ceil(duration * video_fps)
        else:
            total_frames = math.ceil(int(duration) * video_fps)
        nframes = total_frames / video_fps * FPS
        nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
        if not (FRAME_FACTOR <= nframes <= total_frames):
            raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
        return nframes
    
    def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
        min_pixels = VIDEO_MIN_PIXELS
        total_pixels = VIDEO_TOTAL_PIXELS
        max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
        if max(height, width) / min(height, width) > MAX_RATIO:
            raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
        h_bar = max(factor, round_by_factor(height, factor))
        w_bar = max(factor, round_by_factor(width, factor))
        if h_bar * w_bar > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h_bar = floor_by_factor(height / beta, factor)
            w_bar = floor_by_factor(width / beta, factor)
        elif h_bar * w_bar < min_pixels:
            beta = math.sqrt(min_pixels / (height * width))
            h_bar = ceil_by_factor(height * beta, factor)
            w_bar = ceil_by_factor(width * beta, factor)
        return h_bar, w_bar
    
    def video_token_calculate(video_path):
        height, width, total_frames, video_fps = get_video(video_path)
        nframes = smart_nframes(total_frames, video_fps)
        resized_height, resized_width = smart_resize(height, width, nframes)
        video_token = int(math.ceil(nframes / FPS) * resized_height / 28 * resized_width / 28)
        video_token += 2  # visual markers
        return video_token
    
    if __name__ == "__main__":
        video_path = "spring_mountain.mp4"  # your video path
        video_token = video_token_calculate(video_path)
        print("video_tokens:", video_token)

    Node.js

    // Before using, install: npm install node-ffprobe @ffprobe-installer/ffprobe
    import ffprobeInstaller from '@ffprobe-installer/ffprobe';
    import ffprobe from 'node-ffprobe';
    // Set ffprobe path
    ffprobe.FFPROBE_PATH = ffprobeInstaller.path;
    
    // Get video information
    async function getVideoInfo(videoPath) {
      try {
        const probeData = await ffprobe(videoPath);
        const videoStream = probeData.streams.find(s => s.codec_type === 'video');
        if (!videoStream) throw new Error('No video stream found');
        const { width, height, nb_frames: totalFrames, avg_frame_rate } = videoStream;
        const [numerator, denominator] = avg_frame_rate.split('/');
        const frameRate = parseFloat(numerator) / parseFloat(denominator);
        return { width, height, totalFrames: parseInt(totalFrames), frameRate };
      } catch (error) {
        console.error('Failed to get video information:', error.message);
        throw error;
      }
    }
    
    // Constants configuration
    const CONFIG = {
      FRAME_FACTOR: 2,
      IMAGE_FACTOR: 28,
      MAX_RATIO: 200,
      VIDEO_MIN_PIXELS: 128 * 28 * 28,
      VIDEO_MAX_PIXELS: 768 * 28 * 28,
      FPS: 2,
      FPS_MIN_FRAMES: 4,
      FPS_MAX_FRAMES: 512,
      VIDEO_TOTAL_PIXELS: 65536 * 28 * 28,
    };
    
    // Factor rounding utilities
    function byFactor(number, factor, mode = 'round') {
      if (mode === 'ceil') return Math.ceil(number / factor) * factor;
      if (mode === 'floor') return Math.floor(number / factor) * factor;
      return Math.round(number / factor) * factor;
    }
    
    // Calculate frame extraction count
    function smartNFrames(ele, totalFrames, frameRate) {
      const fps = ele.fps || CONFIG.FPS;
      const minFrames = byFactor(ele.min_frames || CONFIG.FPS_MIN_FRAMES, CONFIG.FRAME_FACTOR, 'ceil');
      const maxFrames = byFactor(
        ele.max_frames || Math.min(CONFIG.FPS_MAX_FRAMES, totalFrames),
        CONFIG.FRAME_FACTOR,
        'floor'
      );
      const duration = frameRate ? totalFrames / frameRate : 0;
      let totalFramesAdjusted = duration % 1 > (1 / fps)
        ? Math.ceil(duration * frameRate)
        : Math.ceil(Math.floor(duration) * frameRate);
      const nframes = (totalFramesAdjusted / frameRate) * fps;
      const finalNFrames = Math.min(
        Math.max(nframes, minFrames),
        Math.min(maxFrames, totalFramesAdjusted)
      );
      if (finalNFrames < CONFIG.FRAME_FACTOR || finalNFrames > totalFramesAdjusted) {
        throw new Error(`Frame count should be between ${CONFIG.FRAME_FACTOR} and ${totalFramesAdjusted}, current: ${finalNFrames}`);
      }
      return Math.floor(finalNFrames);
    }
    
    // Smart resolution adjustment
    async function smartResize(ele, videoInfo) {
      const { height, width, totalFrames, frameRate } = videoInfo;
      const minPixels = CONFIG.VIDEO_MIN_PIXELS;
      const nframes = smartNFrames(ele, totalFrames, frameRate);
      const maxPixels = Math.max(
        Math.min(CONFIG.VIDEO_MAX_PIXELS, CONFIG.VIDEO_TOTAL_PIXELS / nframes * CONFIG.FRAME_FACTOR),
        Math.floor(minPixels * 1.05)
      );
      const ratio = Math.max(height, width) / Math.min(height, width);
      if (ratio > CONFIG.MAX_RATIO) throw new Error(`Aspect ratio ${ratio} exceeds limit ${CONFIG.MAX_RATIO}`);
      let hBar = Math.max(CONFIG.IMAGE_FACTOR, byFactor(height, CONFIG.IMAGE_FACTOR));
      let wBar = Math.max(CONFIG.IMAGE_FACTOR, byFactor(width, CONFIG.IMAGE_FACTOR));
      if (hBar * wBar > maxPixels) {
        const beta = Math.sqrt((height * width) / maxPixels);
        hBar = byFactor(height / beta, CONFIG.IMAGE_FACTOR, 'floor');
        wBar = byFactor(width / beta, CONFIG.IMAGE_FACTOR, 'floor');
      } else if (hBar * wBar < minPixels) {
        const beta = Math.sqrt(minPixels / (height * width));
        hBar = byFactor(height * beta, CONFIG.IMAGE_FACTOR, 'ceil');
        wBar = byFactor(width * beta, CONFIG.IMAGE_FACTOR, 'ceil');
      }
      return { hBar, wBar };
    }
    
    // Calculate token count
    async function tokenCalculate(videoPath) {
      const messages = [{ content: [{ video: videoPath, FPS: CONFIG.FPS }] }];
      const visionInfos = extractVisionInfo(messages);
      const videoInfo = await getVideoInfo(videoPath);
      const { hBar, wBar } = await smartResize(visionInfos[0], videoInfo);
      const { totalFrames, frameRate } = videoInfo;
      const numFrames = smartNFrames(visionInfos[0], totalFrames, frameRate);
      const videoToken = Math.ceil(numFrames / 2) * Math.floor(hBar / 28) * Math.floor(wBar / 28) + 2;
      return videoToken;
    }
    
    // Extract visual information
    function extractVisionInfo(conversations) {
      const visionInfos = [];
      if (!Array.isArray(conversations)) conversations = [conversations];
      conversations.forEach(conversation => {
        if (!Array.isArray(conversation)) conversation = [conversation];
        conversation.forEach(message => {
          if (Array.isArray(message.content)) {
            message.content.forEach(ele => {
              if (ele.image || ele.image_url || ele.video || ['image', 'image_url', 'video'].includes(ele.type)) {
                visionInfos.push(ele);
              }
            });
          }
        });
      });
      return visionInfos;
    }
    
    async function main() {
      try {
        const videoPath = "spring_mountain.mp4"; // Replace with local path
        const videoToken = await tokenCalculate(videoPath);
        console.log('Video tokens:', videoToken);
      } catch (error) {
        console.error('Error:', error.message);
      }
    }
    
    main();
  • audio_tokens

    Each second of audio corresponds to 25 tokens. If the audio is shorter than 1 second, it is calculated as 25 tokens.
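    To estimate the total for a video file, add both parts. A minimal sketch that reuses video_token_calculate from the Python sample code above; the audio duration is supplied by you, and rounding partial seconds up is an assumption:

    import math

    def video_file_tokens(video_path, audio_duration_seconds):
        # Visual part, computed by the Python sample code above
        video_tokens = video_token_calculate(video_path)
        # Audio part: 25 tokens per second, minimum 25 tokens
        audio_tokens = max(25, math.ceil(audio_duration_seconds) * 25)
        return video_tokens + audio_tokens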

Usage notes

Input

Input modalities

The following input combinations are supported: text only, image + text, audio + text, and video + text, as demonstrated in the sections below.

Do not put multiple non-text modalities in a single user message.
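For example, this user message combining one image with text is valid (a sketch in the OpenAI-compatible message format used below; the image URL comes from the samples in this topic):

valid_message = {
    "role": "user",
    "content": [
        # One non-text item (an image here) plus text in a single user message is supported.
        {
            "type": "image_url",
            "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
        },
        {"type": "text", "text": "What scene is depicted in this image?"},
    ],
}
# Not supported: an image item and an input_audio item in the same user message.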

Input method

Image, audio, and video files can be provided as Base64-encoded data or as public URLs. The following sample codes use public URLs. To use Base64-encoded files, see Input Base64 encoded local file.

Output

Currently, Qwen-Omni only supports streaming output.

Output modalities

The output can include text and audio. Use the modalities parameter to control which modalities are returned:

  • Text: modalities=["text"] (default). Response style: relatively formal.

  • Text+Audio: modalities=["text","audio"]. Response style: casual, and guides the user to further communicate.

When the output modalities include audio, do not set a system message.
The output audio is Base64-encoded and must be decoded; see Parse Base64 encoded audio output.

Audio languages

Currently, the output audio supports only Mandarin Chinese and English.

Audio voices

The audio parameter controls the voice and the audio format (only "wav" is supported). Example: audio={"voice": "Chelsie", "format": "wav"}.

Valid values for voice: ["Ethan", "Chelsie"].


Get started

Prerequisites

Qwen-Omni supports only OpenAI-compatible calls. You must first obtain an API key and set it as an environment variable. If you use the OpenAI SDK, you must also install the SDK. We recommend installing the latest version by following this topic; otherwise, your requests may fail.

You must have OpenAI Python SDK version 1.52.0 or later, or OpenAI Node.js SDK version 4.68.0 or later.
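To check which version of the Python SDK is installed, for example (the openai package exposes a version string):

import openai
print(openai.__version__)  # should be 1.52.0 or later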

Text input

Qwen-Omni can accept plain text as input. Currently, only streaming output is supported.

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[{"role": "user", "content": "Who are you"}],
    # Set output data modalities, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        { role: "user", content: "Who are you?" }
    ],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen2.5-omni-7b",
    "messages": [
        {
            "role": "user", 
            "content": "Who are you?"
        }
    ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    },
    "modalities":["text","audio"],
    "audio":{"voice":"Chelsie","format":"wav"}
}'

Image+Text input

Qwen-Omni can accept multiple images at a time. The requirements for input images are:

  • The size of a single image file must not exceed 10 MB.

  • The number of images is limited by the model's total token limit for text and images (that is, maximum input). The total token count of all images must be less than the model's maximum input.

  • The width and height of images must be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.

Learn about supported image formats.
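Before sending a request, you can check these limits locally. A minimal sketch, assuming Pillow is installed (pip install Pillow):

import os
from PIL import Image

def check_image(path, max_mb=10, min_side=10, max_ratio=200):
    # File size must not exceed 10 MB
    if os.path.getsize(path) > max_mb * 1024 * 1024:
        return False
    with Image.open(path) as im:
        width, height = im.size
    # Width and height must be greater than 10 pixels
    if width <= min_side or height <= min_side:
        return False
    # Aspect ratio must not exceed 200:1 or 1:200
    return max(width, height) / min(width, height) <= max_ratio

print(check_image("test.png"))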

Currently, only streaming output is supported.

The following sample codes use public image URLs. To use local files, see Input Base64 encoded local file.

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "What scene is depicted in this image?"},
            ],
        },
    ],
    # Set output data modalities, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={
        "include_usage": True
    }
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": { "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg" },
            },
            { "type": "text", "text": "What scene is depicted in this image?" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen2.5-omni-7b",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "What scene is depicted in this image?"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    },
    "modalities":["text","audio"],
    "audio":{"voice":"Chelsie","format":"wav"}
}'

Audio+Text input

Qwen-Omni can accept only one audio file at a time, with a size limit of 10 MB and a duration limit of 3 minutes. Currently, only streaming output is supported.

The following sample codes use public audio URLs. To use local files, see Input Base64 encoded local file.

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav",
                    },
                },
                {"type": "text", "text": "What is this audio saying"},
            ],
        },
    ],
    # Set output data modalities, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav", "format": "wav" },
            },
            { "type": "text", "text": "What is this audio saying" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen2.5-omni-7b",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
            "format": "wav"
          }
        },
        {
          "type": "text",
          "text": "What is this audio saying"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    },
    "modalities":["text","audio"],
    "audio":{"voice":"Chelsie","format":"wav"}
}'

Video+Text input

Qwen-Omni can accept video as an image sequence or a video file (it can understand audio in the video). Currently, only streaming output is supported.

  • Image sequence

    At least 4 images and at most 80 images.

  • Video file

    Only one video file, with a size limit of 150 MB and a duration limit of 40 seconds.

    Visual and audio information in video files are billed separately.
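    A minimal local check of these limits, as a sketch that reuses OpenCV (already installed for the token-counting example above):

    import os
    import cv2

    def check_video_file(path, max_mb=150, max_seconds=40):
        # File size must not exceed 150 MB
        if os.path.getsize(path) > max_mb * 1024 * 1024:
            return False
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
        cap.release()
        # Duration must not exceed 40 seconds
        duration = frame_count / fps if fps else 0
        return duration <= max_seconds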

The following sample codes use public video URLs. To use local files, see Input Base64 encoded local file.

Image sequence

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg",
                    ],
                },
                {"type": "text", "text": "Describe the specific process in this video"},
            ],
        }
    ],
    # Set output data modalities, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [{
        role: "user",
        content: [
            {
                type: "video",
                video: [
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                    "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                ]
            },
            {
                type: "text",
                text: "Describe the specific process in this video"
            }
        ]
    }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen2.5-omni-7b",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    "type": "text",
                    "text": "Describe the specific process in this video"
                }
            ]
        }
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "Chelsie",
        "format": "wav"
    }
}'

Video file (Qwen-Omni can understand audio in the video)

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
                    },
                },
                {"type": "text", "text": "What is the content of the video?"},
            ],
        },
    ],
    # Set output data modalities, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "video_url",
                "video_url": { "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4" },
            },
            { "type": "text", "text": "What is the content of the video?" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});


for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen2.5-omni-7b",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
          }
        },
        {
          "type": "text",
          "text": "What is the content of the video"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options": {
        "include_usage": true
    },
    "modalities":["text","audio"],
    "audio":{"voice":"Chelsie","format":"wav"}
}'

Multi-round conversation

When using the multi-round conversation feature of Qwen-Omni, take note of the following:

  • Assistant message

    Assistant messages added to the messages array can only contain text.

  • User message

    A user message can only contain text and one type of non-text data. In multi-round conversations, you can put different types of data in different user messages.

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
    # If environment variable is not configured, replace the following line with: api_key="sk-xxx", using Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3",
                        "format": "mp3",
                    },
                },
                {"type": "text", "text": "What is this audio saying"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "This audio is saying: Welcome to Alibaba Cloud"}],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Can you introduce this company?"}],
        },
    ],
    # Set the modality of output data, currently supporting two types: ["text","audio"], ["text"]
    modalities=["text"],
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx", using Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3",
                        "format": "mp3",
                    },
                },
                { "type": "text", "text": "What is this audio saying" },
            ],
        },
        {
            "role": "assistant",
            "content": [{ "type": "text", "text": "This audio is saying: Welcome to Alibaba Cloud" }],
        },
        {
            "role": "user",
            "content": [{ "type": "text", "text": "Can you introduce this company?" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text"]
});


for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen2.5-omni-7b",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
          }
        },
        {
          "type": "text",
          "text": "What is this audio saying"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "This audio is saying: Welcome to Alibaba Cloud"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Can you introduce this company?"
        }
      ]
    }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "modalities": ["text"]
}'

Parse Base64 encoded audio output

Qwen-Omni streams the output audio as Base64-encoded data. You can maintain a string variable during generation, append the Base64 data from each chunk to it, and decode the full string when generation finishes. Alternatively, you can decode and play each chunk's Base64 data in real time.

# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio
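# Method 1 below additionally requires: pip install openai numpy soundfile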

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[{"role": "user", "content": "Who are you"}],
    # Set the output data modality, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

# Method 1: Decode after generation is complete
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception as e:
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("audio_assistant_py.wav", audio_np, samplerate=24000)

# Method 2: Decode while generating (comment out Method 1 code when using Method 2)
# # Initialize PyAudio
# import pyaudio
# import time
# p = pyaudio.PyAudio()
# # Create audio stream
# stream = p.open(format=pyaudio.paInt16,
#                 channels=1,
#                 rate=24000,
#                 output=True)

# for chunk in completion:
#     if chunk.choices:
#         if hasattr(chunk.choices[0].delta, "audio"):
#             try:
#                 audio_string = chunk.choices[0].delta.audio["data"]
#                 wav_bytes = base64.b64decode(audio_string)
#                 audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
#                 # Play audio data directly
#                 stream.write(audio_np.tobytes())
#             except Exception as e:
#                 print(chunk.choices[0].delta.audio["transcript"])

# time.sleep(0.8)
# # Clean up resources
# stream.stop_stream()
# stream.close()
# p.terminate()

Node.js

// Preparations before running:
// Windows/Mac/Linux common:
// 1. Make sure Node.js is installed (recommended version >= 14)
// 2. Run the following commands to install necessary dependencies:
//    npm install openai wav
// 
// If you want to use real-time playback (Method 2), you also need:
// Windows:
//    npm install speaker
// Mac:
//    brew install portaudio
//    npm install speaker
// Linux (Ubuntu/Debian):
//    sudo apt-get install libasound2-dev
//    npm install speaker

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": "Who are you?"
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

// Method 1: Decode after generation is complete
// Need to install: npm install wav
import { createWriteStream } from 'node:fs';  // node:fs is a built-in Node.js module, no need to install
import { Writer } from 'wav';

async function convertAudio(audioString, audioPath) {
    try {
        // Decode Base64 string to Buffer
        const wavBuffer = Buffer.from(audioString, 'base64');
        // Create WAV file write stream
        const writer = new Writer({
            sampleRate: 24000,  // Sample rate
            channels: 1,        // Mono channel
            bitDepth: 16        // 16-bit depth
        });
        // Create output file stream and establish pipe connection
        const outputStream = createWriteStream(audioPath);
        writer.pipe(outputStream);

        // Write PCM data and end writing
        writer.write(wavBuffer);
        writer.end();

        // Use Promise to wait for file writing to complete
        await new Promise((resolve, reject) => {
            outputStream.on('finish', resolve);
            outputStream.on('error', reject);
        });

        // Add extra waiting time to ensure audio is complete
        await new Promise(resolve => setTimeout(resolve, 800));

        console.log(`Audio file has been successfully saved as ${audioPath}`);
    } catch (error) {
        console.error('Error occurred during processing:', error);
    }
}

let audioString = "";
for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio) {
            if (chunk.choices[0].delta.audio["data"]) {
                audioString += chunk.choices[0].delta.audio["data"];
            }
        }
    } else {
        console.log(chunk.usage);
    }
}
// Execute conversion
convertAudio(audioString, "audio_assistant_mjs.wav");


// Method 2: Real-time playback while generating
// Need to install necessary components according to the system instructions above
// import Speaker from 'speaker'; // Import audio playback library

// // Create speaker instance (configuration consistent with WAV file parameters)
// const speaker = new Speaker({
//     sampleRate: 24000,  // Sample rate
//     channels: 1,        // Number of sound channels
//     bitDepth: 16,       // Bit depth
//     signed: true        // Signed PCM
// });
// for await (const chunk of completion) {
//     if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
//         if (chunk.choices[0].delta.audio) {
//             if (chunk.choices[0].delta.audio["data"]) {
//                 const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, 'base64');
//                 // Write directly to speaker for playback
//                 speaker.write(pcmBuffer);
//             }
//         }
//     } else {
//         console.log(chunk.usage);
//     }
// }
// speaker.on('finish', () => console.log('Playback complete'));
// speaker.end(); // Call based on actual API stream end condition

Input Base64 encoded local file

Image

Using eagle.png saved locally as an example.

import os
from openai import OpenAI
import base64

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


#  Base64 encoding format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


base64_image = encode_image("eagle.png")

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "What scene is depicted in this image?"},
            ],
        },
    ],
    # Set the output data modality, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
};
const base64Image = encodeImage("eagle.png")

const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": { "url": `data:image/png;base64,${base64Image}` },
            },
            { "type": "text", "text": "What scene is depicted in this image?" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

Audio

Using welcome.mp3 saved locally as an example.

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf
import requests

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


base64_audio = encode_audio("welcome.mp3")

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": f"data:;base64,{base64_audio}",
                        "format": "mp3",
                    },
                },
                {"type": "text", "text": "What is this audio saying"},
            ],
        },
    ],
    # Set the output data modality, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeAudio = (audioPath) => {
    const audioFile = readFileSync(audioPath);
    return audioFile.toString('base64');
};
const base64Audio = encodeAudio("welcome.mp3")

const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": `data:;base64,${base64Audio}`, "format": "mp3" },
            },
            { "type": "text", "text": "What is this audio saying" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

Video

Video files

Using spring_mountain.mp4 saved locally as an example.

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

#  Base64 encoding format
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")


base64_video = encode_video("spring_mountain.mp4")

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": f"data:;base64,{base64_video}"},
                },
                {"type": "text", "text": "What is she singing"},
            ],
        },
    ],
    # Set the output data modality, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
};
const base64Video = encodeVideo("spring_mountain.mp4")

const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "video_url",
                "video_url": { "url": `data:;base64,${base64Video}` },
            },
            { "type": "text", "text": "What is she singing" }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

Image sequence

Using football1.jpg, football2.jpg, football3.jpg and football4.jpg saved locally as examples.

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


#  Base64 encoding format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


base64_image_1 = encode_image("football1.jpg")
base64_image_2 = encode_image("football2.jpg")
base64_image_3 = encode_image("football3.jpg")
base64_image_4 = encode_image("football4.jpg")

completion = client.chat.completions.create(
    model="qwen2.5-omni-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": [
                        f"data:image/jpeg;base64,{base64_image_1}",
                        f"data:image/jpeg;base64,{base64_image_2}",
                        f"data:image/jpeg;base64,{base64_image_3}",
                        f"data:image/jpeg;base64,{base64_image_4}",
                    ],
                },
                {"type": "text", "text": "Describe the specific process of this video"},
            ],
        }
    ],
    # Set the output data modality, currently supports two types: ["text","audio"], ["text"]
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    # stream must be set to True, otherwise an error will occur
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta)
    else:
        print(chunk.usage)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")

const completion = await openai.chat.completions.create({
    model: "qwen2.5-omni-7b",
    messages: [{
        role: "user",
        content: [
            {
                type: "video",
                video: [
                    `data:image/jpeg;base64,${base64Image1}`,
                    `data:image/jpeg;base64,${base64Image2}`,
                    `data:image/jpeg;base64,${base64Image3}`,
                    `data:image/jpeg;base64,${base64Image4}`
                ]
            },
            {
                type: "text",
                text: "Describe the specific process of this video"
            }
        ]
    }],
    stream: true,
    stream_options: {
        include_usage: true
    },
    modalities: ["text", "audio"],
    audio: { voice: "Chelsie", format: "wav" }
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta);
    } else {
        console.log(chunk.usage);
    }
}

Error codes

If a call fails and an error message is returned, see Error messages.
