On the default coordinate system of Qwen2.5-VL and Qwen3-VL

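This article compares how Qwen2.5-VL and Qwen3-VL convert coordinates in object detection tasks. With Qwen2.5-VL, you must first obtain the size of the preprocessed image and then map the absolute coordinates output by the model back to the original image. Qwen3-VL instead uses a relative coordinate system (range 0-1000), so output coordinates can be mapped directly onto the original image size without handling image resizing. The key difference is that Qwen3-VL uses a normalized coordinate representation, which helps the model behave more consistently across images of different resolutions and aspect ratios.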

mr.sorghum

mr.sorghum · Published 2025-11-13 19:57:20

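Background

When a VLM is used for multimodal object detection, specific-target grounding, or OCR, the model returns bounding boxes for what it recognizes. Because the VLM preprocesses the input image (resizing it to a different size), the bounding box coordinates in the model output need to be converted. During my experiments I found that the coordinates output by Qwen2.5-VL and Qwen3-VL gave inconsistent results after conversion, so I wrote this post to share what I found.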

Qwen2.5-VL

Reference
https://github.com/QwenLM/Qwen3-VL/issues/931

All code below is adapted from the reference above.

Approach

1. Get the size (width, height) of the preprocessed image

# Single image processing
# (processor is the Qwen2.5-VL processor; image is a PIL image)
inputs = processor(images=[image], return_tensors="pt")

# One grid cell is 14x14 pixels. image_grid_thw is a list of [t, h_grid, w_grid]; for images, t is always 1.
input_height = int(inputs['image_grid_thw'][0][1]) * 14
input_width = int(inputs['image_grid_thw'][0][2]) * 14

# Multiple image processing (idx is the index of the image in the list)
inputs = processor(images=[image1, image2], return_tensors="pt")
input_height_idx = int(inputs['image_grid_thw'][idx][1]) * 14
input_width_idx = int(inputs['image_grid_thw'][idx][2]) * 14
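The size computation above can be wrapped in a small helper. The following is a minimal sketch that uses plain Python lists in place of the `image_grid_thw` tensor so it runs without a model. The name `grid_to_input_size` and the sample grid values are hypothetical; the patch size of 14 is taken from the comment in the code above.

```python
from typing import List, Sequence, Tuple

PATCH_SIZE = 14  # one grid cell is 14x14 pixels, per the reference code


def grid_to_input_size(
    grid_thw: Sequence[int], patch_size: int = PATCH_SIZE
) -> Tuple[int, int]:
    """Return (input_width, input_height) in pixels for one [t, h_grid, w_grid] entry."""
    _t, h_grid, w_grid = (int(v) for v in grid_thw)
    return w_grid * patch_size, h_grid * patch_size


# Hypothetical grid values standing in for inputs['image_grid_thw'] with two images
image_grid_thw: List[List[int]] = [[1, 36, 54], [1, 20, 20]]
sizes = [grid_to_input_size(grid) for grid in image_grid_thw]
print(sizes)  # [(756, 504), (280, 280)]
```

With a real processor output, passing `inputs['image_grid_thw'][idx].tolist()` should give the same kind of entry for image `idx`.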

2. Convert the coordinates output by the model

# Given the original image size, img.size: width, height
# Given the model input size from step 1: input_height, input_width

# Convert the model's output bounding box (xmin, ymin, xmax, ymax) [output_x1, output_y1, output_x2, output_y2] to actual coords [abs_x1, abs_y1, abs_x2, abs_y2]
abs_x1 = int(output_x1 / input_width * width)
abs_y1 = int(output_y1 / input_height * height)
abs_x2 = int(output_x2 / input_width * width)
abs_y2 = int(output_y2 / input_height * height)

# Visualization can then be performed on the original image using these absolute coordinates.

Then draw the detection box on the original image.
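As a worked example of the Qwen2.5-VL conversion, the sketch below applies the same formula to made-up numbers. The function name `model_to_original`, the 1920x1080 original size, and the 756x420 model input size are all assumptions chosen for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax)


def model_to_original(
    box: Box, input_size: Tuple[int, int], original_size: Tuple[int, int]
) -> Box:
    """Map a box from the resized model input back to original-image pixels."""
    output_x1, output_y1, output_x2, output_y2 = box
    input_width, input_height = input_size
    width, height = original_size
    return (
        int(output_x1 / input_width * width),
        int(output_y1 / input_height * height),
        int(output_x2 / input_width * width),
        int(output_y2 / input_height * height),
    )


# Hypothetical sizes: the original image is 1920x1080, the model saw 756x420
abs_box = model_to_original((378, 210, 756, 420), (756, 420), (1920, 1080))
print(abs_box)  # (960, 540, 1920, 1080)
```

The resulting tuple can be passed to a drawing routine such as Pillow's `ImageDraw.rectangle` on the original image.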
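Note: when building a dataset to train the model, the coordinates in the original annotations also need to be converted to the model's input coordinates.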

# Convert actual coords [abs_x1, abs_y1, abs_x2, abs_y2] to model input/output coords
input_or_output_x1 = int(abs_x1 / width * input_width)
input_or_output_y1 = int(abs_y1 / height * input_height)
input_or_output_x2 = int(abs_x2 / width * input_width)
input_or_output_y2 = int(abs_y2 / height * input_height)

// original
{
    'image': ['test.jpg'],
    'instruction': "What's this in [abs_x1, abs_y1, abs_x2, abs_y2] ?",
    'boxes': [abs_x1, abs_y1, abs_x2, abs_y2]
    ...
}

// converted
{
    'image': ['test.jpg'],
    'instruction': "What's this in [input_x1, input_y1, input_x2, input_y2] ?",
    'boxes': [input_x1, input_y1, input_x2, input_y2]
    ...
}

// original
{
    'image': ['test.jpg'],
    'instruction': "Where is the cat?",
    'response': "It's in [abs_x1, abs_y1, abs_x2, abs_y2]",
    'boxes': [abs_x1, abs_y1, abs_x2, abs_y2]
    ...
}

// converted
{
    'image': ['test.jpg'],
    'instruction': "Where is the cat?",
    'response': "It's in [output_x1, output_y1, output_x2, output_y2]",
    'boxes': [output_x1, output_y1, output_x2, output_y2]
    ...
}
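The sample conversion shown above could be automated roughly as follows. This is only a sketch under several assumptions: each sample holds a single box in `boxes`, the box appears in the text fields exactly as Python prints the list, and the helper names `original_to_model` and `convert_sample` are hypothetical. Real datasets may use a different layout.

```python
from typing import Any, Dict, List, Tuple


def original_to_model(
    box: List[int], original_size: Tuple[int, int], input_size: Tuple[int, int]
) -> List[int]:
    """Map a box from original-image pixels to the resized model-input pixels."""
    abs_x1, abs_y1, abs_x2, abs_y2 = box
    width, height = original_size
    input_width, input_height = input_size
    return [
        int(abs_x1 / width * input_width),
        int(abs_y1 / height * input_height),
        int(abs_x2 / width * input_width),
        int(abs_y2 / height * input_height),
    ]


def convert_sample(
    sample: Dict[str, Any],
    original_size: Tuple[int, int],
    input_size: Tuple[int, int],
) -> Dict[str, Any]:
    """Return a copy of the sample with its box rewritten in model coordinates."""
    old_box = sample["boxes"]
    new_box = original_to_model(old_box, original_size, input_size)
    converted = dict(sample, boxes=new_box)  # shallow copy; the input is not modified
    # Assumption: the box is embedded in the text exactly as str(list) renders it
    for key in ("instruction", "response"):
        if key in converted:
            converted[key] = converted[key].replace(str(old_box), str(new_box))
    return converted


sample = {
    "image": ["test.jpg"],
    "instruction": "Where is the cat?",
    "response": "It's in [960, 540, 1920, 1080]",
    "boxes": [960, 540, 1920, 1080],
}
converted = convert_sample(sample, original_size=(1920, 1080), input_size=(756, 420))
print(converted["response"])  # It's in [378, 210, 756, 420]
```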

Qwen3-VL

References
https://github.com/QwenLM/Qwen3-VL/issues/1521
https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb

Differences from Qwen2.5-VL:

First of all, we list the major updates of Qwen3-VL’s spatial understanding abilities as follows:
Coordinate System: Qwen3-VL’s default coordinate system has been changed from the absolute coordinates used in Qwen2.5-VL to relative coordinates ranging from 0 to 1000. (You don’t need to calculate the resized_w)
Multi-Target Grounding: Qwen3-VL has improved its multi-target grounding ability.

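Approach

Because relative coordinates are used, there is no need to obtain the size of the preprocessed image. Only the coordinates output by the model and the size of the original image are needed.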

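from PIL import ImageDraw

# img: the original PIL image; bounding_box: one detection parsed from the model output;
# color: the outline color to draw with
width, height = img.size
draw = ImageDraw.Draw(img)

# Scale the relative 0-1000 coordinates to pixels in the original image
abs_y1 = int(bounding_box["bbox_2d"][1] / 1000 * height) # ymin
abs_x1 = int(bounding_box["bbox_2d"][0] / 1000 * width) # xmin
abs_y2 = int(bounding_box["bbox_2d"][3] / 1000 * height) # ymax
abs_x2 = int(bounding_box["bbox_2d"][2] / 1000 * width) # xmax

# Make sure (x1, y1) is the top-left corner and (x2, y2) the bottom-right one
if abs_x1 > abs_x2:
    abs_x1, abs_x2 = abs_x2, abs_x1

if abs_y1 > abs_y2:
    abs_y1, abs_y2 = abs_y2, abs_y1

# Draw the bounding box
draw.rectangle(
    ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=3
)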
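The Qwen3-VL conversion can also be tried without a model or an image. The sketch below assumes the model's reply is a plain JSON list of objects with a `bbox_2d` key (the key used in the code above). The `label` key, the sample reply, and the function name `relative_to_absolute` are assumptions for illustration. If the reply is wrapped in a markdown code fence, the fence would need to be stripped before parsing.

```python
import json
from typing import Sequence, Tuple


def relative_to_absolute(
    bbox: Sequence[float], width: int, height: int, scale: int = 1000
) -> Tuple[int, int, int, int]:
    """Map a 0-1000 relative box to pixel coordinates in a width x height image."""
    abs_x1 = int(bbox[0] / scale * width)
    abs_y1 = int(bbox[1] / scale * height)
    abs_x2 = int(bbox[2] / scale * width)
    abs_y2 = int(bbox[3] / scale * height)
    # Reorder so the first corner is top-left and the second is bottom-right
    if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1
    if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1
    return abs_x1, abs_y1, abs_x2, abs_y2


# Hypothetical model reply; the real output format may differ
raw_output = '[{"bbox_2d": [250, 250, 750, 500], "label": "cat"}]'
detections = json.loads(raw_output)

# Only the original image size is needed, here assumed to be 1920x1080
boxes = [relative_to_absolute(d["bbox_2d"], 1920, 1080) for d in detections]
print(boxes)  # [(480, 270, 1440, 540)]
```

Unlike the Qwen2.5-VL case, the function takes no model input size, which is the practical effect of the relative coordinate system.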

Q: Why 1000? (excerpted from the reference)

No, the image itself is not stretched or rescaled to 1000×1000. The model processes images in their original aspect ratio. We only normalize the output coordinates to a 1000×1000 reference grid for consistency.
This means the coordinate space is fixed, but the visual features are extracted from the original-resolution image. In practice, this design helps the model generalize better across different resolutions and aspect ratios without distortion.
