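Qwen2.5-VL and Qwen3-VL: the default coordinate system
This post compares how Qwen2.5-VL and Qwen3-VL coordinates are converted in object detection tasks. With Qwen2.5-VL you first have to get the size of the preprocessed image and then map the absolute coordinates output by the model back to the original image. Qwen3-VL instead uses a relative coordinate system (range 0-1000), so the output coordinates can be mapped directly onto the original image size without dealing with the image resize. The main difference is that Qwen3-VL adopts a normalized coordinate representation, which helps the model behave more consistently across images of different resolutions and aspect ratios.
Background
When a VLM is used for multimodal object detection, specific target grounding, or OCR, the model returns bounding boxes for what it recognizes. Because the VLM preprocesses the input image (resizing it to a different size), the bounding box coordinates it outputs have to be converted. During my experiments I found that the converted coordinates from Qwen2.5-VL and Qwen3-VL did not agree with each other, so I wrote this post to share what I found.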
Qwen2.5-VL
Reference
https://github.com/QwenLM/Qwen3-VL/issues/931
All code below is adapted from the reference above.
Approach
1. Get the size (width, height) of the preprocessed image
# Single image processing
inputs = processor(images=[image], return_tensors="pt")
# Each grid cell is 14x14 pixels. image_grid_thw holds one [t, h_grid, w_grid] row per image; for images, t is always 1.
input_height = inputs['image_grid_thw'][0][1] * 14
input_width = inputs['image_grid_thw'][0][2] * 14
# Multiple image processing (idx is the position of the image in the list)
inputs = processor(images=[image1, image2], return_tensors="pt")
input_height_idx = inputs['image_grid_thw'][idx][1] * 14
input_width_idx = inputs['image_grid_thw'][idx][2] * 14
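The size computation above can be wrapped in a small helper. This is only a minimal sketch: the name grid_to_input_size and the patch_size parameter are hypothetical, and the helper takes a plain (t, h_grid, w_grid) triple (for example one row of image_grid_thw converted to a list) so that it does not depend on the processor.

```python
def grid_to_input_size(grid_thw, patch_size=14):
    """Return (input_width, input_height) in pixels for one image.

    grid_thw is one row of image_grid_thw: (t, h_grid, w_grid).
    patch_size=14 is the grid cell size stated above for Qwen2.5-VL.
    """
    _t, h_grid, w_grid = (int(v) for v in grid_thw)
    return w_grid * patch_size, h_grid * patch_size


# Hypothetical grid: 1 frame, 36 rows and 48 columns of 14x14 cells
input_width, input_height = grid_to_input_size([1, 36, 48])
print(input_width, input_height)  # 672 504
```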
2. Convert the coordinates output by the model
# Given the original image size, img.size: width, height
# Given the model input size (obtained in step 1): input_height, input_width
# Convert the bounding box output by the model (xmin, ymin, xmax, ymax) [output_x1, output_y1, output_x2, output_y2] to actual coords [abs_x1, abs_y1, abs_x2, abs_y2]
abs_x1 = int(output_x1 / input_width * width)
abs_y1 = int(output_y1 / input_height * height)
abs_x2 = int(output_x2 / input_width * width)
abs_y2 = int(output_y2 / input_height * height)
# Visualization can then be performed on the original image using these absolute coordinates.
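3. Draw the detection box on the original image
- Note: when building a dataset to train the model, the coordinates in the original image likewise have to be converted into the model's input coordinates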
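Steps 1 and 2 can be combined into one function that returns a box ready to draw on the original image. This is a sketch under assumptions: the function name is hypothetical, it multiplies before dividing and uses round() rather than the int() truncation used above (to avoid off-by-one results from floating-point error), and the clamping to the image bounds is an added safeguard that is not part of the reference.

```python
def model_box_to_original(box, input_size, original_size):
    """Map (x1, y1, x2, y2) from resized-input pixels to original-image pixels.

    input_size is (input_width, input_height) from step 1;
    original_size is (width, height), e.g. img.size.
    """
    input_width, input_height = input_size
    width, height = original_size
    x1, y1, x2, y2 = box

    def scale(value, src, dst):
        # Multiply first so exact ratios stay exact, then clamp to [0, dst]
        return max(0, min(round(value * dst / src), dst))

    return (
        scale(x1, input_width, width),
        scale(y1, input_height, height),
        scale(x2, input_width, width),
        scale(y2, input_height, height),
    )


# Hypothetical sizes: a 1344x1008 original that was resized to 672x504
print(model_box_to_original((100, 50, 300, 200), (672, 504), (1344, 1008)))
# (200, 100, 600, 400)
```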
# Convert actual coords [abs_x1, abs_y1, abs_x2, abs_y2] to model input/output coords
input_or_output_x1 = int(abs_x1 / width * input_width)
input_or_output_y1 = int(abs_y1 / height * input_height)
input_or_output_x2 = int(abs_x2 / width * input_width)
input_or_output_y2 = int(abs_y2 / height * input_height)
// original
{
    'image': ['test.jpg'],
    'instruction': "What's this in [abs_x1, abs_y1, abs_x2, abs_y2] ?",
    'boxes': [abs_x1, abs_y1, abs_x2, abs_y2]
    ...
}
// converted
{
    'image': ['test.jpg'],
    'instruction': "What's this in [input_x1, input_y1, input_x2, input_y2] ?",
    'boxes': [input_x1, input_y1, input_x2, input_y2]
    ...
}
// original
{
    'image': ['test.jpg'],
    'instruction': "Where is the cat?",
    'response': "It's in [abs_x1, abs_y1, abs_x2, abs_y2].",
    'boxes': [abs_x1, abs_y1, abs_x2, abs_y2]
    ...
}
// converted
{
    'image': ['test.jpg'],
    'instruction': "Where is the cat?",
    'response': "It's in [output_x1, output_y1, output_x2, output_y2].",
    'boxes': [output_x1, output_y1, output_x2, output_y2]
    ...
}
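The sample conversion shown above could be automated roughly as follows. This is a sketch, not the reference implementation: original_box_to_model and convert_sample are hypothetical names, the sample is assumed to hold a single box under 'boxes', and the response text is simply rebuilt from the converted box so that the text and the 'boxes' field stay in sync.

```python
def original_box_to_model(box, original_size, input_size):
    """Map (x1, y1, x2, y2) from original-image pixels to resized-input pixels."""
    width, height = original_size
    input_width, input_height = input_size
    x1, y1, x2, y2 = box
    return [
        round(x1 * input_width / width),
        round(y1 * input_height / height),
        round(x2 * input_width / width),
        round(y2 * input_height / height),
    ]


def convert_sample(sample, original_size, input_size):
    """Return a copy of the sample with its box in model-input coordinates."""
    box = original_box_to_model(sample["boxes"], original_size, input_size)
    converted = dict(sample)  # shallow copy; the input sample is left untouched
    converted["boxes"] = box
    converted["response"] = f"It's in {box}."
    return converted


sample = {
    "image": ["test.jpg"],
    "instruction": "Where is the cat?",
    "response": "It's in [200, 100, 600, 400].",
    "boxes": [200, 100, 600, 400],
}
# Hypothetical sizes: a 1344x1008 original that the processor resizes to 672x504
converted = convert_sample(sample, original_size=(1344, 1008), input_size=(672, 504))
print(converted["response"])  # It's in [100, 50, 300, 200].
```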
Qwen3-VL
References
https://github.com/QwenLM/Qwen3-VL/issues/1521
https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb
Differences from Qwen2.5-VL:
First of all, we list the major updates of Qwen3-VL’s spatial understanding abilities as follows:
Coordinate System: Qwen3-VL’s default coordinate system has been changed from the absolute coordinates used in Qwen2.5-VL to relative coordinates ranging from 0 to 1000. (You don’t need to calculate the resized_w)
Multi-Target Grounding: Qwen3-VL has improved its multi-target grounding ability.
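Approach
Because relative coordinates are used, there is no need to get the size of the image after the model's preprocessing. The coordinates output by the model and the size of the original image are all that is needed.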
from PIL import ImageDraw

# img is the original PIL image, bounding_box is one parsed detection from the model output, color is any PIL color
width, height = img.size
draw = ImageDraw.Draw(img)
abs_y1 = int(bounding_box["bbox_2d"][1] / 1000 * height)  # ymin
abs_x1 = int(bounding_box["bbox_2d"][0] / 1000 * width)  # xmin
abs_y2 = int(bounding_box["bbox_2d"][3] / 1000 * height)  # ymax
abs_x2 = int(bounding_box["bbox_2d"][2] / 1000 * width)  # xmax
# Swap the corners if the model returned them in the wrong order
if abs_x1 > abs_x2:
    abs_x1, abs_x2 = abs_x2, abs_x1
if abs_y1 > abs_y2:
    abs_y1, abs_y2 = abs_y2, abs_y1
# Draw the bounding box
draw.rectangle(
    ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=3
)
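The same conversion can be applied to a whole list of detections without any drawing dependency. This is a minimal sketch: detections_to_pixels is a hypothetical name, and the input is assumed to be an already parsed list of dicts with a "bbox_2d" key and an optional "label" key, which may not match every prompt or output format.

```python
def detections_to_pixels(detections, width, height, scale=1000):
    """Convert 0-1000 relative boxes to pixel boxes on a width x height image."""
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox_2d"]
        # Put the corners in (min, max) order before scaling
        x1, x2 = sorted((x1, x2))
        y1, y2 = sorted((y1, y2))
        box = (
            int(x1 * width / scale),
            int(y1 * height / scale),
            int(x2 * width / scale),
            int(y2 * height / scale),
        )
        results.append((det.get("label"), box))
    return results


# Hypothetical model output for a 1920x1080 image
detections = [{"bbox_2d": [250, 100, 750, 900], "label": "cat"}]
print(detections_to_pixels(detections, width=1920, height=1080))
# [('cat', (480, 108, 1440, 972))]
```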
Q: Why 1000? (excerpted from the reference)
No, the image itself is not stretched or rescaled to 1000×1000. The model processes images in their original aspect ratio. We only normalize the output coordinates to a 1000×1000 reference grid for consistency.
This means the coordinate space is fixed, but the visual features are extracted from the original-resolution image. In practice, this design helps the model generalize better across different resolutions and aspect ratios without distortion.
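As a small illustration of the fixed reference grid, the same relative box covers the same fraction of the image whatever its resolution. The helper name to_pixels and the two image sizes are made up for this example.

```python
def to_pixels(bbox_2d, width, height, scale=1000):
    """Scale one 0-1000 relative box to pixels on a width x height image."""
    x1, y1, x2, y2 = bbox_2d
    return (
        int(x1 * width / scale),
        int(y1 * height / scale),
        int(x2 * width / scale),
        int(y2 * height / scale),
    )


bbox_2d = [500, 500, 1000, 1000]  # bottom-right quarter on the 0-1000 grid
for width, height in [(640, 480), (1920, 1080)]:
    print((width, height), to_pixels(bbox_2d, width, height))
# (640, 480) (320, 240, 640, 480)
# (1920, 1080) (960, 540, 1920, 1080)
```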