Background

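When a VLM is used for multimodal object detection, specific-target grounding, or OCR, the model returns bounding boxes for what it recognizes. Because the VLM preprocesses the input image (resizing it to a different size), the bounding-box coordinates in the model output have to be converted back to the original image. During my experiments I found that applying the same conversion to the outputs of Qwen2.5-VL and Qwen3-VL gave inconsistent results, so I am sharing the details here.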

Qwen2.5-VL

Reference
https://github.com/QwenLM/Qwen3-VL/issues/931

All of the code below is adapted from the reference above.

Approach

1. Get the size (width, height) of the preprocessed image
# Single image processing
inputs = processor(images=[image], return_tensors="pt")

# one grid has a size of 14x14, image_grid_thw is a list of [t, h_grid, w_grid], for images, t is always 1
input_height = inputs['image_grid_thw'][0][1]*14 
input_width = inputs['image_grid_thw'][0][2]*14

# Multiple image processing
inputs = processor(images=[image1, image2], return_tensors="pt")
input_height_idx = inputs['image_grid_thw'][idx][1]*14
input_width_idx = inputs['image_grid_thw'][idx][2]*14
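
The grid-to-pixel step above can be wrapped in a small helper. This is a minimal sketch, not part of the referenced issue: the function name `grid_to_input_size` and the sample grid are hypothetical, and the patch size of 14 is taken from the comment above.

```python
from typing import Sequence, Tuple

PATCH_SIZE = 14  # assumed from the comment above: one grid cell covers 14x14 pixels


def grid_to_input_size(grid_thw: Sequence, patch_size: int = PATCH_SIZE) -> Tuple[int, int]:
    """Map one [t, h_grid, w_grid] entry to (input_width, input_height) in pixels."""
    # int() turns tensor scalars into plain Python ints; t is unused for images.
    _t, h_grid, w_grid = (int(v) for v in grid_thw)
    return w_grid * patch_size, h_grid * patch_size


# Example with a made-up grid of 52 rows by 92 columns.
input_width, input_height = grid_to_input_size([1, 52, 92])
print(input_width, input_height)  # 1288 728
```

With real processor output, the argument would presumably be `inputs['image_grid_thw'][idx]` for the image at index `idx`.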

2. Convert the coordinates output by the model

# Given original size (the size of the original image) img.size: width, height
# Given model input size (the input size obtained in step 1): input_height, input_width

# Converting model output (the bounding box predicted by the model, as xmin, ymin, xmax, ymax) [output_x1, output_y1, output_x2, output_y2] to actual coords [abs_x1, abs_y1, abs_x2, abs_y2]
abs_x1 = int(output_x1 / input_width * width)
abs_y1 = int(output_y1 / input_height * height)
abs_x2 = int(output_x2 / input_width * width)
abs_y2 = int(output_y2 / input_height * height)
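
As a worked example, the four lines above can be packaged into a function and checked with concrete numbers. This is a sketch under stated assumptions: the name `model_box_to_original` and the sample sizes are hypothetical, and the final clamping step is an addition that is not in the referenced issue.

```python
from typing import Sequence, Tuple


def model_box_to_original(
    box: Sequence[float],
    input_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Rescale a Qwen2.5-VL style box from model-input pixels to original-image pixels."""
    output_x1, output_y1, output_x2, output_y2 = box
    input_width, input_height = input_size
    width, height = original_size

    abs_x1 = int(output_x1 / input_width * width)
    abs_y1 = int(output_y1 / input_height * height)
    abs_x2 = int(output_x2 / input_width * width)
    abs_y2 = int(output_y2 / input_height * height)

    # Extra safety step (not in the reference): keep the box inside the image.
    def clamp(value: int, upper: int) -> int:
        return max(0, min(value, upper))

    return (clamp(abs_x1, width), clamp(abs_y1, height),
            clamp(abs_x2, width), clamp(abs_y2, height))


# A 1920x1080 image that was resized to 1288x728 for the model (made-up sizes).
print(model_box_to_original([644, 364, 1288, 728], (1288, 728), (1920, 1080)))
# (960, 540, 1920, 1080)
```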

3. Draw the boxes on the original image using these absolute coordinates.

Note: when building a dataset to train the model, the coordinates in the original annotations must likewise be converted to the model's input coordinates.
# Convert actual coords [abs_x1, abs_y1, abs_x2, abs_y2] to model input/output coords
input_or_output_x1 = int(abs_x1 / width * input_width)
input_or_output_y1 = int(abs_y1 / height * input_height)
input_or_output_x2 = int(abs_x2 / width * input_width)
input_or_output_y2 = int(abs_y2 / height * input_height)

Example records before and after conversion. In the first pair the box appears in the instruction; in the second pair it appears in the response.

// original
{
    'image': ['test.jpg'],
    'instruction': "What's this in [abs_x1, abs_y1, abs_x2, abs_y2] ?",
    'boxes': [abs_x1, abs_y1, abs_x2, abs_y2]
    ...
}

// converted
{
    'image': ['test.jpg'],
    'instruction': "What's this in [input_x1, input_y1, input_x2, input_y2] ?",
    'boxes': [input_x1, input_y1, input_x2, input_y2]
    ...
}

// original
{
    'image': ['test.jpg'],
    'instruction': "Where is the cat?",
    'response': "It's in [abs_x1, abs_y1, abs_x2, abs_y2] ?",
    'boxes': [abs_x1, abs_y1, abs_x2, abs_y2]
    ...
}

// converted
{
    'image': ['test.jpg'],
    'instruction': "Where is the cat?",
    'response': "It's in [output_x1, output_y1, output_x2, output_y2] ?",
    'boxes': [output_x1, output_y1, output_x2, output_y2]
    ...
}
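
The record conversion shown above could be automated roughly as follows. This is only a sketch: `original_box_to_model` and `convert_record` are hypothetical helper names, the sample sizes are made up, and the text replacement assumes the box is written in the instruction or response exactly as Python prints a list (for example `[960, 540, 1920, 1080]`).

```python
from typing import Dict, List, Sequence, Tuple


def original_box_to_model(
    box: Sequence[float],
    original_size: Tuple[int, int],
    input_size: Tuple[int, int],
) -> List[int]:
    """Rescale a box from original-image pixels to model-input pixels."""
    abs_x1, abs_y1, abs_x2, abs_y2 = box
    width, height = original_size
    input_width, input_height = input_size
    return [
        int(abs_x1 / width * input_width),
        int(abs_y1 / height * input_height),
        int(abs_x2 / width * input_width),
        int(abs_y2 / height * input_height),
    ]


def convert_record(
    record: Dict,
    original_size: Tuple[int, int],
    input_size: Tuple[int, int],
    text_keys: Sequence[str] = ("instruction", "response"),
) -> Dict:
    """Return a copy of the record with 'boxes' and any box text rescaled."""
    old_box = list(record["boxes"])
    new_box = original_box_to_model(old_box, original_size, input_size)
    converted = dict(record)
    converted["boxes"] = new_box
    for key in text_keys:
        if key in converted:
            # Assumes the box text matches str(list) formatting exactly.
            converted[key] = converted[key].replace(str(old_box), str(new_box))
    return converted


record = {
    "image": ["test.jpg"],
    "instruction": "Where is the cat?",
    "response": "It's in [960, 540, 1920, 1080]",
    "boxes": [960, 540, 1920, 1080],
}
converted = convert_record(record, original_size=(1920, 1080), input_size=(1288, 728))
print(converted["boxes"])     # [644, 364, 1288, 728]
print(converted["response"])  # It's in [644, 364, 1288, 728]
```

The input size for each image still has to come from the processor, as in step 1, because it depends on how that particular image is resized.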

Qwen3-VL

References
https://github.com/QwenLM/Qwen3-VL/issues/1521
https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb

Differences from Qwen2.5-VL:

First of all, we list the major updates of Qwen3-VL’s spatial understanding abilities as follows:
Coordinate System: Qwen3-VL’s default coordinate system has been changed from the absolute coordinates used in Qwen2.5-VL to relative coordinates ranging from 0 to 1000. (You don’t need to calculate the resized_w)
Multi-Target Grounding: Qwen3-VL has improved its multi-target grounding ability.

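Approach

Because the coordinates are relative, there is no need to obtain the size of the preprocessed image. Only the coordinates output by the model and the size of the original image are needed.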

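from PIL import ImageDraw

# img is the original PIL image, bounding_box is one parsed model output item
# with a "bbox_2d" key, and color is any color accepted by PIL
width, height = img.size
draw = ImageDraw.Draw(img)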

abs_y1 = int(bounding_box["bbox_2d"][1] / 1000 * height) # ymin
abs_x1 = int(bounding_box["bbox_2d"][0] / 1000 * width) # xmin
abs_y2 = int(bounding_box["bbox_2d"][3] / 1000 * height) # ymax
abs_x2 = int(bounding_box["bbox_2d"][2] / 1000 * width) # xmax

if abs_x1 > abs_x2:
    abs_x1, abs_x2 = abs_x2, abs_x1

if abs_y1 > abs_y2:
    abs_y1, abs_y2 = abs_y2, abs_y1

# Draw the bounding box
draw.rectangle(
    ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=3
)
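
The same conversion can be written as a standalone function and checked without loading a model. This is a minimal sketch: the helper names `relative_box_to_original` and `parse_boxes` are hypothetical, and the sample output string (a plain JSON list of items with `bbox_2d` and `label` keys) is an assumption about the response format, which may need extra cleanup in practice.

```python
import json
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]


def relative_box_to_original(
    bbox_2d: Sequence[float],
    original_size: Tuple[int, int],
    scale: int = 1000,
) -> Box:
    """Map a Qwen3-VL style box in the 0..1000 range to original-image pixels."""
    width, height = original_size
    x1, y1, x2, y2 = bbox_2d
    abs_x1 = int(x1 / scale * width)
    abs_y1 = int(y1 / scale * height)
    abs_x2 = int(x2 / scale * width)
    abs_y2 = int(y2 / scale * height)
    # Reorder the corners if the model returned them swapped.
    if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1
    if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1
    return abs_x1, abs_y1, abs_x2, abs_y2


def parse_boxes(model_output: str, original_size: Tuple[int, int]) -> List[Tuple[Optional[str], Box]]:
    """Parse a JSON list of detections and convert every box to pixel coordinates."""
    items = json.loads(model_output)
    return [
        (item.get("label"), relative_box_to_original(item["bbox_2d"], original_size))
        for item in items
    ]


# Hypothetical model output for a 1920x1080 image.
sample_output = '[{"bbox_2d": [250, 500, 750, 1000], "label": "cat"}]'
print(parse_boxes(sample_output, (1920, 1080)))
# [('cat', (480, 540, 1440, 1080))]
```

Note that the image size passed in is the original size, so the preprocessing step from the Qwen2.5-VL section is not involved at all.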

Q: Why 1000? (excerpted from the reference)

No, the image itself is not stretched or rescaled to 1000×1000. The model processes images in their original aspect ratio. We only normalize the output coordinates to a 1000×1000 reference grid for consistency.
This means the coordinate space is fixed, but the visual features are extracted from the original-resolution image. In practice, this design helps the model generalize better across different resolutions and aspect ratios without distortion.
