MNN 支持 DeepSeekVL 技术文档

DeepSeekVL (https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/) 是 DeepSeek 开发的多模态大语言模型。7b 模型可以基于 MNN (https://github.com/alibaba/MNN/)在高端手机上运行,因此进行了一下适配。

一、模型结构分析

先按官方文档进行环境配置

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL

pip install -e .

然后修改一下运行脚本(只需要把 cuda() 去掉,以支持在没有cuda的设备上运行):

import torch
from modelscope import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images


# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
print(vl_chat_processor)

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).eval()
print(vl_gpt)

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

代码中的 vl_gpt 即模型结构,打印结果如下:

MultiModalityCausalLM(
  (vision_model): HybridVisionTower(
    (vision_tower_high): CLIPVisionTower(
      (vision_tower): ImageEncoderViT(
        (patch_embed): PatchEmbed(
          (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
        )
        (blocks): ModuleList(
          (0-11): 12 x Block(
            (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=768, out_features=2304, bias=True)
              (proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
            (mlp): MLPBlock(
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
              (act): GELU(approximate='none')
            )
          )
        )
        (neck): Sequential(
          (0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): LayerNorm2d()
          (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (3): LayerNorm2d()
        )
        (downsamples): Sequential(
          (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        )
        (neck_hd): Sequential(
          (0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): LayerNorm2d()
          (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (3): LayerNorm2d()
        )
      )
      (image_norm): Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
    )
    (vision_tower_low): CLIPVisionTower(
      (vision_tower): VisionTransformer(
        (patch_embed): PatchEmbed(
          (proj): Conv2d(3, 1024, kernel_size=(16, 16), stride=(16, 16))
          (norm): Identity()
        )
        (pos_drop): Dropout(p=0.0, inplace=False)
        (patch_drop): Identity()
        (norm_pre): Identity()
        (blocks): Sequential(
          (0): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (1): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (2): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (3): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (4): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (5): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (6): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (7): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (8): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (9): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (10): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (11): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (12): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (13): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (14): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (15): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (16): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (17): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (18): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (19): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (20): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (21): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (22): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
          (23): Block(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): Attention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (q_norm): Identity()
              (k_norm): Identity()
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Identity()
            )
            (ls1): Identity()
            (drop_path1): Identity()
            (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (mlp): Mlp(
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (act): GELU(approximate='none')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (ls2): Identity()
            (drop_path2): Identity()
          )
        )
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (attn_pool): AttentionPoolLatent(
          (q): Linear(in_features=1024, out_features=1024, bias=True)
          (kv): Linear(in_features=1024, out_features=2048, bias=True)
          (q_norm): Identity()
          (k_norm): Identity()
          (proj): Linear(in_features=1024, out_features=1024, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (act): GELU(approximate='none')
            (drop1): Dropout(p=0.0, inplace=False)
            (norm): Identity()
            (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
        )
        (fc_norm): Identity()
        (head_drop): Dropout(p=0.0, inplace=False)
        (head): Identity()
      )
      (image_norm): Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    )
    (high_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (low_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (resize): Resize(size=384, interpolation=bilinear, max_size=None, antialias=True)
  )
  (aligner): MlpProjector(
    (high_up_proj): Linear(in_features=1024, out_features=2048, bias=True)
    (low_up_proj): Linear(in_features=1024, out_features=2048, bias=True)
    (layers): Sequential(
      (0): GELU(approximate='none')
      (1): Linear(in_features=4096, out_features=4096, bias=True)
    )
  )
  (language_model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(102400, 4096)
      (layers): ModuleList(
        (0-29): 30 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): LlamaMLP(
            (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
          (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
        )
      )
      (norm): LlamaRMSNorm((4096,), eps=1e-06)
      (rotary_emb): LlamaRotaryEmbedding()
    )
    (lm_head): Linear(in_features=4096, out_features=102400, bias=False)
  )
)

它由以下核心组件构成:

  • 视觉模型(Vision Model): HybridVisionTower,包含 ViT(Vision Transformer)和 ResNet 混合架构,用于提取图像特征,类变量名为 vision_model
  • 对齐层(Aligner): MlpProjector,将视觉模型输出的特征映射到语言模型的嵌入空间,类变量名为aligner
  • 语言模型(Language Model): 处理文本和多模态上下文,类变量名为language_model

二、大语言模型部分导出

1. 加载

修改 transformers/llm/export/llm_export.py ,对于 deepseek-vl模型,采用单独的加载代码。由于默认的model_typemulti_modality ,不具可区分性,修改为 deepseek-vl

elif 'deepseek-vl' in model_path:
    from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
    vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
    self.tokenizer = vl_chat_processor.tokenizer
    self.processor = vl_chat_processor
    vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).eval()
    self.model = vl_gpt
    self.model.config.model_type = "deepseek-vl"

2. 映射

根据模型结构,在 transformers/llm/export/utils/model_mapper.py 中增加 deepseek-vl的映射表:

    def regist_deepseek_vl(self):
        deepseek_vlmap = {
            'config': {
                'hidden_size': 'language_config.hidden_size',
                'num_attention_heads': 'language_config.num_attention_heads',
                'num_hidden_layers': 'language_config.num_hidden_layers',
                'rope_theta': 'language_config.rope_theta',
                'head_dim': 'language_config.head_dim',
                'num_key_value_heads': 'language_config.num_key_value_heads',
            },
            'model': {
                'lm_': 'language_model.lm_head',
                'embed_': 'language_model.model.embed_tokens',
                'blocks_': 'language_model.model.layers',
                'final_layernorm_': 'language_model.model.norm',
                # 'visual': 'vision_model'
            },
            'decoder': {
                'self_attn': 'self_attn',
                'mlp': 'mlp',
                'input_layernorm': 'input_layernorm',
                'post_attention_layernorm': 'post_attention_layernorm'
            },
            'attention': {
                'q_proj': 'q_proj',
                'k_proj': 'k_proj',
                'v_proj': 'v_proj',
                'o_proj': 'o_proj'
            }
        }    
        self.regist('deepseek-vl', deepseek_vlmap)

*** 先关闭 visual 模型的导出 ***

3. 模板

deepseek 的语言模型有自己的特殊模板,在 transformers/llm/export/llm_export.pybuild_prompt_template 函数定义:

if 'DeepSeek' or 'deepseek' in self.args.path:
    template['bos'] = '<|begin_of_sentence|>'
    template['system'] = '{content}\n'
    template['user'] = '\nUser: {content}\n'
    template['assistant'] = '\nAssistant: {content}\n<|end_of_sentence|>'

4. 导出与测试

执行 python llmexport.py --path deepseek-vl --export mnn 即可导出大语言模型,文本对话简单测试后无误。

三、图像模块导出

1. 处理流程分析

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
)
# ......
VLChatProcessor::process_one
        images_outputs = self.image_processor(images, return_tensors="pt")

阅读代码可知图像处理分两步:

  • 预处理:VLMImageProcessor ,做缩放和 1.0 / 255.0 的数值变化。
  • 计算 embedding: MultiModalityCausalLM
    • images = rearrange(pixel_values, “b n c h w -> (b n) c h w”)
    • images_embeds = self.aligner(self.vision_model(images))

分别打印 pixel_valuesimages_embeds 的 shape :

        print(pixel_values.shape)
        images = rearrange(pixel_values, "b n c h w -> (b n) c h w")
        print(images.shape)
        # [b x n, T2, D]
        images_embeds = self.aligner(self.vision_model(images))
        print(images_embeds.shape)

结果如下:

torch.Size([1, 1, 3, 1024, 1024])
torch.Size([1, 3, 1024, 1024])
torch.Size([1, 576, 4096])

结论:

  • rearrange 这步不用做,MNN LLM 目前不支持 batch 图像输入
  • 需要两个类: alignervision_model
  • embedding 的维度是 batch, seq_len, hidden_size ,需要转置为 seq_len, batch, hidden_size

2. 图像模型导出

  • 修改 model_mapper.py ,加上 visual 模型的映射

  • transformers/llm/export/vision.py 增加 deepseek-vl 对应的 vision 类:

class DeepSeekVL(Vision):
    def __init__(self, visual, base):
        super().__init__(visual, base)
        self.quant_bit = 8
        self.aligner = base.model.aligner
        self.vision_model = visual

    def load(self):
        self.image_size = 1024
        self.llm_config['is_visual'] = True
        self.llm_config['image_size'] = self.image_size
        # self.llm_config['vision_start'] = self.tokenizer.img_start_id
        # self.llm_config['vision_end'] = self.tokenizer.img_end_id
        # self.llm_config['image_pad'] = self.tokenizer.img_pad_id
    def init_config(self):
        self.llm_config['is_visual'] = True
        IMAGENET_MEAN = [0.0, 0.0, 0.0]
        IMAGENET_STD = [1.0,1.0,1.0]
        for i in range(3):
            IMAGENET_MEAN[i] = IMAGENET_MEAN[i] * 255.0
            IMAGENET_STD[i] = 1.0 / IMAGENET_STD[i] / 255.0
        self.llm_config['image_mean'] = IMAGENET_MEAN
        self.llm_config['image_norm'] = IMAGENET_STD
        self.llm_config['image_size_unit'] = 14
    def export(self, onnx_path):
        input_images = torch.randn((1, 3, self.image_size, self.image_size), dtype=torch.float32)
        onnx_model = f'{onnx_path}/visual.onnx'
        torch.onnx.export(self, (input_images),
                        onnx_model,
                        input_names=['input_images'],
                        output_names=['image_embeds'],
                        dynamic_axes={
                            "input_images": { 0: "size", 2: "height", 3: "width"},
                        },
                        do_constant_folding=True,
                        verbose=False,
                        opset_version=15)
        return onnx_model
    def forward(self, images):
        vit_embeds = self.aligner(self.vision_model(images))
        # For mnn's embedding, the order is (seq, batch, hidden)
        vit_embeds = vit_embeds.permute(1, 0, 2)
        return vit_embeds

ONNX 导出问题与解决

在导出视觉模型到 ONNX 时,遇到不支持的算子:

torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::_upsample_bilinear2d_aa' to ONNX opset version 15 is not supported.
  • 修改 HybridVisionTower 的 Resize 层
    # 将 antialias=True 改为 False
    self.resize = torchvision.transforms.Resize(self.low_res_size, antialias=False)
    

然后可以完整导出

遗留问题

  • 无论 image 的尺寸是多少,输出的 embedding 尺寸不变,原因是
  • 视觉模型中的固定尺寸插值
    sam.pyneck 模块中,强制将特征图插值到 96x96
    x = F.interpolate(x.float(), size=(96, 96), mode="bilinear", align_corners=False)  # 固定尺寸
    

四、测试

使用如下 prompt 测试:

<img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg<hw>1024,1024</hw></img>介绍一下图片里的内容

结果如下:

The device supports: i8sdot:1, fp16:1, i8mm: 0, sve2: 0, sme2: 0
config path is /Users/xtjiang/alicnn/deepseek-vl-7b-chat-MNN/config.json
main, 227, cost time: 2062.623047 ms
Prepare for tuning opt Begin
Prepare for tuning opt End
main, 231, cost time: 1398.634033 ms
prompt file is /Users/xtjiang/alicnn/AliNNPrivate/build/pic2.txt
File has been downloaded successfully.
这张图片捕捉了海滩上的一个温馨时刻。一位女士和她的狗坐在沙滩上,享受着落日的余晖。女士穿着格子衬衫和牛仔裤,坐在沙滩上,双腿交叉。她正在抚摸着坐在她旁边的一只小狗。小狗穿着蓝色的背带,正在回应这种爱抚,并伸出舌头舔女士的脸。太阳正在落山,给海滩上的场景投下了温暖的金色光芒。背景中的海洋平静,为这个亲密的场景增添了宁静的氛围。

#################################
prompt tokens num = 592
decode tokens num = 106
 vision time = 3.35 s
  audio time = 0.00 s
prefill time = 8.52 s
 decode time = 5.64 s
 sample time = 0.02 s
prefill speed = 69.45 tok/s
 decode speed = 18.80 tok/s
##################################

Logo

中国智能体开发者社区,聚焦智能体与大模型开发,提供前沿资讯、实用工具链、开源项目及行业案例。通过技术沙龙、开发者大赛等活动,促进经验交流与协作,助力开发者快速构建创新智能应用。

更多推荐