kvcache-ai/ktransformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations — source repository

Introduction - KTransformers official documentation

1. Hardware environment:

Motherboard: X13SWA-TF

CPU: Intel Xeon w9-3575X

GPU: Intel Arc A770

2. Software environment:

OS: Ubuntu 25.04 (Intel/x86 build)

Starting with Ubuntu 25.04, Intel GPU support has improved. Ubuntu 22.04 and 24.10 can both build the XPU version of ktransformers from source. On 25.04, however, a source build of ktransformers complains that neither CUDA nor XPU can be found, so it cannot be installed from source and can only be installed via desktop.

Building ktransformers from source on Ubuntu 25.04 fails with the following error:

ValueError: Unsupported backend: CUDA_HOME ROCM_HOME MUSA_HOME are not set and XPU is not available.
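This message comes from build-time backend detection: the installer looks for a CUDA/ROCm/MUSA toolkit via environment variables and falls back to XPU. The following is a simplified sketch of that kind of check (illustrative only, not the actual ktransformers setup.py logic):

```python
import os

def detect_backend(env=None, xpu_available=False):
    """Pick a build backend the way the failing check roughly does.

    `env` defaults to os.environ; pass a dict to test deterministically.
    This is a hypothetical simplification, not the real ktransformers code.
    """
    env = os.environ if env is None else env
    for var, backend in (("CUDA_HOME", "cuda"),
                         ("ROCM_HOME", "rocm"),
                         ("MUSA_HOME", "musa")):
        if env.get(var):
            return backend
    if xpu_available:
        return "xpu"
    # This is the situation hit on Ubuntu 25.04: no toolkit vars, no XPU.
    raise ValueError(
        "Unsupported backend: CUDA_HOME ROCM_HOME MUSA_HOME "
        "are not set and XPU is not available.")
```

In other words, the error on 25.04 means the XPU runtime was not detected at build time, even with the hardware present.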

3. Prerequisites:
1) Install the Intel GPU driver:

Client GPUs:
Installing Client GPUs — Intel® software for general purpose GPU capabilities documentation
Data Center GPUs:
Installing Data Center GPU: LTS Releases — Intel® software for general purpose GPU capabilities documentation

Note: only the Intel® Data Center GPU Max and Intel® Data Center GPU Flex series should use the Data Center GPU installation.
All other series, such as Intel® Arc™ A-series, should use the client GPU installation (good news: installing the client GPU driver is much easier than the Data Center one).
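Once the driver is installed, two common ways to confirm the system actually sees the GPU (these checks are general Linux practice, not from the install guides above; exact output varies by system):

```shell
# Render/card nodes appear under /dev/dri once the GPU kernel driver is loaded.
ls /dev/dri

# clinfo (apt package: clinfo) should list the Arc A770 as an OpenCL device.
clinfo | grep -i "device name"
```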

2) Install Intel oneAPI

Get the Intel® oneAPI Base Toolkit

Note: after installation, be sure to run the following two commands in a terminal; otherwise llama.cpp will not detect the SYCL backend:

sudo apt update
sudo apt -y install cmake pkg-config build-essential
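In addition (this step is standard oneAPI practice rather than something stated above), the oneAPI environment normally has to be sourced in each new shell before building, or the SYCL compiler and libraries will not be on the path:

```shell
# Load oneAPI compilers and runtime libraries into the current shell
# (default install location; adjust if oneAPI was installed elsewhere).
source /opt/intel/oneapi/setvars.sh

# List visible SYCL devices; the Arc GPU should appear as a
# Level-Zero and/or OpenCL device if the driver and oneAPI are set up.
sycl-ls
```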

4. ktransformers version:

ktransformers/doc/en/xpu.md at main · kvcache-ai/ktransformers — XPU source-build guide

ktransformers/doc/en/Docker_xpu.md at main · kvcache-ai/ktransformers — Docker installation guide


5. Results

Running DeepSeek-R1-671B Q4_K_M, the single-GPU setup reaches 3.0–3.7 t/s, while the 4-GPU setup only reaches 0.5 t/s.

4-GPU configuration file:

# === Embed Tokens ===
- match:
    name: "^model\\.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

# === Rotary Embedding ===
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-4])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "xpu:0"
      prefill_device: "xpu:0"

- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "xpu:1"
      prefill_device: "xpu:1"

- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "xpu:2"
      prefill_device: "xpu:2"

- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "xpu:3"
      prefill_device: "xpu:3"

# === Linear Layers (including kv_b_proj) ===
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-4])\\..*"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "xpu:0"
      prefill_device: "xpu:0"
      generate_op: "KLinearIPEXLLM"
      prefill_op: "KLinearIPEXLLM"

- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\..*"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "xpu:1"
      prefill_device: "xpu:1"
      generate_op: "KLinearIPEXLLM"
      prefill_op: "KLinearIPEXLLM"

- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\..*"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "xpu:2"
      prefill_device: "xpu:2"
      generate_op: "KLinearIPEXLLM"
      prefill_op: "KLinearIPEXLLM"

- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\..*"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "xpu:3"
      prefill_device: "xpu:3"
      generate_op: "KLinearIPEXLLM"
      prefill_op: "KLinearIPEXLLM"

# === MLP ===
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-4])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "xpu:0"
      prefill_device: "xpu:0"

- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "xpu:1"
      prefill_device: "xpu:1"

- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "xpu:2"
      prefill_device: "xpu:2"

- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "xpu:3"
      prefill_device: "xpu:3"

# === MoE Gate ===
- match:
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGateIPEXLLM
    kwargs:
      generate_device: "xpu:0"
      prefill_device: "xpu:0"

# === MoE Experts ===
- match:
    name: "^model\\.layers\\.(0|[1-9]|1[0-4])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "xpu:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "xpu:0"
  recursive: False

- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "xpu:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "xpu:1"
  recursive: False

- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "xpu:2"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "xpu:2"
  recursive: False

- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "xpu:3"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "xpu:3"
  recursive: False

# === Self Attention ===
- match:
    name: "^model\\.layers\\..*\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "xpu"
      prefill_device: "xpu"
      absorb_for_prefill: False

# === LayerNorm ===
- match:
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RMSNorm
  replace:
    class: ktransformers.operators.layernorm.KDeepseekRMSNormIPEXLLM
    kwargs:
      generate_device: "xpu"
      prefill_device: "xpu"

# === Final lm_head ===
- match:
    name: "^lm_head$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "xpu:3"
      prefill_device: "xpu:3"
      generate_op: "KLinearIPEXLLM"
      prefill_op: "KLinearIPEXLLM"

# === Final Norm + Layers on xpu:3 ===
- match:
    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)"
  replace:
    class: "default"
    kwargs:
      generate_device: "xpu:3"
      prefill_device: "xpu:3"

# === Top-Level Model Wrapper ===
- match:
    name: "^model$"
  replace:
    class: ktransformers.operators.models.KDeepseekV2Model
    kwargs:
      per_layer_prefill_intput_threshold: 0
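As a sanity check, the four layer-range regexes used throughout this config should partition DeepSeek-R1's 61 transformer layers (0–60) across the four GPUs with no gaps and no overlaps. That can be verified quickly:

```python
import re

# The four layer-range patterns from the config above, one per GPU.
ranges = {
    0: r"^model\.layers\.(0|[1-9]|1[0-4])\.",    # layers 0-14  -> xpu:0
    1: r"^model\.layers\.(1[5-9]|2[0-9])\.",     # layers 15-29 -> xpu:1
    2: r"^model\.layers\.(3[0-9]|4[0-4])\.",     # layers 30-44 -> xpu:2
    3: r"^model\.layers\.(4[5-9]|5[0-9]|60)\.",  # layers 45-60 -> xpu:3
}

def device_for_layer(layer: int) -> int:
    """Return the xpu index whose pattern matches this layer name,
    asserting that exactly one pattern matches."""
    name = f"model.layers.{layer}.mlp"
    hits = [dev for dev, pat in ranges.items() if re.match(pat, name)]
    assert len(hits) == 1, f"layer {layer} matched {hits}"
    return hits[0]

# Every layer 0-60 maps to exactly one device: 15+15+15+16 layers.
mapping = [device_for_layer(i) for i in range(61)]
```

Note that the last GPU (xpu:3) carries 16 layers plus `lm_head` and the final norm, which is why the split is slightly uneven.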
