NVIDIA大模型推理框架：TensorRT-LLM软件流程（二）trtllm-serve启动流程--C++ Module

TensorRT-LLM框架通过Pybind11实现Python与C++的交互。启动流程中，Python层调用tllm.ExecutorConfig会通过bindings.cpp映射到C++层的ExecutorConfig接口，并进行参数验证。Executor类同样通过C++实现，核心逻辑在Executor::Impl构造函数中完成，包括初始化并行配置、加载模型（如kDECODER_ONLY类型）

chinamaoge

781人浏览 · 2025-10-27 22:48:49

chinamaoge · 2025-10-27 22:48:49 发布

trtllm-serve启动流程–C++ Module

上一篇博主要是讲的Python module部分，这一篇博客主要是聚焦trtllm-serve 启动过程中C++ module
先介绍Python Layer调用C++ Layer使用的中间件Pybind11
个人觉得C++ 部分是非常复杂和丰富的，所以只关注了和嵌入相关并且我关注的部分

Pybind11 Layer

TensorRT-LLM框架中是使用pybind11作为中间件来连接Python和C++
例如Python流程中调用tllm.ExecutorConfig，可以调用到C++ ExecutorConfig::ExecutorConfig interface
Pybind11 module 调用链过程是：
- 1.先通过CMakeLists.txt编译生成Python可以使用的bindings module
- 2.再bindings.cpp文件中对executor绑定为bindings子模块，在Python代码中可以直接导入
- 3.最后对接口进行映射，并且对接口参数都初始化，可以让Python调用的时候可以不是必须全部传入，减少代码复杂度

//cpp\tensorrt_llm\pybind\CMakeLists.txt
set(TRTLLM_PYBIND_MODULE bindings)
set(TRTLLM_PYBIND_MODULE
    ${TRTLLM_PYBIND_MODULE}
    PARENT_SCOPE)

// cpp\tensorrt_llm\pybind\bindings.cpp
//Pybind链接关联层
auto mExecutor = m.def_submodule("executor", "Executor bindings");
tensorrt_llm::pybind::executor::initBindings(mExecutor);

//cpp\tensorrt_llm\pybind\executor\executorConfig.cpp 
//Pybind映射层，py::arg定义参数类型，.def_property类成员属性绑定
py::class_<tle::ExecutorConfig>(m, "ExecutorConfig", pybind11::dynamic_attr())
    .def(py::init<                                                   //
             SizeType32,                                             // MaxBeamWidth
             tle::SchedulerConfig const&,                            // SchedulerConfig
             tle::KvCacheConfig const&,                              // KvCacheConfig
             >(),
        py::arg("max_beam_width") = 1, py::arg_v("scheduler_config", tle::SchedulerConfig(), "SchedulerConfig()"),
        py::arg_v("kv_cache_config", tle::KvCacheConfig(), "KvCacheConfig()"),
        py::arg("enable_chunked_context") = false, py::arg("normalize_log_probs") = true,
        .def_property("max_beam_width", &tle::ExecutorConfig::getMaxBeamWidth, &tle::ExecutorConfig::setMaxBeamWidth)
        .def_property("max_batch_size", &tle::ExecutorConfig::getMaxBatchSize, &tle::ExecutorConfig::setMaxBatchSize)
        .def_property("max_num_tokens", &tle::ExecutorConfig::getMaxNumTokens, &tle::ExecutorConfig::setMaxNumTokens)
        );

##tensorrt_llm\executor\executor.py
##导入module
from ..bindings import executor as tllm

##tensorrt_llm\llmapi\llm.py
##调用点
self._executor_config = tllm.ExecutorConfig(
    max_beam_width=self.args.max_beam_width,
    scheduler_config=PybindMirror.maybe_to_pybind(
        self.args.scheduler_config),
    batching_type=PybindMirror.maybe_to_pybind(self.args.batching_type)
    or tllm.BatchingType.INFLIGHT,
    max_batch_size=max_batch_size,
    max_num_tokens=max_num_tokens,
    gather_generation_logits=self.args.gather_generation_logits,
    fail_fast_on_attention_window_too_large=getattr(
        self.args, 'fail_fast_on_attention_window_too_large', False))

ExecutorConfig

ExecutorConfig接口实现在cpp\tensorrt_llm\executor\executorConfig.cpp
ExecutorConfig实现很简单只是对参数的有效性进行验证

// cpp\tensorrt_llm\executor\executorConfig.cpp
ExecutorConfig::ExecutorConfig(SizeType32 maxBeamWidth, SchedulerConfig schedulerConfig, KvCacheConfig kvCacheConfig,
    bool enableChunkedContext, bool normalizeLogProbs, SizeType32 iterStatsMaxIterations,
    SizeType32 requestStatsMaxIterations, BatchingType batchingType)
    : mMaxBeamWidth(maxBeamWidth)
    , mSchedulerConfig(std::move(schedulerConfig))
    , mKvCacheConfig(std::move(kvCacheConfig))
    , mEnableChunkedContext(enableChunkedContext)
    , mNormalizeLogProbs(normalizeLogProbs)
    , mIterStatsMaxIterations(iterStatsMaxIterations)
    , mRequestStatsMaxIterations(requestStatsMaxIterations)
    , mBatchingType(batchingType)
    , mMaxBatchSize(maxBatchSize)
{
    TLLM_CHECK(iterStatsMaxIterations >= 0);
    TLLM_CHECK(requestStatsMaxIterations >= 0);
    TLLM_CHECK(mMaxBeamWidth > 0);
    TLLM_CHECK(maxSeqIdleMicroseconds > 0);
}

Executor

tllm.Executor也是调用到C++ Layer实现，流程和ExecutorConfig类似只是文件名为executor.cpp，直接来看实现代码

//cpp\tensorrt_llm\pybind\executor\executor.cpp
Executor::Executor(
    std::filesystem::path const& modelPath, tle::ModelType modelType, tle::ExecutorConfig const& executorConfig)
{
    mExecutor = std::make_unique<tle::Executor>(modelPath, modelType, executorConfig);
}

// cpp\tensorrt_llm\executor\executor.cpp
Executor::Executor(std::filesystem::path const& modelPath, ModelType modelType, ExecutorConfig const& executorConfig)
    : mImpl(std::make_unique<Executor::Impl>(modelPath, std::nullopt, modelType, executorConfig))
{
}

//cpp\tensorrt_llm\executor\executorImpl.cpp
Executor::Impl::Impl(std::filesystem::path const& modelPath,
    std::optional<std::filesystem::path> const& encoderModelPath, ModelType const modelType,
    ExecutorConfig const& executorConfig)

具体代码实现Impl来承接，执行到Executor::Impl::Impl构造函数
initializeCommAndWorkers初始化parallelConfig
modelType为kDECODER_ONLY，直接执行loadModel function
createModel初始化处理request的batch方式赋值到gptModelType
create中instantiate TrtGptModelInflightBatching，Inflight Batching是TensorRT LLM设计的新特性，能在推理过程中动态合并新到达的请求。
TrtGptModelInflightBatching 函数中具有GPT 模型结构、TensorRT 推理能力、Inflight Batching 调度逻辑三者解耦并初始化等功能
inflight batching技术可以最大的利用GPU资源，在相同时间处理更多的requests
loadModel结束之后会执行initialize，在其中会获取参数和Launch the execution thread，这个executionLoop在后面会用到
_create_engine执行结束之后，从.engine file同级目录下读取config.json
通过_engine_config_to_model_config function instantiate data class ModelConfig，把config.json中的配置读取到_runtime_model_config，可以让TensorRT LLM在运行过程中获取优化后engine file的具体内部参数

// TensorRT-LLM加载config之后的打印
[TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen
[TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[TRT-LLM] [I] Set dtype to float16.
[TRT-LLM] [I] Set bert_attention_plugin to auto.
[TRT-LLM] [I] Set gpt_attention_plugin to auto.
[TRT-LLM] [I] Set gemm_plugin to float16.
[TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.

流程图:
Executor

TrtGptModelInflightBatching

单独把TrtGptModelInflightBatching function拿出来做一个章节是因为这个函数功能非常多
初始化TrtGptModelInflightBatching首先会初始化父类TrtGptModel，TrtGptModel构造函数会从executorConfig中读取各种参数值设置，后面会在实际处理的过程中更改
接着instantiate Class TllmRuntime，StreamReader function读取.engine format 文件，binary读取方式。
deserializeCudaEngine将预优化的.engine file 反序列化为 ICudaEngine 对象，ICudaEngine 是 TensorRT 的核心：包含了优化后的模型计算图（如层结构、算子融合结果）、输入输出张量定义、GPU 计算逻辑，具体实现是在TensorRT中
createEngineInspector读取Engine file的内部细节检查
mBufferManager为getDeviceMemorySizeV2分配计算图元数据、静态中间缓存、非流传输权重作为execution context memory，到这里总的实现序列化引擎文件到可执行 GPU 推理环境的转换

// cpp\tensorrt_llm\runtime\tllmRuntime.cpp
TllmRuntime::TllmRuntime() {
	auto reader = StreamReader(rawEngine.getPath());
	mEngine.reset(mRuntime->deserializeCudaEngine(reader));
	
	mEngineInspector.reset(mEngine->createEngineInspector());
	assessLikelihoodOfRuntimeAllocation(*mEngine, *mEngineInspector);
	setWeightStreaming(getEngine(), gpuWeightsPercent);
	auto const devMemorySize = mEngine->getDeviceMemorySizeV2();
	mEngineBuffer = mBufferManager.gpu(devMemorySize); }

模型读取完成之后，check mModelConfig、mWorldConfig、kvCacheConfig配置
createBuffers创建出getMaxNumSequences数量的SlotDecoderBuffers放置在queue中，这个步骤会计算消耗的GPU memory
createDecoder function创建出DecoderState and GptDecoder instance分别setup，GptDecoder中内部逻辑中也是分配GPU memory。如下面代码片所示，这是GptDecoder function多次调用到的结果，其中mBufferManager的操作就是在分配GPU memory。

// cpp\tensorrt_llm\layers\dynamicDecodeLayer.cpp
template <typename T>
void DynamicDecodeLayer<T>::initialize()
{
    TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);

    mOutputIdsPtrHost = mBufferManager->pinnedPool(ITensor::makeShape({}), TRTDataType<TokenIdType*>::value);
    mParentIdsPtrHost = mBufferManager->pinnedPool(ITensor::makeShape({}), TRTDataType<TokenIdType*>::value);
    mOutputIdsPtrDevice = mBufferManager->gpu(
        ITensor::makeShape({static_cast<SizeType32>(mDecoderDomain.getBatchSize())}), TRTDataType<TokenIdType*>::value);
    mParentIdsPtrDevice = mBufferManager->gpu(
        ITensor::makeShape({static_cast<SizeType32>(mDecoderDomain.getBatchSize())}), TRTDataType<TokenIdType*>::value);

    allocateBuffer();

    mCyclicStep = 0;
    mRuntimeMaxSeqLen = 0;
    mConfiguredBeamWidth = -1;

    if (!mDecodingMode.isAuto())
    {
        mConfiguredBeamWidth = mDecoderDomain.getBeamWidth();
        initializeLayers();
    }

    TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}

在这里介绍这部分代码的原因是想介绍mBufferManager，TensorRT-LLM 还有一个介绍名字：A TensorRT Toolbox for Optimized Large Language Model Inference
nvinfer1 namespace就来自于#include <NvInferRuntime.h>头文件，而这个头文件就来自于TensorRT，并且是其中核心组件的头文件
TensorRT-LLM很多很多核心功能都是依赖于TensorRT实现

// cpp\tensorrt_llm\layers\baseLayer.h
std::shared_ptr<runtime::BufferManager> mBufferManager;

//cpp\include\tensorrt_llm\runtime\bufferManager.h
#pragma once
#include "tensorrt_llm/common/assert.h"
#include "tensorrt_llm/runtime/cudaStream.h"
#include "tensorrt_llm/runtime/iBuffer.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include <NvInferRuntime.h>
#include <cstring>
#include <memory>
#include <set>
#include <string>
#include <vector>

class BufferManagerTest;
namespace tensorrt_llm::runtime
{
/// @brief Forward declaration as only used through pointer.
class CudaMemPool;
//! \brief A helper class for managing memory on host and device.
class BufferManager
{
public:
    using IBufferPtr = IBuffer::UniquePtr;
    using ITensorPtr = ITensor::UniquePtr;
    explicit BufferManager(CudaStreamPtr stream, bool trimPool = false);

    static auto constexpr kBYTE_TYPE = nvinfer1::DataType::kUINT8;
    //! \brief Allocates an `IBuffer` of the given size on the GPU, using cudaMallocAsync.
    [[nodiscard]] IBufferPtr gpu(std::size_t size, nvinfer1::DataType type = kBYTE_TYPE) const;
    //! \brief Allocates an `IBuffer` of the given size on the GPU, using cudaMalloc.
    [[nodiscard]] static IBufferPtr gpuSync(std::size_t size, nvinfer1::DataType type = kBYTE_TYPE);
};}

createKvCacheManager 创建管理KV cache instance，和设置KV cache Pools，一般是有两级cache pool，primary GPU和secondary CPU cache pool，其中GPU速度最快。
会对pool划分block更加细致管理kv cache，我调试打印primaryBlocks=412, .secondayBlocks=0，Number of tokens per block=32，没有打开CPU cache pool，设置的Max Attention window Size=13184。
需要满足公式：单 block token 数 * block数量 >= windowSize，代码中直接就是相等的
实例化CapacityScheduler这是选择动态批处理（Inflight Batching）执行器的三种容量调度策略，因为hasKvCacheManager enable，所以实例化MaxUtilizationScheduler。

// cpp\include\tensorrt_llm\executor\types.h
/// @brief The policy used to select the subset of available requests in each iteration of the executor generation loop
enum class CapacitySchedulerPolicy
{
    /// @brief MAX_UTILIZATION packs as many requests as the underlying TRT engine can support in any iteration of the
    /// InflightBatching generation loop. While this is expected to maximize GPU throughput, it might require that some
    /// requests be paused and restarted depending on peak KV cache memory availability.
    kMAX_UTILIZATION = 0,

    /// @brief GUARANTEED_NO_EVICT uses KV cache more conservatively guaranteeing that a request, once started, will run
    /// to completion without eviction.
    kGUARANTEED_NO_EVICT = 1,

    /// @brief kSTATIC_BATCH does not schedule new requests until all requests in current batch are completed.
    /// Similar to kGUARANTEED_NO_EVICT, requests will run to completion without eviction.
    kSTATIC_BATCH = 2
};

最后部分代码是推测性解码（Speculative Decoding）优化和解码器请求管理的核心逻辑。
流程图：

summary

通过一和二两篇博客，我们分析了TensorRT-LLM加载engine模型的流程，python提供API给应用层调用，通过pybind中间件调用到C++，耗时和具体处理部分都是C++实现。
模型初始化完整之后运行uvicorn.Server，接收http请求
接下来分析：TensorRT LLM接收到 request 请求并且处理的流程：trtllm-serve启动流程–HTTP Request

Tips

TLLM_LOG_LEVEL环境变量可以设置TensorRT-LLM 打印log level 方便调试

// 设置log level debug
export TLLM_LOG_LEVEL=DEBUG
// 设置log level info
export TLLM_LOG_LEVEL=INFO

运行报错解决经验分享

Bus error

根本原因就是Docker 运行容器中physical memory不足，kill一些进程就能解决，因为我是在编译的过程中去运行大模型就会出现这样的问题，单独运行其中一个进程就没问题

final link failed

这个ld链接错误的根本原因我没有找到…
解决办法是：重启系统之后就可以编译通过了

火山引擎 ADG 社区

火山引擎开发者社区是火山引擎打造的AI技术生态平台，聚焦Agent与大模型开发，提供豆包系列模型（图像/视频/视觉）、智能分析与会话工具，并配套评测集、动手实验室及行业案例库。社区通过技术沙龙、挑战赛等活动促进开发者成长，新用户可领50万Tokens权益，助力构建智能应用。

更多推荐

OpenClaw 本地部署完整指南（Windows + Ollama）

本文档基于实际部署经验编写，旨在帮助你在 Windows 系统上从零开始搭建 OpenClaw，并连接本地 Ollama 模型（如 Qwen2.5 或 Qwen3），使其具备完整的智能体能力。文档包含了所有关键步骤以及常见问题的解决方案。

火山引擎 ADG 社区

OpenClaw 小白安装指南（Windows版）

（类似一个能自动执行任务的AI机器人），不是游戏。API Key只保存在你本地电脑的加密文件里，不会上传到任何地方。访问：https://github.com/miaoxworld/openclaw-manager/releases。: 一键安装脚本会自动安装Node.js 22+，如果失败，手动下载安装：https://nodejs.org/：在PowerShell中，鼠标右键就是粘贴，不需要按

火山引擎 ADG 社区

飞书 × OpenClaw 接入指南：不用服务器，用长连接把机器人跑起来

这个项目存在的意义，就是把“飞书接 OpenClaw”这件事，整理成一套的配置入口，并把官方文档没覆盖到的坑集中写成排查清单。先说清楚它的角色：OpenClaw 现在已经内置官方飞书插件 @openclaw/feishu，功能更完整、维护也更及时。，说明飞书 + AI 的接入已经走通。另外，仓库也推荐了一个新项目：把 OpenClaw 变成“多 Agent 团队”，用多个 Agent 分工，Sla