trtllm-serve Startup Flow: C++ Module

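  • The previous post mainly covered the Python module; this post focuses on the C++ module during trtllm-serve startup.
  • It first introduces Pybind11, the middleware the Python layer uses to call the C++ layer.
  • The C++ part is, in my view, very complex and rich, so I only cover the embedding-related parts that I care about.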

Pybind11 Layer

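  • The TensorRT-LLM framework uses pybind11 as the middleware that connects Python and C++.
  • For example, calling tllm.ExecutorConfig in the Python flow reaches the C++ ExecutorConfig::ExecutorConfig interface.
  • The Pybind11 module call chain is:
    • 1. CMakeLists.txt builds the bindings module that Python can use.
    • 2. In bindings.cpp, executor is registered as a submodule of bindings, so Python code can import it directly.
    • 3. Finally, the interface is mapped and every parameter is given a default value, so a Python caller does not have to pass all of them, which keeps the calling code simpler.
# cpp\tensorrt_llm\pybind\CMakeLists.txt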
set(TRTLLM_PYBIND_MODULE bindings)
set(TRTLLM_PYBIND_MODULE
    ${TRTLLM_PYBIND_MODULE}
    PARENT_SCOPE)
// cpp\tensorrt_llm\pybind\bindings.cpp
// Pybind linkage layer: registers the executor submodule
auto mExecutor = m.def_submodule("executor", "Executor bindings");
tensorrt_llm::pybind::executor::initBindings(mExecutor);

// cpp\tensorrt_llm\pybind\executor\executorConfig.cpp
// Pybind mapping layer: py::arg names each parameter and sets its default; .def_property binds class member properties.
// Abridged excerpt: only the leading constructor parameters and a few properties are shown.
py::class_<tle::ExecutorConfig>(m, "ExecutorConfig", pybind11::dynamic_attr())
    .def(py::init<                        //
             SizeType32,                  // MaxBeamWidth
             tle::SchedulerConfig const&, // SchedulerConfig
             tle::KvCacheConfig const&,   // KvCacheConfig
             bool,                        // EnableChunkedContext
             bool                         // NormalizeLogProbs
             >(),
        py::arg("max_beam_width") = 1, py::arg_v("scheduler_config", tle::SchedulerConfig(), "SchedulerConfig()"),
        py::arg_v("kv_cache_config", tle::KvCacheConfig(), "KvCacheConfig()"),
        py::arg("enable_chunked_context") = false, py::arg("normalize_log_probs") = true)
    .def_property("max_beam_width", &tle::ExecutorConfig::getMaxBeamWidth, &tle::ExecutorConfig::setMaxBeamWidth)
    .def_property("max_batch_size", &tle::ExecutorConfig::getMaxBatchSize, &tle::ExecutorConfig::setMaxBatchSize)
    .def_property("max_num_tokens", &tle::ExecutorConfig::getMaxNumTokens, &tle::ExecutorConfig::setMaxNumTokens);
# tensorrt_llm\executor\executor.py
# Import the bindings submodule
from ..bindings import executor as tllm

# tensorrt_llm\llmapi\llm.py
# Call site
self._executor_config = tllm.ExecutorConfig(
    max_beam_width=self.args.max_beam_width,
    scheduler_config=PybindMirror.maybe_to_pybind(
        self.args.scheduler_config),
    batching_type=PybindMirror.maybe_to_pybind(self.args.batching_type)
    or tllm.BatchingType.INFLIGHT,
    max_batch_size=max_batch_size,
    max_num_tokens=max_num_tokens,
    gather_generation_logits=self.args.gather_generation_logits,
    fail_fast_on_attention_window_too_large=getattr(
        self.args, 'fail_fast_on_attention_window_too_large', False))
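
The effect of step 3 can be seen from the Python side. The sketch below is not the real binding: `ExecutorConfigSketch` is a hypothetical pure-Python stand-in that only mirrors the three defaults visible in the `py::arg` list above (`max_beam_width = 1`, `enable_chunked_context = false`, `normalize_log_probs = true`). It shows why a caller can pass only the arguments it cares about.

```python
from dataclasses import dataclass


@dataclass
class ExecutorConfigSketch:
    """Hypothetical stand-in for tllm.ExecutorConfig; not the real binding."""

    # Defaults copied from the py::arg(...) = value entries in the binding excerpt.
    max_beam_width: int = 1
    enable_chunked_context: bool = False
    normalize_log_probs: bool = True


# Callers override only what they need; everything else keeps its default.
default_cfg = ExecutorConfigSketch()
beam_cfg = ExecutorConfigSketch(max_beam_width=4)

print(default_cfg)
print(beam_cfg)
```

This is the same convenience the `llm.py` call site relies on when it passes a handful of keyword arguments and leaves the rest unset.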

ExecutorConfig

  • The ExecutorConfig interface is implemented in cpp\tensorrt_llm\executor\executorConfig.cpp.
  • The implementation is simple: it stores the arguments and validates them.
// cpp\tensorrt_llm\executor\executorConfig.cpp
ExecutorConfig::ExecutorConfig(SizeType32 maxBeamWidth, SchedulerConfig schedulerConfig, KvCacheConfig kvCacheConfig,
    bool enableChunkedContext, bool normalizeLogProbs, SizeType32 iterStatsMaxIterations,
    SizeType32 requestStatsMaxIterations, BatchingType batchingType
    /* ... remaining parameters omitted in this excerpt, including maxBatchSize and maxSeqIdleMicroseconds */)
    : mMaxBeamWidth(maxBeamWidth)
    , mSchedulerConfig(std::move(schedulerConfig))
    , mKvCacheConfig(std::move(kvCacheConfig))
    , mEnableChunkedContext(enableChunkedContext)
    , mNormalizeLogProbs(normalizeLogProbs)
    , mIterStatsMaxIterations(iterStatsMaxIterations)
    , mRequestStatsMaxIterations(requestStatsMaxIterations)
    , mBatchingType(batchingType)
    , mMaxBatchSize(maxBatchSize)
{
    TLLM_CHECK(iterStatsMaxIterations >= 0);
    TLLM_CHECK(requestStatsMaxIterations >= 0);
    TLLM_CHECK(mMaxBeamWidth > 0);
    TLLM_CHECK(maxSeqIdleMicroseconds > 0);
}
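
To make the validation step concrete, here is a minimal Python sketch of the same pattern. `check` and `validate_executor_config` are hypothetical names, and the sketch assumes a failed `TLLM_CHECK` surfaces as an exception; only the three conditions whose variables appear in the excerpt's parameter list are reproduced.

```python
def check(condition: bool, message: str) -> None:
    """Rough analogue of TLLM_CHECK: raise if the condition does not hold."""
    if not condition:
        raise ValueError(f"Assertion failed: {message}")


def validate_executor_config(max_beam_width: int,
                             iter_stats_max_iterations: int,
                             request_stats_max_iterations: int) -> None:
    # Same three conditions as in the C++ constructor body above.
    check(iter_stats_max_iterations >= 0, "iterStatsMaxIterations >= 0")
    check(request_stats_max_iterations >= 0, "requestStatsMaxIterations >= 0")
    check(max_beam_width > 0, "mMaxBeamWidth > 0")


validate_executor_config(1, 1000, 0)  # valid values pass silently

try:
    validate_executor_config(0, 1000, 0)  # beam width must be positive
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```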

Executor

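  • tllm.Executor is also implemented by calling into the C++ layer. The flow is similar to ExecutorConfig, except that the file is named executor.cpp, so let's go straight to the implementation.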
//cpp\tensorrt_llm\pybind\executor\executor.cpp
Executor::Executor(
    std::filesystem::path const& modelPath, tle::ModelType modelType, tle::ExecutorConfig const& executorConfig)
{
    mExecutor = std::make_unique<tle::Executor>(modelPath, modelType, executorConfig);
}

// cpp\tensorrt_llm\executor\executor.cpp
Executor::Executor(std::filesystem::path const& modelPath, ModelType modelType, ExecutorConfig const& executorConfig)
    : mImpl(std::make_unique<Executor::Impl>(modelPath, std::nullopt, modelType, executorConfig))
{
}

//cpp\tensorrt_llm\executor\executorImpl.cpp
Executor::Impl::Impl(std::filesystem::path const& modelPath,
    std::optional<std::filesystem::path> const& encoderModelPath, ModelType const modelType,
    ExecutorConfig const& executorConfig)
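  • The actual work is delegated to Impl, so execution reaches the Executor::Impl::Impl constructor.
  • initializeCommAndWorkers initializes parallelConfig.
  • Since modelType is kDECODER_ONLY, the loadModel function is executed directly.
  • createModel determines how request batches are handled and stores the result in gptModelType.
  • create instantiates TrtGptModelInflightBatching. Inflight Batching is a TensorRT LLM feature that dynamically merges newly arrived requests into the running batch during inference.
  • The TrtGptModelInflightBatching class keeps three concerns decoupled and initializes each of them: the GPT model structure, the TensorRT inference capability, and the Inflight Batching scheduling logic.
  • Inflight batching makes fuller use of GPU resources, so more requests are handled in the same amount of time.
  • After loadModel finishes, initialize runs; it reads parameters and launches the execution thread. The executionLoop started here is used later for inflight batching.
  • After _create_engine finishes, config.json is read from the same directory as the .engine file.
  • The _engine_config_to_model_config function instantiates the data class ModelConfig and reads the settings in config.json into _runtime_model_config, so TensorRT LLM can access the internal parameters of the optimized engine file at runtime.
// Log output after TensorRT-LLM loads the config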
[TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen
[TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[TRT-LLM] [I] Set dtype to float16.
[TRT-LLM] [I] Set bert_attention_plugin to auto.
[TRT-LLM] [I] Set gpt_attention_plugin to auto.
[TRT-LLM] [I] Set gemm_plugin to float16.
[TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
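
The config-loading step can be pictured with a short sketch. Everything named here is an assumption for illustration: `ModelConfigSketch`, `load_engine_config`, the file name `rank0.engine`, and the flat JSON keys `dtype` and `max_batch_size` are hypothetical, and a real engine `config.json` holds far more. Only the idea of reading `config.json` from the directory that contains the engine file comes from the text above.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ModelConfigSketch:
    """Hypothetical subset of runtime model settings; field names are illustrative."""

    dtype: str
    max_batch_size: int


def load_engine_config(engine_path: Path) -> ModelConfigSketch:
    # config.json is expected in the same directory as the .engine file.
    config_path = engine_path.parent / "config.json"
    with config_path.open(encoding="utf-8") as f:
        raw = json.load(f)
    return ModelConfigSketch(dtype=raw["dtype"], max_batch_size=raw["max_batch_size"])


# Demo: a temporary directory stands in for the engine directory.
with tempfile.TemporaryDirectory() as tmp:
    engine_file = Path(tmp) / "rank0.engine"
    engine_file.write_bytes(b"")  # empty placeholder, not a real engine
    (Path(tmp) / "config.json").write_text(
        json.dumps({"dtype": "float16", "max_batch_size": 8}), encoding="utf-8")
    model_config = load_engine_config(engine_file)

print(model_config)
```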

Flowchart: Executor
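
To give a feel for why merging newly arrived requests into a running batch helps, here is a toy simulation. It is not TensorRT-LLM's executionLoop: `run_inflight` and `run_static` are hypothetical helpers, each request is reduced to a number of decode steps (at least 1), and one loop iteration stands for one step of the whole batch.

```python
from collections import deque


def run_inflight(requests, max_batch_size):
    """requests: list of (name, steps_needed >= 1). Returns finish iteration per name."""
    waiting = deque(requests)
    active = {}  # name -> remaining decode steps
    finished_at = {}
    iteration = 0
    while waiting or active:
        # Admit waiting requests into free slots before every step.
        while waiting and len(active) < max_batch_size:
            name, steps = waiting.popleft()
            active[name] = steps
        iteration += 1
        # One decode step for every active request; finished ones free their slot.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                finished_at[name] = iteration
    return finished_at


def run_static(requests, max_batch_size):
    """Static batching: the next batch starts only after the whole current batch is done."""
    finished_at = {}
    iteration = 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        for name, steps in batch:
            finished_at[name] = iteration + steps
        iteration += max(steps for _, steps in batch)
    return finished_at


reqs = [("a", 1), ("b", 4), ("c", 2), ("d", 1)]
inflight = run_inflight(reqs, max_batch_size=2)
static = run_static(reqs, max_batch_size=2)
print("inflight:", inflight)
print("static:  ", static)
```

With these four requests and two slots, the inflight loop finishes everything by iteration 4, while the static variant needs 6 because short requests leave their slot idle until the longest one in the batch completes.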

TrtGptModelInflightBatching

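  • TrtGptModelInflightBatching gets its own section because it does a great deal.
  • Initializing TrtGptModelInflightBatching first initializes the parent class TrtGptModel. The TrtGptModel constructor reads various parameter values from executorConfig; these are adjusted later during actual processing.
  • Next, the class TllmRuntime is instantiated. StreamReader reads the .engine file in binary mode.
  • deserializeCudaEngine deserializes the pre-optimized .engine file into an ICudaEngine object. ICudaEngine is the core of TensorRT: it contains the optimized model compute graph (such as layer structure and operator fusion results), the input and output tensor definitions, and the GPU compute logic. Its implementation lives in TensorRT itself.
  • createEngineInspector creates an inspector for examining the internal details of the engine file.
  • mBufferManager allocates a buffer of the size reported by getDeviceMemorySizeV2 to serve as execution context memory, covering compute graph metadata, static intermediate buffers, and non-streamed weights. At this point the serialized engine file has been turned into an executable GPU inference environment.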
// cpp\tensorrt_llm\runtime\tllmRuntime.cpp
// Abridged constructor body; the parameter list (which supplies rawEngine and gpuWeightsPercent) is omitted.
TllmRuntime::TllmRuntime()
{
    auto reader = StreamReader(rawEngine.getPath());
    mEngine.reset(mRuntime->deserializeCudaEngine(reader));

    mEngineInspector.reset(mEngine->createEngineInspector());
    assessLikelihoodOfRuntimeAllocation(*mEngine, *mEngineInspector);
    setWeightStreaming(getEngine(), gpuWeightsPercent);
    auto const devMemorySize = mEngine->getDeviceMemorySizeV2();
    mEngineBuffer = mBufferManager.gpu(devMemorySize);
}
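
The reader-based loading step can be sketched in a few lines. This is only an illustration of the general idea, as I understand it, that the consumer pulls bytes through a reader object in binary mode; `FileStreamReaderSketch` and `read_all` are hypothetical names and do not reproduce the TensorRT or TensorRT-LLM reader interface.

```python
import tempfile
from pathlib import Path


class FileStreamReaderSketch:
    """Hypothetical reader: hands out the file's bytes in caller-sized chunks."""

    def __init__(self, path: Path):
        self._file = path.open("rb")  # binary mode, as for a serialized engine

    def read(self, nbytes: int) -> bytes:
        # Returns an empty bytes object once the end of the file is reached.
        return self._file.read(nbytes)

    def close(self) -> None:
        self._file.close()


def read_all(reader: FileStreamReaderSketch, chunk_size: int = 4) -> bytes:
    # Stand-in for the consumer: keeps pulling chunks until the reader is drained.
    chunks = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


with tempfile.TemporaryDirectory() as tmp:
    fake_engine = Path(tmp) / "fake.engine"
    fake_engine.write_bytes(b"0123456789")  # placeholder bytes, not a real engine
    reader = FileStreamReaderSketch(fake_engine)
    payload = read_all(reader)
    reader.close()

print(payload)
```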
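  • After the model is loaded, the mModelConfig, mWorldConfig, and kvCacheConfig settings are checked.
  • createBuffers creates getMaxNumSequences SlotDecoderBuffers and places them in a queue; this step accounts for the GPU memory they consume.
  • createDecoder creates a DecoderState instance and a GptDecoder instance and sets up each of them. GptDecoder also allocates memory internally. The snippet below is reached repeatedly from GptDecoder; the mBufferManager calls in it are the allocations (pinnedPool for pinned host memory, gpu for device memory).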
// cpp\tensorrt_llm\layers\dynamicDecodeLayer.cpp
template <typename T>
void DynamicDecodeLayer<T>::initialize()
{
    TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);

    mOutputIdsPtrHost = mBufferManager->pinnedPool(ITensor::makeShape({}), TRTDataType<TokenIdType*>::value);
    mParentIdsPtrHost = mBufferManager->pinnedPool(ITensor::makeShape({}), TRTDataType<TokenIdType*>::value);
    mOutputIdsPtrDevice = mBufferManager->gpu(
        ITensor::makeShape({static_cast<SizeType32>(mDecoderDomain.getBatchSize())}), TRTDataType<TokenIdType*>::value);
    mParentIdsPtrDevice = mBufferManager->gpu(
        ITensor::makeShape({static_cast<SizeType32>(mDecoderDomain.getBatchSize())}), TRTDataType<TokenIdType*>::value);

    allocateBuffer();

    mCyclicStep = 0;
    mRuntimeMaxSeqLen = 0;
    mConfiguredBeamWidth = -1;

    if (!mDecodingMode.isAuto())
    {
        mConfiguredBeamWidth = mDecoderDomain.getBeamWidth();
        initializeLayers();
    }

    TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
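  • This code is shown here in order to introduce mBufferManager. TensorRT-LLM is also described as: A TensorRT Toolbox for Optimized Large Language Model Inference.
  • The nvinfer1 namespace comes from #include <NvInferRuntime.h>, a header that belongs to TensorRT and is one of its core component headers.
  • A great many of TensorRT-LLM's core features rely on TensorRT.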
// cpp\tensorrt_llm\layers\baseLayer.h
std::shared_ptr<runtime::BufferManager> mBufferManager;

//cpp\include\tensorrt_llm\runtime\bufferManager.h
#pragma once
#include "tensorrt_llm/common/assert.h"
#include "tensorrt_llm/runtime/cudaStream.h"
#include "tensorrt_llm/runtime/iBuffer.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include <NvInferRuntime.h>
#include <cstring>
#include <memory>
#include <set>
#include <string>
#include <vector>

class BufferManagerTest;
namespace tensorrt_llm::runtime
{
/// @brief Forward declaration as only used through pointer.
class CudaMemPool;
//! \brief A helper class for managing memory on host and device.
class BufferManager
{
public:
    using IBufferPtr = IBuffer::UniquePtr;
    using ITensorPtr = ITensor::UniquePtr;
    explicit BufferManager(CudaStreamPtr stream, bool trimPool = false);

    static auto constexpr kBYTE_TYPE = nvinfer1::DataType::kUINT8;
    //! \brief Allocates an `IBuffer` of the given size on the GPU, using cudaMallocAsync.
    [[nodiscard]] IBufferPtr gpu(std::size_t size, nvinfer1::DataType type = kBYTE_TYPE) const;
    //! \brief Allocates an `IBuffer` of the given size on the GPU, using cudaMalloc.
    [[nodiscard]] static IBufferPtr gpuSync(std::size_t size, nvinfer1::DataType type = kBYTE_TYPE);
};
} // namespace tensorrt_llm::runtime
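  • createKvCacheManager creates the instance that manages the KV cache and sets up the KV cache pools. There are usually two levels of cache pool: a primary pool on the GPU and a secondary pool on the CPU, with the GPU pool being the fastest.
  • Each pool is divided into blocks for finer-grained KV cache management. My debug output showed primaryBlocks=412, secondaryBlocks=0, and Number of tokens per block=32; the CPU cache pool was not enabled, and Max Attention window Size was set to 13184.
  • The following must hold: tokens per block * number of blocks >= windowSize. In the code the two sides are exactly equal (32 * 412 = 13184).
  • Next, CapacityScheduler is instantiated. It selects among the three capacity scheduling policies of the Inflight Batching executor; because hasKvCacheManager is enabled, MaxUtilizationScheduler is instantiated.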
// cpp\include\tensorrt_llm\executor\types.h
/// @brief The policy used to select the subset of available requests in each iteration of the executor generation loop
enum class CapacitySchedulerPolicy
{
    /// @brief MAX_UTILIZATION packs as many requests as the underlying TRT engine can support in any iteration of the
    /// InflightBatching generation loop. While this is expected to maximize GPU throughput, it might require that some
    /// requests be paused and restarted depending on peak KV cache memory availability.
    kMAX_UTILIZATION = 0,

    /// @brief GUARANTEED_NO_EVICT uses KV cache more conservatively guaranteeing that a request, once started, will run
    /// to completion without eviction.
    kGUARANTEED_NO_EVICT = 1,

    /// @brief kSTATIC_BATCH does not schedule new requests until all requests in current batch are completed.
    /// Similar to kGUARANTEED_NO_EVICT, requests will run to completion without eviction.
    kSTATIC_BATCH = 2
};
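
The block arithmetic and the difference between the first two policies can be sketched together. This is a toy model, not the TensorRT-LLM scheduler: `blocks_needed`, `admit`, and `Policy` are hypothetical names, and the admission rule is my reading of the enum comments (MAX_UTILIZATION budgets for what a request needs now, GUARANTEED_NO_EVICT reserves enough for it to run to completion).

```python
import math
from enum import Enum


class Policy(Enum):
    MAX_UTILIZATION = 0
    GUARANTEED_NO_EVICT = 1


def blocks_needed(num_tokens: int, tokens_per_block: int) -> int:
    # A partially filled block still occupies a whole block.
    return math.ceil(num_tokens / tokens_per_block)


def admit(requests, free_blocks, tokens_per_block, policy):
    """requests: list of (name, current_tokens, max_tokens). Returns admitted names."""
    admitted = []
    for name, current_tokens, max_tokens in requests:
        # Optimistic policy counts current tokens; conservative policy counts the maximum.
        budget_tokens = current_tokens if policy is Policy.MAX_UTILIZATION else max_tokens
        need = blocks_needed(budget_tokens, tokens_per_block)
        if need <= free_blocks:
            free_blocks -= need
            admitted.append(name)
    return admitted


# Numbers from the debug output above: a 13184-token window with 32 tokens per block.
window_blocks = blocks_needed(13184, 32)

reqs = [("a", 64, 256), ("b", 64, 256), ("c", 64, 256)]
greedy = admit(reqs, free_blocks=12, tokens_per_block=32, policy=Policy.MAX_UTILIZATION)
safe = admit(reqs, free_blocks=12, tokens_per_block=32, policy=Policy.GUARANTEED_NO_EVICT)
print(window_blocks, greedy, safe)
```

In this toy setup the optimistic policy admits all three requests, and would later have to pause some of them as they grow, while the conservative policy admits only the first one.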
  • The final part of the code is the core logic for speculative decoding optimization and decoder request management.
    Flowchart: TrtGptModelInflightBatching

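Summary

  • Across parts one and two of this series, we analyzed how TensorRT-LLM loads an engine model: Python provides the API for the application layer and calls into C++ through the pybind middleware, while the time-consuming and concrete processing is all implemented in C++.
  • Once model initialization completes, uvicorn.Server runs and accepts HTTP requests.
    Figure: uvicorn.Server
  • Next up: how TensorRT LLM receives and processes a request, covered in "trtllm-serve Startup Flow: HTTP Request".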

Tips

  • The TLLM_LOG_LEVEL environment variable sets the TensorRT-LLM log level, which is handy for debugging.
# Set log level to debug
export TLLM_LOG_LEVEL=DEBUG
# Set log level to info
export TLLM_LOG_LEVEL=INFO
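
The same variable can also be set from a Python launcher script. This sketch assumes the logger reads TLLM_LOG_LEVEL when the library is first initialized, so the assignment is placed before the import; I have not verified the exact point at which the value is read.

```python
import os

# Set the level before the library is imported (assumption: the logger picks up
# TLLM_LOG_LEVEL at initialization time).
os.environ["TLLM_LOG_LEVEL"] = "DEBUG"

# import tensorrt_llm  # the real import would follow here in an actual script

print(os.environ["TLLM_LOG_LEVEL"])
```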

Notes on Fixing Runtime Errors

Bus error

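  • The root cause is insufficient physical memory in the running Docker container; killing some processes resolves it. In my case it happened because I was running the large model while a build was in progress; running either process on its own was fine.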

final link failed

  • I did not find the root cause of this ld link error.
  • Workaround: after rebooting the system, the build succeeded.