#TensorRT LLM 1.0 Hands-On#

Major release: NVIDIA TensorRT LLM 1.0 is now available
https://marketing.csdn.net/p/2f305fdae56d5d43fd0a970a7fe7348d?pId=3163
"NVIDIA TensorRT LLM 1.0 User Guide" link:
https://img-bss.csdnimg.cn/bss/NVIDIA/TensorRT-LLM.html

For installation instructions, see https://nvidia.github.io/TensorRT-LLM/latest/installation/linux.html
I went with WSL because I didn't want to switch over to Linux (the CUDA environment on that Linux box must not be touched, and the TRT manual requires CUDA 12.9), and because some dependencies, such as libopenmpi-dev, are hard or very fiddly to get working on Windows. So, WSL it was.

conda create -n trt python=3.12


conda activate trt

Two notes on WSL: first, it must be WSL 2; second, installing CUDA on WSL has its own procedure, and you cannot use the standard desktop Linux one. NVIDIA's manual explicitly says not to install the driver inside WSL; this pitfall cost me a full three WSL reinstalls.


wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda-repo-wsl-ubuntu-12-9-local_12.9.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-9-local_12.9.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9


Then add the following to ~/.bashrc; these variables are what solve the "nvcc not found" problem.


export PATH=/usr/local/cuda-12/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-12

conda activate trt
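
To confirm the variables took effect, here is a minimal sanity check to run from a fresh shell (my own sketch, not part of the official steps):

# Verify the .bashrc additions are visible to Python.
import os
import shutil
print(os.environ.get("CUDA_HOME"))  # expect /usr/local/cuda-12
print(shutil.which("nvcc"))         # expect /usr/local/cuda-12/bin/nvcc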

Now the installation proper. First, PyTorch and the required libraries, which means over a gigabyte of downloads:

pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
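
Before moving on, it is worth checking that this torch build actually sees the GPU through WSL (a minimal sketch; the expected version string matches the logs further down):

# Confirm the cu128 wheel works under WSL.
import torch
print(torch.__version__)              # expect 2.7.1+cu128
print(torch.cuda.is_available())      # True if WSL GPU passthrough works
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"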


sudo apt-get -y install libopenmpi-dev

# Optional step: Only required for disagg-serving
sudo apt-get -y install libzmq3-dev


Then install TensorRT-LLM:

pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm

It hung…

pip3 install tensorrt_llm
Collecting tensorrt_llm
  Using cached tensorrt_llm-1.0.0.tar.gz (1.6 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... -

After hanging for a good ten-plus minutes, it resumed scrolling on its own,
and then got stuck on this line:

Collecting tensorrt_cu12_libs==10.11.0.33 (from tensorrt_cu12==10.11.0.33->tensorrt~=10.11.0->tensorrt_llm)
  Downloading tensorrt_cu12_libs-10.11.0.33.tar.gz (709 bytes)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... -

About ten minutes later, it continued.
After installing this whole pile of packages it finished, quite smoothly in the end:

      Successfully uninstalled fsspec-2025.9.0
Successfully installed StrEnum-0.4.15 accelerate-1.11.0 aenum-3.1.16 aiohappyeyeballs-2.6.1 aiohttp-3.13.2 aiosignal-1.4.0 annotated-types-0.7.0 antlr4-python3-runtime-4.9.3 anyio-4.11.0 attrs-25.4.0 backoff-2.2.1 blake3-1.0.8 blobfile-3.1.0 build-1.3.0 certifi-2025.10.5 cffi-2.0.0 charset_normalizer-3.4.4 click-8.3.0 click_option_group-0.5.9 colored-2.3.1 contourpy-1.3.3 cuda-bindings-12.9.4 cuda-pathfinder-1.3.2 cuda-python-12.9.4 cycler-0.12.1 datasets-3.1.0 diffusers-0.35.2 dill-0.3.8 distro-1.9.0 einops-0.8.1 etcd3-0.12.0 evaluate-0.4.6 fastapi-0.115.4 flashinfer-python-0.2.5 fonttools-4.60.1 frozenlist-1.8.0 fsspec-2024.9.0 grpcio-1.76.0 h11-0.16.0 h5py-3.12.1 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface-hub-0.36.0 idna-3.11 importlib_metadata-8.7.0 jiter-0.11.1 kiwisolver-1.4.9 lark-1.3.1 llguidance-0.7.29 lxml-6.0.2 markdown-it-py-4.0.0 matplotlib-3.10.7 mdurl-0.1.2 meson-1.9.1 ml_dtypes-0.5.3 mpi4py-4.1.1 multidict-6.7.0 multiprocess-0.70.16 ninja-1.13.0 numpy-1.26.4 nvidia-ml-py-12.575.51 nvidia-modelopt-0.33.1 nvidia-modelopt-core-0.33.1 nvtx-0.2.13 omegaconf-2.3.0 onnx-1.19.1 onnx_graphsurgeon-0.5.8 openai-2.7.1 opencv-python-headless-4.11.0.86 optimum-2.0.0 ordered-set-4.1.0 packaging-25.0 pandas-2.3.3 peft-0.17.1 pillow-10.3.0 polygraphy-0.49.26 propcache-0.4.1 protobuf-6.33.0 psutil-7.1.3 pulp-3.3.0 pyarrow-22.0.0 pycparser-2.23 pycryptodomex-3.23.0 pydantic-2.12.4 pydantic-core-2.41.5 pydantic-settings-2.11.0 pygments-2.19.2 pynvml-12.0.0 pyparsing-3.2.5 pyproject_hooks-1.2.0 python-dateutil-2.9.0.post0 python-dotenv-1.2.1 pytz-2025.2 pyyaml-6.0.3 pyzmq-27.1.0 regex-2025.11.3 requests-2.32.5 rich-14.2.0 safetensors-0.6.2 scipy-1.16.3 sentencepiece-0.2.1 setuptools-79.0.1 six-1.17.0 sniffio-1.3.1 soundfile-0.13.1 starlette-0.41.3 tenacity-9.1.2 tensorrt-10.11.0.33 tensorrt_cu12-10.11.0.33 tensorrt_cu12_bindings-10.11.0.33 tensorrt_cu12_libs-10.11.0.33 tensorrt_llm-1.0.0 tiktoken-0.12.0 tokenizers-0.21.4 torchprofile-0.0.4 tqdm-4.67.1 transformers-4.53.1 typing-inspection-0.4.2 tzdata-2025.2 urllib3-2.5.0 uvicorn-0.38.0 xgrammar-0.1.21 xxhash-3.6.0 yarl-1.22.0 zipp-3.23.0

Now write a program to give it a spin, using the example TRT provides:


from tensorrt_llm import LLM, SamplingParams
def main():

    # The model can be an HF model name, a path to a local HF model,
    # or a TensorRT Model Optimizer quantized checkpoint such as nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create the sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

if __name__ == '__main__':
    main()



After a long wait, execution threw an error: huggingface.co is unreachable.


Traceback (most recent call last):
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 488, in _make_request
    raise new_e
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
    self._validate_conn(conn)
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
    conn.connect()
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 753, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x716f0fd40b30>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/TinyLlama/TinyLlama-1.1B-Chat-v1.0/revision/main (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x716f0fd40b30>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

There's a pile more after this; ignore it.
For HF access from inside China, I ran
export HF_ENDPOINT=https://hf-mirror.com
After that, re-running still errored, so I upgraded as the hint suggested:

pip install transformers -U

Then added:

export HF_HUB_BASE_URL=https://hf-mirror.com

Running it still errored.
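
One thing worth knowing here: huggingface_hub reads HF_ENDPOINT when it is first imported, so the variable must be present in the environment of the exact shell the script runs from. A minimal sketch (my own workaround idea, not from the TRT docs) that removes the dependency on the shell:

# Set the mirror before any HF-related import happens.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from tensorrt_llm import LLM  # import only after the env var is in place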

After a round of fiddling, I found that

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

actually downloaded the model.
So straight back to the test program:

(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:27:24] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-11-06 21:27:28,299 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:27:28] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:27:28] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:27:28] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:27:28] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
rank 0 using MpiPoolSession to spawn MPI processes
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:27:36] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-11-06 21:27:41,023 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 104.27it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 572.52it/s]
Model init total -- 3.57s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:27:47,676 - INFO - flashinfer.jit: Loading JIT ops: norm
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
^CTraceback (most recent call last):
  File "/home/bobo/test_trt/test_trt.py", line 33, in <module>
    main()
  File "/home/bobo/test_trt/test_trt.py", line 8, in main
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 1125, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 942, in __init__
    super().__init__(model,
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 214, in __init__
    self._build_model()
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 1072, in _build_model
    self._executor = self._executor_cls.create(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/executor.py", line 423, in create
    return GenerationExecutorProxy(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/proxy.py", line 105, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/proxy.py", line 319, in _start_executor_workers
    if self.worker_init_status_queue.poll(1):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/ipc.py", line 110, in poll
    events = dict(self.poller.poll(timeout=timeout * 1000))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/zmq/sugar/poll.py", line 106, in poll
    return zmq_poll(self.sockets, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "zmq/backend/cython/_zmq.py", line 1680, in zmq.backend.cython._zmq.zmq_poll
  File "zmq/backend/cython/_zmq.py", line 179, in zmq.backend.cython._zmq._check_rc
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py'>
Traceback (most recent call last):
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1594, in _shutdown
    atexit_call()
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/mpi4py/futures/_core.py", line 172, in join_threads
    thread.join()
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1149, in join
    self._wait_for_tstate_lock()
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1169, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt:
[11/06/2025-21:30:34] [TRT-LLM] [E] Failed to send object: None
^C^C^CException ignored in atexit callback: <function shutdown_compile_workers at 0x7707940725c0>
Traceback (most recent call last):
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 113, in shutdown_compile_workers
    pool.shutdown()
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 239, in shutdown
    self.process.wait(300)
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/subprocess.py", line 1277, in wait
    self._wait(timeout=sigint_timeout)
  File "/home/bobo/miniforge3/envs/trt/lib/python3.12/subprocess.py", line 2047, in _wait
    time.sleep(delay)
KeyboardInterrupt:

Seeing the hang, I added another variable:
export TORCH_CUDA_ARCH_LIST="8.6;8.9"

The value comes from running:
python -c "import torch; print(torch.cuda.get_device_capability())"
It returned (8, 9), so I appended 8.9 and ran the test again.
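
Since the mapping from capability tuple to arch string is mechanical, a tiny helper can print the export line directly (my own convenience sketch):

# Print the TORCH_CUDA_ARCH_LIST entry for the local GPU, e.g. (8, 9) -> "8.9".
import torch
major, minor = torch.cuda.get_device_capability()
print(f'export TORCH_CUDA_ARCH_LIST="{major}.{minor}"')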

(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:30:46] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-11-06 21:30:50,629 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:30:50] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:30:50] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:30:50] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:30:50] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
rank 0 using MpiPoolSession to spawn MPI processes
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:31:03] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-11-06 21:31:07,496 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 61.60it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 557.09it/s]
Model init total -- 2.23s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:31:12,625 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-11-06 21:31:48,328 - INFO - flashinfer.jit: Finished loading JIT ops: norm
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.72 GiB for max tokens in paged KV cache (844512).
Processed requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.15it/s]
Prompt: 'Hello, my name is', Generated text: '[Your Name] and I am a [Your Position] at [Your Company]. I am writing to express my interest in the [Job Title] position at'
Prompt: 'The capital of France is', Generated text: 'Paris.\n\n2. B. C. The capital of Canada is Ottawa.\n\n3. A. C. The capital of Australia is Can'
Prompt: 'The future of AI is', Generated text: "bright, and it's not just for big companies. Small businesses can also benefit from AI technology. Here are some ways:\n\n1."

It ran through successfully.

Now let's try running it as a service:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Wait until it shows:

INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validation_alias' attribute with value 'max_tokens' was provided to the `Field()` function, which has no effect in the context it was used. 'validation_alias' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
  warnings.warn(

(The HTTP request visible in the server log came from a later access I made to verify the service actually works.)
Then start another WSL bash session and enter:

curl -X POST http://localhost:8000/v1/chat/completions     -H "Content-Type: application/json"     -H "Accept: application/json"     -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 32,
        "temperature": 0
    }'

Got the reply: "New York is a city in the northeastern United States, located on the eastern coast of the state of New York."


{"id":"chatcmpl-31b02f6ab4854863909850ab9688d8b1","object":"chat.completion","created":1762436547,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","content":"New York is a city in the northeastern United States, located on the eastern coast of the state of New York.","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":43,"total_tokens":70,"completion_tokens":27},"prompt_token_ids":null}(trt) 


I tried a few more prompts. It can answer, but comprehension is limited and it cannot speak Chinese.
Time to switch, to DeepSeek-R1-Distill-Qwen-1.5B:

trtllm-serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

It fell over, printing a pile of hints about how to control GPU memory usage:

Please refer to the TensorRT LLM documentation for information on how to control the memory usage through TensorRT LLM configuration options. Possible options include:
  Model: reduce max_num_tokens and/or shard the model weights across GPUs by enabling pipeline and/or tensor parallelism
  Sampler: reduce max_seq_len and/or max_attention_window_size
  Initial KV cache (temporary for KV cache size estimation): reduce max_num_tokens
  Drafter: reduce max_seq_len and/or max_draft_len
  Additional executor resources (temporary for KV cache size estimation): reduce max_num_tokens
  Model resources created during usage: reduce max_num_tokens
  KV cache: reduce free_gpu_memory_fraction
  Additional executor resources: reduce max_num_tokens
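
Acting on the KV cache hint is possible through the LLM API. Below is a minimal sketch; KvCacheConfig and free_gpu_memory_fraction come from the TensorRT-LLM llmapi, but I have not verified that this setting alone lets the 1.5B model start on a 4090:

# Retry the 1.5B model with a smaller KV cache budget (sketch, unverified).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5),
)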

At the moment, requests from an external cmd window do reach the server and show up as accesses, but they come back with a 400 error.
As for response time, a single RTX 4090 lands roughly between 170 ms and 340 ms, which is decent, probably helped by the small model. Memory consumption with default settings, however, is higher than on other LLM platforms: on this single 4090 I can normally run 7B to 14B quantized models, yet this is the first time a 1B model has nearly filled the card and a 1.5B model has failed to start.
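
For anyone wanting to reproduce the timing, here is roughly how a single round-trip can be measured (a sketch using the requests package; my numbers above were read off informally, not with this script):

# Time one chat completion round-trip against the local trtllm-serve endpoint.
import time
import requests

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 16,
}
t0 = time.perf_counter()
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(f"{(time.perf_counter() - t0) * 1000:.0f} ms, status {r.status_code}")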

Overall, thanks to the healthy ecosystem NVIDIA and the community have built, installation and adaptation took very little time, and most errors could be resolved quickly by reading the docs and the error messages carefully. Automatic downloads from HF, however, are a real hassle inside China; I tried downloading the model myself and switching to a local directory, but it did not seem to take effect, and the tutorial is not very specific on this point either.
From an industrial deployment standpoint, production sites may well run models that were never published online, so TRT should flesh out its documentation of local model loading, and perhaps even make local loading the preferred path.
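
For reference, the example's own comment says LLM() accepts a path to a local HF model, so a fully offline run should in principle look like the sketch below (the directory name is hypothetical, and I have not confirmed this fixes the behavior I saw):

# Load from a pre-downloaded local directory and forbid any hub lookup.
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # make accidental downloads fail fast

from tensorrt_llm import LLM
llm = LLM(model="./TinyLlama-1.1B-Chat-v1.0")  # local path instead of hub id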

Finally, thanks to CSDN for the chance to try this out. Much appreciated.
