[AI Learning] Large Model Quantization in Practice
Following a tutorial, I quantized the InternLM2-Chat-7B model to 4-bit and 8-bit. It took some effort, so here is a record of the process.
Preliminaries
conda create -n lmdeploy python=3.10
conda activate lmdeploy
pip install lmdeploy
pip install timm
1. Model Quantization (AWQ)
The LMDeploy quantization module currently supports the AWQ quantization algorithm.
Quantization converts high-precision numbers into lower-precision ones. The low-precision values take up far less space on disk and in memory, reducing resource requirements.
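As an aside, here is a minimal NumPy sketch of what asymmetric per-group quantization means (group size 128, matching the --w-group-size 128 used below). It is purely illustrative and not LMDeploy's internal implementation:

import numpy as np

def quantize_group_asym_4bit(w):
    # Map one group of float weights onto the 4-bit unsigned range [0, 15]
    qmin, qmax = 0, 15
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Reconstruct approximate float weights from the quantized values
    return (q.astype(np.float32) - zero_point) * scale

group = np.random.randn(128).astype(np.float32)  # one group of 128 weights
q, s, z = quantize_group_asym_4bit(group)
print("max abs error:", np.abs(dequantize(q, s, z) - group).max())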
First activate the environment:
conda activate lmdeploy
Then run:
Note: this command uses LMDeploy to apply 4-bit AWQ (Activation-aware Weight Quantization) to the InternLM2-Chat-7B model, reducing its compute and storage cost while preserving as much model quality as possible.
export HF_MODEL=/root/share/model_repos/internlm2-chat-7b
export WORK_DIR=/root/internlm2-chat-7b-4bit
lmdeploy lite auto_awq $HF_MODEL --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --batch-size 1 --work-dir $WORK_DIR
This failed with the following error:
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 27.60it/s]
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.norm to GPU.
Move output to CPU.
Loading calibrate dataset …
Traceback (most recent call last):
  File "/root/.conda/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 39, in run
    args.run(args)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 111, in auto_awq
    auto_awq(**kwargs)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 87, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 292, in calibrate
    calib_loader, _ = get_calib_loaders(calib_dataset, tokenizer, nsamples=calib_samples, seqlen=calib_seqlen)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/utils/calib_dataloader.py", line 296, in get_calib_loaders
    return get_ptb(tokenizer, nsamples, seed, seqlen)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/utils/calib_dataloader.py", line 58, in get_ptb
    traindata = load_dataset('ptb_text_only', 'penn_treebank', split='train', trust_remote_code=True)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py", line 2062, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py", line 1819, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
TypeError: 'NoneType' object is not callable
The error shows that although the datasets library is installed, loading the PTB (Penn Treebank) dataset still fails: the dataset builder (builder_cls) resolves to None, so calling it raises the TypeError.
To address the error, try the following:
pip uninstall -y datasets && pip install --no-cache-dir "datasets==2.19.2" "huggingface_hub>=0.20.0"
This command fixes the problem: it pins a compatible combination of Hugging Face library versions. In detail:
- Remove the old version completely: pip uninstall -y datasets clears out any conflicting or corrupted installation.
- Pin a known-good version combination:
  - datasets==2.19.2: this release (2024) fixes several dataset-loading bugs, notably compatibility issues with fsspec (the filesystem library).
  - huggingface_hub>=0.20.0: newer Hub versions improve download reliability and cache management.
- Skip the local cache: --no-cache-dir avoids reusing possibly corrupted cached packages and pulls clean ones from PyPI.
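Before rerunning the full quantization, you can quickly confirm that the calibration dataset now loads. This is the same load_dataset call that failed in the traceback above (a standalone check, not part of the lmdeploy workflow):

from datasets import load_dataset

# The call that previously raised "TypeError: 'NoneType' object is not callable"
traindata = load_dataset('ptb_text_only', 'penn_treebank',
                         split='train', trust_remote_code=True)
print(traindata)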
Then run the quantization again:
lmdeploy lite auto_awq $HF_MODEL --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --batch-size 1 --work-dir $WORK_DIR
2. W8A8 Quantization
LMDeploy supports quantizing and serving models with 8-bit integers (INT8) or 8-bit floating point (FP8).
- pip install lmdeploy[all]
lmdeploy provides the command-line tool lmdeploy lite smooth_quant for this. Its --quant-dtype argument selects 8-bit integer or 8-bit floating-point quantization. Run lmdeploy lite smooth_quant --help for the full list of options.
- INT8 quantization
export HF_MODEL=/root/share/model_repos/internlm2-chat-7b
export WORK_DIR=/root/internlm2-chat-7b-in8bit
lmdeploy lite $HF_MODEL --work-dir $WORK_DIR --quant-dtype int8
- FP8 quantization
export HF_MODEL=/root/share/model_repos/internlm2-chat-7b
export WORK_DIR=/root/internlm2-chat-7b-fp8bit
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --quant-dtype fp8
Running the INT8 command first produced the following error:
usage: lmdeploy lite [-h] {auto_awq,auto_gptq,calibrate,smooth_quant} ...
lmdeploy lite: error: argument {auto_awq,auto_gptq,calibrate,smooth_quant}: invalid choice: '/root/share/model_repos/internlm2-chat-7b' (choose from 'auto_awq', 'auto_gptq', 'calibrate', 'smooth_quant')
The lmdeploy lite invocation was malformed: the lite subcommand requires one of the quantization methods (auto_awq, auto_gptq, calibrate, smooth_quant) as its first argument, so passing the model path /root/share/model_repos/internlm2-chat-7b in that position makes argument parsing fail. For INT8 quantization the method to use is smooth_quant; the corrected command is:
export HF_MODEL=/root/share/model_repos/internlm2-chat-7b
export WORK_DIR=/root/internlm2-chat-7b-in8bit
# INT8 quantization via the smooth_quant method
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --quant-dtype int8
Then run the FP8 quantization command:
export HF_MODEL=/root/share/model_repos/internlm2-chat-7b
export WORK_DIR=/root/internlm2-chat-7b-fp8bit
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --quant-dtype fp8
This time it runs and completes normally.
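To sanity-check the result, the W8A8 model can be loaded with the lmdeploy pipeline. A minimal sketch, assuming the smooth_quant output is served by the PyTorch backend (PytorchEngineConfig); the prompt is arbitrary:

from lmdeploy import pipeline, PytorchEngineConfig

# Assumption: the W8A8 model produced by smooth_quant runs on the PyTorch engine
pipe = pipeline('/root/internlm2-chat-7b-in8bit',
                backend_config=PytorchEngineConfig())
print(pipe(['Hi, pls intro yourself']))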
3. Key-Value (KV) Cache Quantization
Since v0.4.0, LMDeploy supports online int4/int8 quantization of the kv cache, using per-head, per-token asymmetric quantization.
Enabling kv quantization in LMDeploy only requires setting the quant_policy parameter: quant_policy=4 selects kv int4 quantization, and quant_policy=8 selects kv int8 quantization.
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline("/root/share/model_repos/internlm2-chat-7b",
backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
Running the code above produces:
[TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 32768.
2025-06-12 19:58:09,923 - lmdeploy - WARNING - turbomind.py:252 - get 227 model params
2025-06-12 19:58:22,062 - lmdeploy - WARNING - async_engine.py:645 - GenerationConfig: GenerationConfig(n=1, max_new_tokens=512, do_sample=False, top_p=1.0, top_k=50, min_p=0.0, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=None, stop_words=None, bad_words=None, stop_token_ids=[2, 92540, 92542], bad_token_ids=None, min_new_tokens=None, skip_special_tokens=True, spaces_between_special_tokens=True, logprobs=None, response_format=None, logits_processors=None, output_logits=None, output_last_hidden_state=None)
2025-06-12 19:58:22,062 - lmdeploy - WARNING - async_engine.py:646 - Since v0.6.0, lmdeploy add do_sample in GenerationConfig. It defaults to False, meaning greedy decoding. Please set do_sample=True if sampling decoding is needed
[Response(text=‘你好,我是书生·浦语,是上海人工智能实验室开发的一款语言模型。我致力于通过执行常见的基于语言的任务和提供建议来帮助人类。我能够回答问题、提供定义和解释、将文本从一种语言翻译成另一种语言、总结文本、生成文本、编写故事、分析情感、提供推荐、开发算法、编写代码以及其他任何基于语言的任务。但我不能看、听、尝、触摸、闻、移动、与物理世界交互、感受情感或体验感官输入、执行需要身体能力的任务。’, generate_token_len=110, input_token_len=108, finish_reason=‘stop’, token_ids=[77230, 60353, 68734, 60628, 60384, 60721, 62442, 60752, 60353, 60357, 68589, 76659, 71581, 68640, 73453, 68790, 70218, 60355, 60363, 71890, 68330, 69056, 70929, 70513, 68790, 75720, 60381, 68403, 68571, 60383, 68417, 69497, 60355, 60363, 68445, 85625, 60359, 68403, 69248, 60381, 69478, 60359, 60530, 69684, 60577, 68617, 68790, 70414, 60397, 75186, 68790, 60359, 68621, 69684, 60359, 70563, 69684, 60359, 71976, 68654, 60359, 68578, 70680, 60359, 68403, 68667, 60359, 68640, 73060, 60359, 71976, 69681, 83289, 68574, 70513, 68790, 75720, 60355, 70452, 68336, 60422, 60359, 60998, 60359, 61830, 60359, 73547, 60359, 61523, 60359, 68788, 60359, 60510, 69473, 68339, 75883, 60359, 69377, 70680, 60535, 68954, 87877, 68412, 60359, 69056, 68266, 68514, 74781, 68565, 60355], logprobs=None, logits=None, last_hidden_state=None, index=0), Response(text=“Shanghai is a vibrant and bustling city located in the eastern part of China. It is known for its rich history, diverse culture, and modern architecture. Some of the notable landmarks in Shanghai include the iconic Oriental Pearl Tower, the historic Yu Garden, and the famous Bund waterfront area. The city is also famous for its delicious cuisine, such as xiaolongbao (soup dumplings) and shengjianbao (pan-fried buns). Shanghai is a major economic and financial center in China, and it plays a crucial role in the country’s global influence.”, generate_token_len=124, input_token_len=105, finish_reason=‘stop’, token_ids=[2166, 30326, 505, 395, 33083, 454, 20988, 2880, 3446, 7553, 435, 410, 23478, 1087, 446, 5772, 281, 1226, 505, 4037, 500, 1326, 9209, 3995, 328, 16937, 7813, 328, 454, 6637, 17786, 281, 4489, 446, 410, 27588, 58619, 435, 36956, 3089, 410, 26712, 37764, 6453, 36187, 22202, 328, 410, 18183, 27666, 19175, 328, 454, 410, 11396, 29887, 3181, 7101, 3247, 281, 707, 3446, 505, 1225, 11396, 500, 1326, 18067, 35009, 328, 1893, 569, 991, 817, 468, 775, 260, 3602, 451, 269, 13484, 425, 503, 631, 953, 313, 454, 688, 960, 326, 1246, 260, 3602, 451, 984, 2371, 4649, 424, 11012, 699, 36956, 505, 395, 3759, 7105, 454, 6050, 4285, 435, 5772, 328, 454, 563, 11240, 395, 16721, 3638, 435, 410, 3311, 725, 3805, 10311, 281], logprobs=None, logits=None, last_hidden_state=None, index=1)]
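The weight quantization from step 1 and the kv cache quantization above can also be combined in one engine config. A minimal sketch, assuming the 4-bit AWQ directory produced in step 1 (/root/internlm2-chat-7b-4bit) and the same model_format='awq' option that the evaluation config below uses:

from lmdeploy import pipeline, TurbomindEngineConfig

# 4-bit AWQ weights (step 1) combined with an int8 kv cache (this step)
engine_config = TurbomindEngineConfig(model_format='awq', quant_policy=8)
pipe = pipeline('/root/internlm2-chat-7b-4bit', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))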
4. Model Evaluation
Evaluate the model with OpenCompass.
Step 1: install OpenCompass and lmdeploy.
Evaluating a model requires an evaluation config that specifies the datasets, the model, and the inference parameters.
Taking internlm2-chat-7b as an example, the configuration is as follows:
# configure the dataset
from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import \
        gsm8k_datasets
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

# configure lmdeploy
from opencompass.models import TurboMindModelwithChatTemplate

# configure the model
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='internlm2-chat-7b-lmdeploy',
        # model path, which can be the address of a model repository on the Hugging Face Hub or a local path
        path='internlm/internlm2-chat-7b',
        # inference backend of LMDeploy. It can be either 'turbomind' or 'pytorch'.
        # If the model is not supported by 'turbomind', it will fallback to
        # 'pytorch'
        backend='turbomind',
        # For the detailed engine config and generation config, please refer to
        # https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py
        engine_config=dict(tp=1),
        gen_config=dict(do_sample=False),
        # the max size of the context window
        max_seq_len=7168,
        # the max number of new tokens
        max_out_len=1024,
        # the max number of prompts that LMDeploy receives
        # in `generate` function
        batch_size=5000,
        run_cfg=dict(num_gpus=1),
    )
]
Save the above configuration to a file, e.g. configs/eval_internlm2_lmdeploy.py. Then, from the OpenCompass project directory, run the following command to obtain the evaluation results:
python run.py opencompass/configs/eval_internlm2_lmdeploy.py -w outputs
Evaluating the internlm2-chat-7b-4bit model
Activate the evaluation environment:
conda activate opencompass
Point the model path in the evaluation config at the quantized model directory ($WORK_DIR):
path='/root/internlm2-chat-7b-4bit',  # the quantized model directory
The full configuration:
# configure the dataset
from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import \
        gsm8k_datasets
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

# configure lmdeploy
from opencompass.models import TurboMindModelwithChatTemplate

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='internlm2-chat-7b-4bit-awq',    # new model abbreviation
        path='/root/internlm2-chat-7b-4bit',  # the quantized model
        backend='turbomind',
        engine_config=dict(
            tp=1,
            model_format='awq',
            quant_policy=4,
            rope_scaling_factor=1.0
        ),
        gen_config=dict(do_sample=False),
        max_seq_len=7168,
        max_out_len=1024,
        batch_size=5000,
        run_cfg=dict(num_gpus=1),
    )
]
Save this configuration as opencompass/configs/eval_internlm2_lmdeploy.py, then run the evaluation from the opencompass project directory:
python run.py opencompass/configs/eval_internlm2_lmdeploy.py -w outputs
But the full run is too slow, so evaluate only a few datasets instead:
with read_base():
    # import only what is needed
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .summarizers.medium import summarizer

# filter the datasets outside the with block
datasets = [
    [d for d in mmlu_datasets if 'college_biology' in d['name']][0],   # one MMLU subset
    [d for d in ceval_datasets if 'computer_network' in d['name']][0]  # one C-Eval subset
]
When the run finished, the evaluation itself had actually succeeded, but the summary report did not show the scores. The problem is the summarizer configuration: OpenCompass's default summarizer prints the full framework of dataset categories, even for datasets that were not evaluated. Modify the config as follows:
with read_base():
    # import only what is needed
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .summarizers.medium import summarizer

# show only the datasets that were actually evaluated
summarizer['dataset_abbrs'] = [
    'lukaemon_mmlu_college_biology',
    'ceval-computer_network',
]

# filter the datasets outside the with block
datasets = [
    [d for d in mmlu_datasets if 'college_biology' in d['name']][0],   # one MMLU subset
    [d for d in ceval_datasets if 'computer_network' in d['name']][0]  # one C-Eval subset
]
With that change the evaluation runs cleanly; the results are:
| dataset | version | metric | mode | internlm2-chat-7b-4bit-awq |
|---|---|---|---|---|
| lukaemon_mmlu_college_biology | 8c2e29 | accuracy | gen | 64.58 |
| ceval-computer_network | db9ce2 | accuracy | gen | 57.89 |