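🔥 Introduction: Why Has Qwen3 Become a Developer Favorite?

Qwen3 is the latest open-source large language model from Alibaba's Tongyi Lab. Right after its release it reached the top of the open-source LLM leaderboards and overtook LLaMA as the most popular LLM in the open-source model community. For both research and production use, Qwen3 has become one of the preferred choices for developers.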

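If you would like hands-on help with fine-tuning, add WeChat: DaBengDaBeng for a one-on-one model optimization service aimed at enterprise deployments.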

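📌 Core Concept: Full-Parameter Fine-Tuning vs. Other Fine-Tuning Methods

What is full-parameter fine-tuning?

Full-parameter fine-tuning updates and optimizes every parameter of a pretrained large model, as opposed to partial-parameter fine-tuning and LoRA fine-tuning, which train only a subset of the weights or small added adapter matrices. Specifically:

  • Scope: updates all weights, from the bottom token-embedding layer through the intermediate feature-extraction layers to the top task-adaptation layers
  • Advantages: makes full use of the pretrained model's generalization ability and adapts deeply to the target task; it tends to perform better when the data differs substantially from the pretraining data or the task is complex
  • Challenges: requires more compute (this example needs 32GB of GPU memory) and carries a risk of overfitting

Suitable scenarios: domain-specific question answering, high-precision text generation, and other tasks that demand high model performance.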
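To make the resource challenge above more concrete, here is a rough back-of-the-envelope estimate of the memory taken by weights, gradients, and optimizer states during full-parameter fine-tuning. The function name, the byte sizes, and the assumption of an Adam-style optimizer with two state tensors per parameter are illustrative assumptions, not figures from this tutorial. Activations and framework overhead are ignored, so real usage will be higher.

```python
def estimate_training_memory_gb(
    num_params: float,
    bytes_per_weight: int = 2,   # assumption: bf16 weights
    bytes_per_grad: int = 2,     # assumption: bf16 gradients
    optimizer_states: int = 2,   # assumption: Adam-style first and second moments
    bytes_per_state: int = 2,    # assumption: states stored in the weight dtype
) -> float:
    """Rough memory for weights + gradients + optimizer states, in GB (1e9 bytes)."""
    bytes_per_param = bytes_per_weight + bytes_per_grad + optimizer_states * bytes_per_state
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    nominal_params = 1.7e9  # nominal size implied by the name "Qwen3-1.7B"
    print(f"all bf16: {estimate_training_memory_gb(nominal_params):.1f} GB")
    print(f"fp32 optimizer states: {estimate_training_memory_gb(nominal_params, bytes_per_state=4):.1f} GB")
```

With LoRA-style methods only the small adapter matrices need gradients and optimizer states, which is why their memory footprint is so much lower than the estimate above.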

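🚀 Hands-On: Fine-Tuning Qwen3-1.7B on Medical Data

1. Environment Setup

bash

# Requirements
#   Python >= 3.9
#   NVIDIA or Ascend accelerator (32GB of memory recommended)
#   PyTorch with a matching CUDA environment

# Install all dependencies in one command (Qwen3 needs transformers >= 4.51.0)
pip install swanlab modelscope==1.22.0 "transformers>=4.51.0" datasets==3.2.0 accelerate pandas addict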

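2. Preparing the Dataset

This example uses the delicate_medical_r1_data medical dialogue dataset (2,000+ examples). Each example contains:

  • question: the user's question (the model input)
  • think: the model's reasoning process (similar to the reasoning traces of DeepSeek R1)
  • answer: the final answer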
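Because the later steps rely on all three fields being present, it can help to check each record before training. The sketch below is one possible check; `validate_record` and the placeholder record are hypothetical and not part of the dataset or of any library.

```python
REQUIRED_FIELDS = ("question", "think", "answer")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one dataset record (empty list means OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"missing or non-string field: {field}")
        elif not value.strip():
            problems.append(f"empty field: {field}")
    return problems


# Hypothetical placeholder record, not taken from the real dataset
sample = {
    "question": "placeholder question",
    "think": "placeholder reasoning",
    "answer": "placeholder answer",
}
print(validate_record(sample))                          # []
print(validate_record({"question": "only a question"}))
```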

Data processing code:

python

from modelscope.msdatasets import MsDataset
import json
import random

random.seed(42)

# Load the dataset from ModelScope
ds = MsDataset.load('krisfu/delicate_medical_r1_data', subset_name='default', split='train')
data_list = list(ds)
random.shuffle(data_list)

# Split into training (90%) and validation (10%) sets
split_idx = int(len(data_list) * 0.9)
train_data = data_list[:split_idx]
val_data = data_list[split_idx:]

# Save both splits in JSONL format (one JSON object per line)
with open('train.jsonl', 'w', encoding='utf-8') as f:
    for item in train_data:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

with open('val.jsonl', 'w', encoding='utf-8') as f:
    for item in val_data:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

print(f"Train Set Size: {len(train_data)}")
print(f"Val Set Size: {len(val_data)}")
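The shuffle-then-slice split can also be wrapped in a small function that uses its own random generator, so the result is reproducible without changing the global random state. This is only a sketch: `split_train_val` is a hypothetical helper, and integers stand in for real dataset records.

```python
import random


def split_train_val(items, train_ratio=0.9, seed=42):
    """Shuffle a copy of `items` deterministically and split it into (train, val)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG: global random state is untouched
    split_idx = int(len(shuffled) * train_ratio)
    return shuffled[:split_idx], shuffled[split_idx:]


# Stand-in data: integers instead of real dataset records
train, val = split_train_val(range(20))
print(len(train), len(val))  # 18 2
```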

3. Loading and Configuring the Model

python

from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch

# Download Qwen3-1.7B from ModelScope; snapshot_download returns the local model directory
model_dir = snapshot_download("Qwen/Qwen3-1.7B", cache_dir="./", revision="master")

# Load the tokenizer and the model from the downloaded directory
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

4. Training Configuration and SwanLab Monitoring

Training is monitored with SwanLab (roughly a Chinese-developed counterpart to Weights & Biases plus TensorBoard):

python

import os

from transformers import TrainingArguments, Trainer
import swanlab

# Configure the SwanLab project
os.environ["SWANLAB_PROJECT"] = "qwen3-sft-medical"
PROMPT = "你是一个医学专家,你需要根据用户的问题,给出带有思考的回答。"
MAX_LENGTH = 2048

swanlab.config.update({
    "model": "Qwen/Qwen3-1.7B",
    "prompt": PROMPT,
    "data_max_length": MAX_LENGTH,
})

# Training arguments
args = TrainingArguments(
    output_dir="./output/Qwen3-1.7B",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # 4 accumulated batches per optimizer step
    eval_strategy="steps",
    eval_steps=100,                  # evaluate every 100 steps
    logging_steps=10,
    num_train_epochs=2,
    save_steps=400,                  # write a checkpoint every 400 steps
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,     # trade extra compute for lower memory use
    report_to="swanlab",
    run_name="qwen3-1.7B",
)
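To see how the batch settings above translate into optimizer steps and checkpoints, the following sketch does the arithmetic. The helper names and the training-set size are hypothetical, and the step count is an approximation; the Trainer's exact rounding may differ between versions.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_total_steps(num_examples, per_device_batch, grad_accum_steps, epochs, num_devices=1):
    """Approximate optimizer steps over the whole run (rounding up per epoch)."""
    per_step = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / per_step) * epochs


def interval_checkpoints(total_steps, save_steps):
    """Steps at which an interval-based checkpoint would be written."""
    return list(range(save_steps, total_steps + 1, save_steps))


# Batch settings from the TrainingArguments above; the example count is a hypothetical placeholder
hypothetical_train_examples = 1800
total = approx_total_steps(hypothetical_train_examples, per_device_batch=1, grad_accum_steps=4, epochs=2)
print(effective_batch_size(1, 4))        # 4
print(total)                             # 900
print(interval_checkpoints(total, 400))  # [400, 800]
```

Because the Trainer names checkpoint directories `checkpoint-<step>`, a calculation like this helps predict which directories to expect under `output_dir`.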

5. Complete Training Script

python

import json
import pandas as pd
import torch
from datasets import Dataset
from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import os
import swanlab

# Environment and configuration
os.environ["SWANLAB_PROJECT"] = "qwen3-sft-medical"
PROMPT = "你是一个医学专家,你需要根据用户的问题,给出带有思考的回答。"
MAX_LENGTH = 2048
swanlab.config.update({
    "model": "Qwen/Qwen3-1.7B",
    "prompt": PROMPT,
    "data_max_length": MAX_LENGTH,
})

# Convert the raw records into instruction/input/output format
def dataset_jsonl_transfer(origin_path, new_path):
    messages = []
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            data = json.loads(line)
            input_text = data["question"]
            # Wrap the reasoning in <think> tags, followed by the final answer
            output_text = f"<think>{data['think']}</think> \n {data['answer']}"
            message = {
                "instruction": PROMPT,
                "input": f"{input_text}",
                "output": output_text,
            }
            messages.append(message)
    # Write the converted records back out as JSONL
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")

# Tokenize one example and mask the prompt tokens in the labels
def process_func(example):
    # Prompt in ChatML format: system message, user question, then the assistant header
    instruction = tokenizer(
        f"<|im_start|>system\n{PROMPT}<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    # Append the pad token as the end-of-sequence marker
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # -100 excludes the prompt tokens from the loss, so only the response is learned
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    # Truncate sequences that exceed MAX_LENGTH
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Prediction helper
def predict(messages, model, tokenizer):
    # Render the chat messages into the model's prompt format
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Move the inputs to the device the model was loaded on
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Pass input_ids and attention_mask together
    generated_ids = model.generate(**model_inputs, max_new_tokens=MAX_LENGTH)
    # Drop the prompt tokens so that only newly generated tokens are decoded
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

# Load the model; snapshot_download returns the local model directory
model_dir = snapshot_download("Qwen/Qwen3-1.7B", cache_dir="/root/autodl-tmp/", revision="master")
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# Make the embedding outputs require gradients (commonly paired with gradient checkpointing)
model.enable_input_require_grads()

# Convert the raw JSONL files (skipped if the converted files already exist)
train_dataset_path = "train.jsonl"
test_dataset_path = "val.jsonl"
train_jsonl_new_path = "train_format.jsonl"
test_jsonl_new_path = "val_format.jsonl"

if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
if not os.path.exists(test_jsonl_new_path):
    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)

# Load and tokenize the datasets
train_df = pd.read_json(train_jsonl_new_path, lines=True)
train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

eval_df = pd.read_json(test_jsonl_new_path, lines=True)
eval_ds = Dataset.from_pandas(eval_df)
eval_dataset = eval_ds.map(process_func, remove_columns=eval_ds.column_names)

# Training arguments
args = TrainingArguments(
    output_dir="/root/autodl-tmp/output/Qwen3-1.7B",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=10,
    num_train_epochs=2,
    save_steps=400,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="swanlab",
    run_name="qwen3-1.7B",
)

# Start training
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()

# Spot-check the model on the first 3 validation examples
test_df = pd.read_json(test_jsonl_new_path, lines=True)[:3]
test_text_list = []
for index, row in test_df.iterrows():
    instruction = row['instruction']
    input_value = row['input']
    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"}
    ]
    response = predict(messages, model, tokenizer)
    response_text = f"""
    Question: {input_value}
    LLM:{response}
    """
    test_text_list.append(swanlab.Text(response_text))
    print(response_text)

swanlab.log({"Prediction": test_text_list})
swanlab.finish()
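The label masking in `process_func` is easier to see with toy numbers. The sketch below reproduces the same idea without a tokenizer: prompt positions get the label -100 so the loss covers only the response and the end marker. `build_example` and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def build_example(prompt_ids, response_ids, end_token_id, max_length):
    """Concatenate prompt and response ids and mask the prompt positions in the labels."""
    input_ids = prompt_ids + response_ids + [end_token_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [end_token_id]
    # Truncate all three sequences to the same maximum length
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


# Toy token ids chosen for illustration; a real tokenizer would produce these
example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], end_token_id=0, max_length=8)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 0]
print(example["labels"])     # [-100, -100, -100, 21, 22, 0]
```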

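6. Training Results and Model Inference

Training monitoring and overfitting analysis

The SwanLab training dashboard shows that:

  • The blue curve (train loss) decreases in a stepwise fashion as training proceeds
  • The green curve (eval loss) decreases during the first epoch and rises during the second, which indicates overfitting
  • Conclusion: with a dataset on the order of 2,000 examples, one epoch of full-parameter fine-tuning is enough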
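The pattern described above, eval loss falling and then rising, can also be detected programmatically from the logged evaluation history. The sketch below is a minimal illustration; the helper names are hypothetical and the loss values are invented for the example, not results from this run.

```python
def best_eval_step(eval_history):
    """Return the (step, loss) pair with the lowest eval loss."""
    return min(eval_history, key=lambda pair: pair[1])


def overfitting_started(eval_history, patience=2):
    """True if eval loss failed to improve for the last `patience` evaluations."""
    if len(eval_history) <= patience:
        return False
    best_before = min(loss for _, loss in eval_history[:-patience])
    return all(loss >= best_before for _, loss in eval_history[-patience:])


# Hypothetical (step, eval_loss) values for illustration only, not results from this run
history = [(100, 1.60), (200, 1.52), (300, 1.48), (400, 1.47), (500, 1.50), (600, 1.55)]
print(best_eval_step(history))       # (400, 1.47)
print(overfitting_started(history))  # True
```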

Sample model output

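Input question:

"医生,我最近胃部不适,听说有几种抗溃疡药物可以治疗,您能详细介绍一下这些药物的分类、作用机制以及它们是如何影响胃黏膜的保护与损伤平衡的吗?"

(English: "Doctor, I have had stomach discomfort recently and heard that several kinds of anti-ulcer drugs can treat it. Could you explain in detail how these drugs are classified, how they work, and how they affect the balance between protection of and damage to the gastric mucosa?")

Model output:

plaintext

<think>

嗯,用户问的是抗溃疡药物的分类、作用机制,以及它们如何影响胃黏膜的保护和损伤平衡...(reasoning omitted)

</think>

当然可以。抗溃疡药物主要分为四类:抑酸药、胃黏膜保护剂、促胃动力药和抗幽门螺杆菌药物...(full answer omitted)

(English: the reasoning begins "Hmm, the user is asking about the classification and mechanisms of anti-ulcer drugs, and how they affect the balance between protection of and damage to the gastric mucosa..."; the answer begins "Certainly. Anti-ulcer drugs fall into four main classes: acid suppressants, gastric mucosal protectants, prokinetic agents, and anti-Helicobacter pylori drugs...")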

Running inference with the fine-tuned model

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict(messages, model, tokenizer):
    # Render the chat messages into the model's prompt format
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Move the inputs to the device the model was loaded on
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Pass input_ids and attention_mask together
    generated_ids = model.generate(**model_inputs, max_new_tokens=2048)
    # Drop the prompt tokens so that only newly generated tokens are decoded
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

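# Load the fine-tuned model; point both paths at a checkpoint directory
# that your run actually produced under the training output_dir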
tokenizer = AutoTokenizer.from_pretrained(
    "./output/Qwen3-1.7B/checkpoint-1000", 
    use_fast=False, 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "./output/Qwen3-1.7B/checkpoint-1000", 
    device_map="auto", 
    torch_dtype=torch.bfloat16
)

# Run a test inference
test_texts = {
    'instruction': "你是一个医学专家,你需要根据用户的问题,给出带有思考的回答。",
    'input': "医生,我最近被诊断为糖尿病,听说碳水化合物的选择很重要,我应该选择什么样的碳水化合物呢?"
}
instruction = test_texts['instruction']
input_value = test_texts['input']
messages = [
    {"role": "system", "content": f"{instruction}"},
    {"role": "user", "content": f"{input_value}"}
]
response = predict(messages, model, tokenizer)
print(response)
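When the response contains both the reasoning and the final answer, it is often convenient to separate the two before showing the answer to a user. The sketch below assumes the reasoning is wrapped in `<think>`…`</think>` tags; `split_reasoning` is a hypothetical helper, and the tag names are parameters so they can be adjusted to whatever markers were used in training.

```python
import re


def split_reasoning(response, open_tag="<think>", close_tag="</think>"):
    """Split a response into (reasoning, answer); reasoning is '' if no tag pair is found."""
    pattern = re.escape(open_tag) + r"(.*?)" + re.escape(close_tag)
    match = re.search(pattern, response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer


# Placeholder text standing in for a real model response
reasoning, answer = split_reasoning("<think>step-by-step notes</think>\n final reply")
print(reasoning)  # step-by-step notes
print(answer)     # final reply
```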


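🌟 Bonus: Hands-On Fine-Tuning Service

If you run into problems while fine-tuning a model, or need a customized fine-tuning service, add WeChat: DaBengDaBeng for professional technical support:

  • One-on-one model fine-tuning consultation
  • Enterprise-grade LLM application solutions
  • Custom dataset processing and optimization
  • Model inference deployment and performance optimization

Follow me for more hands-on LLM fine-tuning case studies (finance, law, customer service, and other domains), and remember to like and bookmark this post so you can find it again!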
