• Mirror site: https://hf-mirror.com/facebook/m2m100_1.2B

  • Under Files and versions, download the files with wget -c (resumable), e.g. wget -c https://hf-mirror.com/facebook/m2m100_1.2B/resolve/main/pytorch_model.bin; rust_model.ot can be skipped.

  • sentencepiece is required; install it with pip (pip install sentencepiece), otherwise loading fails with:

ImportError: M2M100Tokenizer requires the SentencePiece library but it was not found in your environment.

  • spm_100k.model is no longer needed; the file is now named sentencepiece.bpe.model.

OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted

  • This model file matters: downloading it with wget yields an HTML page instead of the model. Click into the file on the page and download it directly, then check that the size is 2.42 MB; when something goes wrong, it is usually this file. A quick sanity check is sketched below.
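
A minimal sketch of that check (adjust the path to your download): a botched download is an HTML page rather than a binary model, and the size should match.

import os

path = "/home/x/fanyi_models/m2m100_1.2B/sentencepiece.bpe.model"
print(os.path.getsize(path))  # should be about 2.42 MB
with open(path, "rb") as f:
    print(f.read(15))  # an HTML error page starts with b"<!DOCTYPE html>" or b"<html"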

  • Got stuck for a long time on a boolean error, even suspecting a problem in how the transformers source reads the file; it turned out sentencepiece.bpe.model had been downloaded incorrectly:

AttributeError: 'bool' object has no attribute 'src_lang'

  • To check: print(type(tokenizer)); the correct output is:

<class 'transformers.models.m2m_100.tokenization_m2m_100.M2M100Tokenizer'>
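
As a standalone check (a minimal sketch, using the local model path from the scripts below):

from transformers import M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("/home/x/fanyi_models/m2m100_1.2B", local_files_only=True)
print(type(tokenizer))  # anything other than M2M100Tokenizer means the load went wrong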

  • To check whether the content of sentencepiece.bpe.model is a valid SentencePiece model:
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="<path to sentencepiece.bpe.model>")
print("✅ Model loaded successfully")

RuntimeError: Internal: could not parse ModelProto from /home/x/fanyi_models/m2m100_1.2B/sentencepiece.bpe.model

Also, for the Whisper install I chose medium.en; for the download URL refer to the latest version of
https://github.com/openai/whisper/blob/main/whisper/__init__.py
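
One way to print the direct download URL from an installed copy of Whisper (a sketch that relies on the private _MODELS dict, which may change between versions):

import whisper

# _MODELS maps model names to their direct download URLs (private API)
print(whisper._MODELS["medium.en"])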


The script code was generated by ChatGPT.

Script that calls Whisper:

import whisper

# 1️⃣ Load the model (use medium / large for Chinese and other multilingual audio; medium.en is English-only)
model = whisper.load_model("/home/x/fanyi_models/medium.en.pt")

# 2️⃣ Transcribe the audio file
result = model.transcribe("/home/x/fanyi_models/test_models/chris.m4a")

# Format a timestamp in seconds as SRT's HH:MM:SS,mmm
def fmt_time(s):
    h = int(s // 3600)
    m = int((s % 3600) // 60)
    s = s % 60
    ms = int((s - int(s)) * 1000)
    return f"{h:02d}:{m:02d}:{int(s):02d},{ms:03d}"

# 3️⃣ Save as an .srt file
with open("chris.srt", "w", encoding="utf-8") as f:
    for seg in result["segments"]:
        start = seg["start"]
        end = seg["end"]
        text = seg["text"].strip()
        f.write(f"{seg['id'] + 1}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n\n")

Script that generates bilingual subtitles by reading the segmented SRT directly and translating it; there may be a better way to do this.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
from tqdm import tqdm

# === Model path ===
model_path = "/home/x/fanyi_models/m2m100_1.2B"  # change to your model directory
tokenizer = M2M100Tokenizer.from_pretrained(model_path, local_files_only=True)
model = M2M100ForConditionalGeneration.from_pretrained(model_path, local_files_only=True)

tokenizer.src_lang = "en"
tgt_lang = "zh"

# === Input/output paths ===
input_file = "chris.srt"
output_file = "chris_bilingual.srt"

# Translate one subtitle block with M2M100
def translate(text):
    inputs = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

# === Read the original subtitles ===
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

output_lines = []
buffer = []

# === Translation loop ===
for line in tqdm(lines, desc="Translating"):
    if line.strip() == "":
        # End of a subtitle block
        if buffer:
            text = " ".join(buffer).strip()
            # Write the original English plus the Chinese translation
            output_lines.append(text + "\n")
            output_lines.append(translate(text).strip() + "\n\n")
            buffer = []
        else:
            output_lines.append("\n")
        continue

    if "-->" in line or line.strip().isdigit():
        output_lines.append(line)
    else:
        buffer.append(line.strip())

# Flush the last block in case the file does not end with a blank line
if buffer:
    text = " ".join(buffer).strip()
    output_lines.append(text + "\n")
    output_lines.append(translate(text).strip() + "\n\n")

# === Save the bilingual subtitles ===
with open(output_file, "w", encoding="utf-8") as f:
    f.writelines(output_lines)

print(f"✅ Bilingual subtitles written to: {output_file}")

Download the yt-dlp binary from the GitHub release assets and run it; export your cookies in Netscape format to a cookies.txt file with the Cookie-Editor browser extension, and downloading works smoothly.
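
The same cookies.txt also works from Python via the yt-dlp package (a minimal sketch; the URL is a placeholder):

import yt_dlp  # pip install yt-dlp

ydl_opts = {
    "cookiefile": "cookies.txt",   # Netscape-format cookies exported by Cookie-Editor
    "outtmpl": "%(title)s.%(ext)s",
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=..."])  # placeholder URL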
