[Audio Processing] Simple Silence Detection and Removal for Audio in Python

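This article covers two simple approaches. The energy-based method computes the energy of each audio frame (usually root-mean-square energy or short-time energy) and treats a frame as silent when its energy falls below a threshold. The zero-crossing-rate method counts how often the signal crosses the zero axis; digitally silent segments have a zero-crossing rate near zero. A worked example then uses the zero-crossing rate to trim excess trailing silence: once the trailing silence reaches 300 ms, the excess part is cut off.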

kakaZhui · Published 2025-01-26 00:00:00

Technical Approach

In Python, simple silence detection and removal can be done with the following methods:

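1. Energy-based silence detection

This method computes the energy of each audio frame (usually root-mean-square energy or short-time energy) to decide whether the frame is silent. When the energy falls below a chosen threshold, the frame is treated as silence.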
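Written out, the per-frame RMS energy and the silence decision look roughly as follows. The symbols are labels introduced here for illustration: N is the frame length, H the hop length, x the sample sequence, and θ the energy threshold. Note that librosa centers and pads frames by default, so its frame indices are offset relative to this un-centered form.

```latex
E_k = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x[kH + n]^2},
\qquad \text{frame } k \text{ is silent if } E_k < \theta
```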

Required libraries:

librosa: audio feature extraction and analysis (pip install librosa)
numpy: numerical computation (pip install numpy)
soundfile (optional): reading and writing audio files (pip install soundfile)

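Steps:

Load the audio file: use librosa.load() or soundfile.read().
Frame the signal: split the audio signal into short-time frames.
Compute per-frame energy: use librosa.feature.rms() for the RMS energy, or compute the sum of squares of each frame.
Set a threshold: choose an energy threshold suited to the background noise level of the audio.
Detect silence: compare each frame's energy with the threshold; frames below it are marked as silent.
Remove silence: based on the silence marks, remove the silent frames from the signal or replace them with zeros. (The code example below takes the simplest route and only trims leading and trailing silence.)
Save the audio (optional): use soundfile.write() to save the processed audio. (librosa.output.write_wav() was removed in librosa 0.8.0 and is not available in current releases.)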
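The threshold step above leaves the actual value open. One possible heuristic, which is not from the original article, is to treat a low percentile of the frame energies as the noise floor and place the threshold a fixed margin above it. The sketch below uses NumPy only; `frame_rms`, `estimate_energy_threshold`, `noise_percentile`, and `margin` are hypothetical names, and the heuristic assumes that at least roughly 10% of the frames contain only background noise.

```python
import numpy as np

def frame_rms(y, frame_length=2048, hop_length=512):
    """RMS of each frame (no centering or padding); assumes len(y) >= frame_length."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(y[i * hop_length : i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

def estimate_energy_threshold(rms, noise_percentile=10, margin=3.0):
    """Hypothetical heuristic: a low percentile of the frame energies
    approximates the noise floor; the threshold sits a margin above it."""
    noise_floor = np.percentile(rms, noise_percentile)
    return margin * noise_floor

# Synthetic signal: quiet noise, a loud burst, quiet noise again (1 s each at 22050 Hz)
rng = np.random.default_rng(0)
sr = 22050
quiet_head = 0.001 * rng.standard_normal(sr)
loud = 0.1 * rng.standard_normal(sr)
quiet_tail = 0.001 * rng.standard_normal(sr)
y = np.concatenate([quiet_head, loud, quiet_tail])

rms = frame_rms(y)
threshold = estimate_energy_threshold(rms)
silent_frames = rms < threshold
print(f"threshold = {threshold:.4f}, silent frames = {silent_frames.sum()} of {len(rms)}")
```

On real recordings the percentile and margin would need tuning, and a fixed threshold such as 0.005 may work just as well when the recording conditions are known.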

Code example:

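import librosa
import numpy as np
import soundfile as sf

def remove_silence_energy(audio_file, output_file=None, frame_length=2048, hop_length=512, energy_threshold=0.005):
    """
    Energy-based silence removal (trims leading and trailing silence).

    Args:
        audio_file: path to the input audio file
        output_file: path to the output audio file (optional)
        frame_length: frame length in samples
        hop_length: hop length in samples
        energy_threshold: RMS threshold below which a frame counts as silent

    Returns:
        non_silent_audio: audio data with leading and trailing silence removed
    """

    # Load the audio (librosa.load resamples to 22050 Hz mono by default;
    # pass sr=None to keep the file's native sample rate)
    y, sr = librosa.load(audio_file)

    # Compute the RMS energy of each frame
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]

    # Silence detection
    silent_frames = rms < energy_threshold

    # Trim: keep everything from the first to the last non-silent frame
    non_silent_indices = np.where(~silent_frames)[0]
    if non_silent_indices.size == 0:
        # Every frame is below the threshold, so nothing is kept
        non_silent_audio = y[:0]
    else:
        start = non_silent_indices[0] * hop_length
        end = min(len(y), (non_silent_indices[-1] + 1) * hop_length)
        non_silent_audio = y[start:end]

    # Save the audio (optional)
    if output_file:
        sf.write(output_file, non_silent_audio, sr)

    return non_silent_audio

# Usage example
audio_file = "input.wav"
output_file = "output.wav"
non_silent_audio = remove_silence_energy(audio_file, output_file)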
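The function above keeps any silence that lies between the first and last non-silent frames. If interior silence should go as well, as the "remove the silent frames" step describes, one minimal sketch is to cut the signal into non-overlapping blocks and drop every block whose RMS is below the threshold. This is an illustration under stated assumptions rather than the author's method: `drop_silent_blocks` and `block_length` are hypothetical names, and non-overlapping blocks are used instead of librosa's overlapping frames so that each sample belongs to exactly one block.

```python
import numpy as np

def drop_silent_blocks(y, block_length=512, energy_threshold=0.005):
    """Remove every non-overlapping block whose RMS is below the threshold."""
    n_blocks = len(y) // block_length
    blocks = y[: n_blocks * block_length].reshape(n_blocks, block_length)
    rms = np.sqrt(np.mean(blocks.astype(np.float64) ** 2, axis=1))
    kept = blocks[rms >= energy_threshold].reshape(-1)
    tail = y[n_blocks * block_length :]  # incomplete final block is kept as-is
    return np.concatenate([kept, tail])

# Synthetic check: tone, digital silence, tone
sr = 16000
t = np.arange(1024) / sr
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
y = np.concatenate([tone, np.zeros(2048), tone])

out = drop_silent_blocks(y)
print(len(y), "->", len(out))
```

Joining blocks this way can produce audible clicks at the cut points, so a short crossfade or a minimum silence duration is often added in practice.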

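2. Zero-crossing-rate-based silence detection

The zero-crossing rate (ZCR) measures how often the signal crosses the zero axis, that is, how often adjacent samples change sign. Digitally silent segments (all or mostly zero samples) have a ZCR near zero. Low-level background noise, by contrast, can have a high ZCR, so this method works best when the silence in the recording is close to digital silence.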
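To make the behaviour of the zero-crossing rate concrete, the following sketch computes it for three synthetic signals: digital silence, faint white noise, and a 100 Hz tone. The helper `zero_crossing_rate` is written here for illustration and is not librosa's implementation; it treats a zero sample as non-negative.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as non-negative)."""
    non_negative = np.asarray(x) >= 0
    return np.count_nonzero(non_negative[1:] != non_negative[:-1]) / (len(x) - 1)

sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)

digital_silence = np.zeros(sr)
faint_noise = 0.001 * rng.standard_normal(sr)
tone_100hz = 0.5 * np.sin(2 * np.pi * 100 * t)

zcr_silence = zero_crossing_rate(digital_silence)
zcr_noise = zero_crossing_rate(faint_noise)
zcr_tone = zero_crossing_rate(tone_100hz)
print(zcr_silence, zcr_noise, zcr_tone)
```

Digital silence gives exactly zero, the tone gives a small value (about 200 crossings per second of audio), and the faint noise gives a value near 0.5 even though it is barely audible. This is why a ZCR threshold alone can mistake a noisy pause for non-silence, and why it is often paired with an energy check.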

Required libraries:

librosa: audio feature extraction and analysis
numpy: numerical computation

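Steps:

Load the audio file: use librosa.load().
Frame the signal: split the audio signal into short-time frames.
Compute the per-frame zero-crossing rate: use librosa.feature.zero_crossing_rate().
Set a threshold: choose a zero-crossing-rate threshold suited to the characteristics of the audio.
Detect silence: compare each frame's zero-crossing rate with the threshold; frames below it are marked as silent.
Remove silence: based on the silence marks, remove the silent frames from the signal or replace them with zeros. (The code example below only trims leading and trailing silence.)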

Code example:

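import librosa
import numpy as np

def remove_silence_zcr(audio_file, frame_length=2048, hop_length=512, zcr_threshold=0.01):
    """
    Zero-crossing-rate-based silence removal (trims leading and trailing silence).

    Args:
        audio_file: path to the input audio file
        frame_length: frame length in samples
        hop_length: hop length in samples
        zcr_threshold: zero-crossing-rate threshold below which a frame counts as silent

    Returns:
        non_silent_audio: audio data with leading and trailing silence removed
    """

    # Load the audio (resampled to 22050 Hz mono by default)
    y, sr = librosa.load(audio_file)

    # Compute the zero-crossing rate of each frame
    zcr = librosa.feature.zero_crossing_rate(y=y, frame_length=frame_length, hop_length=hop_length)[0]

    # Silence detection
    silent_frames = zcr < zcr_threshold

    # Trim: keep everything from the first to the last non-silent frame
    non_silent_indices = np.where(~silent_frames)[0]
    if non_silent_indices.size == 0:
        # Every frame is below the threshold, so nothing is kept
        return y[:0]
    start = non_silent_indices[0] * hop_length
    end = min(len(y), (non_silent_indices[-1] + 1) * hop_length)
    non_silent_audio = y[start:end]

    return non_silent_audio

# Usage example
audio_file = "input.wav"
non_silent_audio = remove_silence_zcr(audio_file)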

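A Worked Example

This example uses the zero-crossing-rate approach to cut off excess trailing silence: once the trailing silence reaches 300 ms, the excess part is removed.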

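import numpy as np

def trim_trailing_silence_zcr_int16(audio_data, sample_rate, zcr_threshold=0.02, frame_length_ms=20, hop_length_ms=10, silence_duration_ms=300):
    """
    Detect trailing silence in int16 audio data with the zero-crossing rate and cut off the excess.

    Args:
        audio_data: NumPy array of type int16 holding the audio samples
        sample_rate: sample rate in Hz
        zcr_threshold: zero-crossing-rate threshold
        frame_length_ms: frame length (milliseconds)
        hop_length_ms: hop length (milliseconds)
        silence_duration_ms: amount of trailing silence to keep (milliseconds); anything beyond it is cut

    Returns:
        trimmed_audio: audio data with the excess trailing silence removed (int16)
    """

    # Convert milliseconds to sample and frame counts
    frame_length = int(frame_length_ms * sample_rate / 1000)
    hop_length = int(hop_length_ms * sample_rate / 1000)
    silence_frames_threshold = int(silence_duration_ms / hop_length_ms)

    # Input shorter than one frame cannot be analysed; return it unchanged
    if len(audio_data) < frame_length:
        return audio_data.astype(np.int16)

    # Framing (using as_strided): row i is a read-only view of
    # audio_data[i * hop_length : i * hop_length + frame_length]
    num_frames = 1 + (len(audio_data) - frame_length) // hop_length
    frames = np.lib.stride_tricks.as_strided(
        audio_data,
        shape=(num_frames, frame_length),
        strides=(audio_data.strides[0] * hop_length, audio_data.strides[0]),
        writeable=False
    )

    # Compute the zero-crossing rate of each frame
    zcr = np.sum(np.abs(np.diff(np.sign(frames))), axis=1) / (2 * frame_length)

    # Count consecutive silent frames, scanning backward from the end
    trailing_silent_frames = 0
    for i in range(len(zcr) - 1, -1, -1):
        if zcr[i] < zcr_threshold:
            trailing_silent_frames += 1
        else:
            break  # first non-silent frame: the trailing silence ends here

    # Cut only if the trailing silence reaches the threshold; keep
    # silence_duration_ms of it and drop the rest
    trim_index = len(audio_data)  # default: keep everything
    if trailing_silent_frames >= silence_frames_threshold:
        first_silent_frame = len(zcr) - trailing_silent_frames
        trim_index = min(len(audio_data), (first_silent_frame + silence_frames_threshold) * hop_length)

    # Cut off the excess trailing silence
    trimmed_audio = audio_data[:trim_index]

    return trimmed_audio.astype(np.int16)

# Usage example
# Assume audio_data is your int16 audio data and sample_rate is its sample rate
# audio_data = ...
# sample_rate = ...
# Build a synthetic int16 test signal with silent and non-silent sections
sample_rate = 44100
silent_samples = int(0.5 * sample_rate) # 0.5 s of silence
non_silent_samples = int(0.5 * sample_rate) # 0.5 s of non-silence

audio_data = np.concatenate([
    np.random.randint(-10000, 10000, size=non_silent_samples, dtype=np.int16), # a non-silent section
    np.zeros(silent_samples, dtype=np.int16), # silence in the middle
    np.random.randint(-10000, 10000, size=non_silent_samples, dtype=np.int16), # another non-silent section
    np.zeros(int(0.8 * sample_rate), dtype=np.int16)  # 0.8 s of trailing silence
])

trimmed_audio = trim_trailing_silence_zcr_int16(audio_data, sample_rate)

# Print the result
print("Original audio length:", len(audio_data))
print("Audio length after trimming trailing silence:", len(trimmed_audio))

# The processed audio can be played or saved with soundfile or another library
# import soundfile as sf
# sf.write("trimmed_audio.wav", trimmed_audio, sample_rate)

Code walkthrough:

trim_trailing_silence_zcr_int16(...) function:
- zcr_threshold: zero-crossing-rate threshold; frames below it are treated as silent.
- frame_length_ms, hop_length_ms: frame length and hop length (milliseconds).
- silence_duration_ms: trailing-silence duration threshold (milliseconds); once the trailing silence reaches this duration, everything beyond it is cut off.
Framing: np.lib.stride_tricks.as_strided builds the overlapping frames as a view of the input, without copying the samples. Input shorter than one frame is returned unchanged.
Computing the zero-crossing rate:
- np.sign(frames): maps each sample in a frame to its sign (-1, 0, 1).
- np.diff(...): takes the difference between the signs of adjacent samples.
- np.abs(...): takes the absolute value of that difference; a direct crossing between -1 and 1 contributes 2.
- np.sum(..., axis=1): sums over each frame (each row), giving twice the number of sign changes in the frame.
- Dividing by (2 * frame_length): normalizes the sum to a zero-crossing rate.
Scanning backward for silence:
- trailing_silent_frames: counts the consecutive silent frames at the end of the audio.
- trim_index: the cut position, initialized to the end of the audio.
- The zcr array is traversed in reverse order.
- If the current frame's zero-crossing rate is below the threshold, trailing_silent_frames is incremented.
- If the current frame is not silent, the loop stops, so silence earlier in the audio is never counted as trailing silence.
- If trailing_silent_frames reaches silence_frames_threshold (the number of frames corresponding to 300 ms):
  - first_silent_frame is the index of the first frame of the trailing silence.
  - trim_index is set silence_frames_threshold hops after that frame, so 300 ms of silence is kept and the rest is cut.
  - min(len(audio_data), ...) ensures trim_index never exceeds the audio length.
Cutting off the trailing silence:
- trimmed_audio = audio_data[:trim_index]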
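The millisecond-to-sample conversion and the framing step can be checked in isolation. The sketch below reproduces the same frame layout with `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later), which derives the strides itself and is therefore harder to misuse than `as_strided`. The helper name `frame_signal` is hypothetical and not part of the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def frame_signal(x, sample_rate, frame_length_ms=20, hop_length_ms=10):
    """Split x into overlapping frames; returns (frames, frame_length, hop_length)."""
    frame_length = int(frame_length_ms * sample_rate / 1000)
    hop_length = int(hop_length_ms * sample_rate / 1000)
    if len(x) < frame_length:
        # Too short for a single frame
        return np.empty((0, frame_length), dtype=x.dtype), frame_length, hop_length
    # All windows of length frame_length, then every hop_length-th one
    frames = sliding_window_view(x, frame_length)[::hop_length]
    return frames, frame_length, hop_length

# One second of a simple repeating ramp at 44100 Hz
x = (np.arange(44100) % 100).astype(np.int16)
frames, frame_length, hop_length = frame_signal(x, 44100)

# Same frame count as the formula used with as_strided
expected_frames = 1 + (len(x) - frame_length) // hop_length
print(frames.shape, frame_length, hop_length, expected_frames)

# At 22050 Hz the 10 ms hop is 220.5 samples and is truncated to 220
_, frame_length_22k, hop_length_22k = frame_signal(x, 22050)
print(frame_length_22k, hop_length_22k)
```

Because the hop length is truncated to an integer number of samples, 30 hops at 22050 Hz cover slightly less than 300 ms, which is usually negligible but worth knowing when exact durations matter.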
