---
license: cc-by-4.0
datasets:
- panlr/teochew_wild
language:
- zh
pipeline_tag: automatic-speech-recognition
---

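### Model Overview
This model is a fine-tuned version of Whisper-medium that transcribes Teochew (潮州话, also called Chaoshan dialect, 潮汕话) speech into Teochew orthographic characters (正字) rather than translating it into Mandarin. The fine-tuning code comes from the [Whisper-Finetune](https://github.com/yeyupiaoling/Whisper-Finetune/tree/master) GitHub repository by yeyupiaoling (夜雨飘零).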

### Online Demo

[teochew_whisper](https://huggingface.co/spaces/panlr/teochew_whisper)

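### Fine-tuning Data
The fine-tuning data comes from [teochew-wild](https://huggingface.co/datasets/panlr/teochew_wild), the first open-source, in-the-wild, multi-speaker Teochew dataset with accurate orthographic annotation. It contains about 18.9 hours of audio in 12,500 clips and covers several accents, including Chaozhou prefectural city (潮州府城), Shantou urban area (汕头市区), southern Chao'an (潮安南部), Chenghai (澄海), and the Rongjiang accent (榕江音).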

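To reduce problems such as ambiguous written forms, too many polyphonic characters, and synonymous variant characters, the dataset is annotated with **[歹看正字法](https://github.com/p1an-lin-jung/teochew-g2p/blob/master/doc/readme.md)**, a self-devised orthography, rather than the commonly used homophone characters or the etymological characters (本字) established by expert research.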

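This is because homophone characters and expert schemes very easily lead to ambiguity. For example:
```
If 【个】 is used to mean 【的】, does 【有个人】 mean 【有一个人】 ("there is a person") or 【有的人】 ("some people")? This dataset therefore uses 【介】 in place of 【个】.
If 【只】 is used to mean 【这】 ("this"), then 【这只猫】 ("this cat") and 【这只车】 ("this car") would be written 【只只猫】 and 【只只车】, which looks very odd. This dataset therefore uses the traditional variant character 【祇】 for "this"; all other usages follow Mandarin.
```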


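### Evaluation Results
The 12,500 clips were randomly split into training, validation, and test sets of 11,000, 700, and 700 clips, respectively. The model was fine-tuned for about 10 epochs on an RTX 3090 and evaluated by character error rate (CER), with the results below.
(In the paper's experiments, some homophonous characters in the labels were unified, such as 【仔】 and 【囝】, or 【二】 and 【两】, so the paper reports better results.)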

| Subset           | CER (%)                     |
|------------------|-----------------------------|
| Validation set   |       12.865                |
| Test set         |       12.254                |
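
For readers unfamiliar with the metric, here is a minimal sketch of how a corpus-level CER can be computed: the total character-level edit distance divided by the total number of reference characters. It is not the evaluation script used for the table above. The function names `edit_distance` and `cer` are illustrative, and pooling errors over the whole set (rather than averaging per clip) is an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters yields the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER in percent (assumption: errors are pooled over all clips)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars


# One substituted character out of three reference characters.
print(f"{cer(['有介人'], ['有个人']):.2f}")  # → 33.33
```

Because every character counts equally, writing 【个】 where the label has 【介】 is a full error, which may be why unifying label variants changes the score.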


### Get Started

`example.wav`, `inference.py`, and `requirements.txt` are in the `infer_example` directory.

Install the dependencies:
```
pip install -r requirements.txt
```


#### Quick start:
```
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Create the recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="panlr/whisper-finetune-teochew",
    device=device
)
# Transcribe the audio
result = pipe("example.wav")
print(result["text"])
```


#### Standard loading and inference:
```
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Select the device and dtype
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "panlr/whisper-finetune-teochew",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
model.to(device)

# Load the processor
processor = AutoProcessor.from_pretrained("panlr/whisper-finetune-teochew")

# Create the inference pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Run recognition
result = pipe("example.wav")
print(result["text"])
```


#### Command-line inference script

Usage examples:
```
    python inference.py audio.wav
    python inference.py audio.wav --model panlr/whisper-finetune-teochew
    python inference.py audio.wav --cpu
    python inference.py audio1.wav audio2.wav audio3.wav --output result.txt
```
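
For orientation, here is a rough sketch of how a script accepting the flags above could be structured. It is not the actual `inference.py` from the `infer_example` directory. The function names `build_parser` and `transcribe`, the default values, and the tab-separated output format are all assumptions made for illustration.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Transcribe Teochew audio files.")
    parser.add_argument("audio", nargs="+", help="one or more audio files")
    parser.add_argument("--model", default="panlr/whisper-finetune-teochew",
                        help="model ID or local path (default is an assumption)")
    parser.add_argument("--cpu", action="store_true", help="force CPU inference")
    parser.add_argument("--output", default=None,
                        help="optional text file to write the results to")
    return parser


def transcribe(args: argparse.Namespace) -> List[str]:
    """Transcribe every file in args.audio and optionally save the results."""
    # Heavy imports are deferred so that argument parsing works without them.
    import torch
    from transformers import pipeline

    use_gpu = torch.cuda.is_available() and not args.cpu
    pipe = pipeline(
        "automatic-speech-recognition",
        model=args.model,
        device="cuda:0" if use_gpu else "cpu",
    )
    # Hypothetical output format: "<file path><TAB><transcript>" per line.
    lines = [f"{path}\t{pipe(path)['text']}" for path in args.audio]
    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    return lines
```

A script built this way would end with something like `print("\n".join(transcribe(build_parser().parse_args())))`. The pipeline is created once and reused for all files, so the model is loaded only one time.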