Open-source LLMs for Audio Understanding - British Accents

Hello

I’m working on a dataset that includes audio, transcripts, gender, and accent labels from 120 volunteers with different British accents.
First, I want to use ASR to check the models’ capabilities.
Then I plan to use audio classification models for accent detection, and also to fine-tune models.

But I’m also looking for directions with some focus on LLMs. If there is a benchmark of open-source LLMs for audio understanding, I would appreciate a pointer to it, and I would also welcome any other suggestions for contributions based on this dataset.

Thanks!

If you plan to use the dataset for multiple purposes, such as ASR, TTS, or variations like multimodal LLMs, creating rich metadata (labels) during dataset creation will give you more options later regarding which model to train and for what purpose. (Rather than the dataset having a fixed structure, you select exactly which data to use, and how, via settings or scripts immediately before fine-tuning; the more clues the dataset provides, the more versatile it becomes.)

It’s a trade-off with the effort required for labeling, though…


Core recommendation

Your dataset should be framed as more than an ASR dataset. The strongest contribution is:

A British Accent Audio Understanding Benchmark for open ASR models, speech encoders, accent classifiers, and audio-LLMs.

That framing lets you evaluate five things separately:

Capability | Core question | Model types
ASR | What words were spoken? | Whisper, Qwen3-ASR, Parakeet, Canary, Voxtral transcription mode
Accent ID | Which British accent group is this? | WavLM, XLS-R, ECAPA-TDNN, Whisper encoder + classifier
Audio-LLM understanding | Can the model listen and answer structured questions about transcript, accent, pronunciation, and uncertainty? | Voxtral, Qwen2.5-Omni, Phi-4-multimodal, Kimi-Audio, Ultravox
Pronunciation probing | Can the model hear British-specific phonetic features? | Audio-LLMs, speech encoders, probe classifiers
Fairness / robustness | Which accent/gender groups are underserved? | Any ASR or audio model

Your dataset is valuable because it combines audio + transcript + accent + gender. That combination supports transcription, classification, instruction-following audio tasks, and group-level fairness analysis in one benchmark.

A closely related British Isles accent corpus contains 31+ hours from 120 volunteers self-identifying as speakers of Southern England, Midlands, Northern England, Welsh, Scottish, and Irish English varieties, showing that this scale is credible for British-accent speech research. (ACL Anthology)


1. Why this is timely

English ASR is not the same as British-accent ASR

General English ASR benchmarks often underrepresent accent diversity. EdAcc was created to better represent global English variation and contains almost 40 hours of dyadic English conversations from speakers with diverse accents and linguistic backgrounds. (Edinburgh DataShare)

So your research question should not be:

Can the model transcribe English?

It should be:

Can open ASR and audio-LLM systems handle British regional accents fairly, robustly, and transparently?

Accent errors can be linguistically meaningful

A Newcastle English ASR error-analysis paper links ASR errors to regional dialectal features, including phonological, lexical, and morphosyntactic variation. That supports going beyond WER into error categories such as vowel differences, dialect words, local pronouns, and place-name recognition. (ISCA Archive)

Useful error categories for your work:

Error type | Example
Phonological | TRAP–BATH, rhoticity, glottal /t/
Lexical | “nowt” → “not” or “nothing”
Morphosyntactic | “yous” → “you”
Named entity | local town/station name mistranscribed
Over-normalization | dialect form converted to standard English
Hallucination | extra phrase inserted after silence
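
Several of these categories can be scored automatically. As a minimal sketch, dialect-word recall can be computed by checking whether listed dialect words in the reference survive into the hypothesis; the word list below is an illustrative placeholder, not a curated lexicon:

# Sketch: dialect-word recall over (reference, hypothesis) pairs.
DIALECT_WORDS = {"nowt", "owt", "aye", "wee", "yous", "ken"}

def dialect_word_recall(refs, hyps):
    hits, total = 0, 0
    for ref, hyp in zip(refs, hyps):
        hyp_tokens = set(hyp.lower().split())
        for token in ref.lower().split():
            if token in DIALECT_WORDS:
                total += 1
                hits += int(token in hyp_tokens)
    return hits / total if total else None  # None if no dialect words occurred

refs = ["aye i have got nowt to do with it"]
hyps = ["i have got not to do with it"]
print(dialect_word_recall(refs, hyps))  # 0.0: both "aye" and "nowt" were lost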

2. Existing benchmarks to connect to

There are relevant open audio/speech benchmarks, but none fully covers British regional accent understanding. That gap is your opportunity.

Benchmark / resource | What it gives you | How to use it
AudioBench | AudioLLM benchmark with 8 tasks and 26 datasets across speech, audio scenes, and paralinguistic/voice understanding. (arXiv) | Use as the main precedent for audio-LLM evaluation beyond ASR.
AIR-Bench | Large audio-language model benchmark with 19 tasks / ~19k multiple-choice questions plus ~2k open-ended QA examples. (ACL Anthology) | Use for multiple-choice accent and pronunciation probes.
Dynamic-SUPERB Phase 2 | Instruction-based speech/audio benchmark expanded to 180 tasks. (arXiv) | Use as a template for “listen and classify / transcribe / answer” tasks.
Open ASR Leaderboard | Reproducible ASR benchmark comparing 60+ systems across 11 datasets, with WER and RTFx. (Hugging Face) | Use for ASR-style reporting: WER + speed + standardized normalization.
CommonAccent | Accent-classification recipe using ECAPA-TDNN and Wav2Vec2/XLS-R on Common Voice. (arXiv) | Use for accent-ID baselines and classifier design.
ASR-FAIRBENCH | Fairness-aware ASR benchmark combining WER with a fairness score. (arXiv) | Use for accent/gender fairness framing.
Vox-Profile | Speaker/speech trait benchmark including accent, sex, age, voice quality, emotion, and speech flow. (arXiv) | Use as precedent for treating accent/gender as speech traits, with careful ethics.

3. Recommended benchmark tracks

Track 1 — ASR robustness

Question: How well do open ASR models transcribe different British accents?

Evaluate:

audio → transcript

Use metrics beyond average WER:

Metric | Why
Overall WER / CER | Basic transcription accuracy
WER by accent | Core robustness measure
WER by gender | Broad gender-related gap
WER by accent × gender | Intersectional gap
Macro-accent WER | Treats accents equally
Worst-accent WER | Finds most underserved group
Accent WER gap | Worst minus best
Substitution / deletion / insertion rates | Separates wrong words, missed words, hallucinations
Dialect-word recall | Tests words like “nowt,” “aye,” “wee,” etc.
Place-name recall | Tests local named entities
RTFx / latency | Practical deployability

This track answers:

Which open models transcribe British-accented speech best, and which accent groups remain hard?
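
A minimal sketch of the group metrics, assuming jiwer for WER and per-clip records with illustrative field names; the same grouping pattern extends to gender and accent × gender slices:

# Sketch: per-accent, macro-accent, and worst-accent WER plus the accent gap.
# Requires jiwer (pip install jiwer); the records here are illustrative.
from collections import defaultdict
import jiwer

records = [
    {"accent": "Scottish", "ref": "aye i ken the place", "hyp": "i can the place"},
    {"accent": "Welsh", "ref": "down to the station", "hyp": "down to the station"},
]

groups = defaultdict(lambda: ([], []))
for r in records:
    groups[r["accent"]][0].append(r["ref"])
    groups[r["accent"]][1].append(r["hyp"])

# jiwer.wer accepts lists and pools errors across all pairs in a group.
wer_by_accent = {a: jiwer.wer(refs, hyps) for a, (refs, hyps) in groups.items()}
macro_accent_wer = sum(wer_by_accent.values()) / len(wer_by_accent)
worst_accent_wer = max(wer_by_accent.values())
accent_wer_gap = worst_accent_wer - min(wer_by_accent.values())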


Track 2 — Accent identification

Question: Can models identify British regional accent groups from raw audio?

Evaluate:

audio → accent_label

Use broad labels for the official benchmark:

Southern England
Midlands
Northern England
Welsh
Scottish
Irish
Northern Irish
uncertain / other

Keep fine labels as metadata:

accent_label_broad = "Northern England"
accent_label_fine = "Newcastle / Tyneside"
accent_self_reported = "Geordie"

Recommended baseline ladder:

Level | Model | Why
0 | Majority-class baseline | Sanity check
1 | MFCC + logistic regression / SVM | Classical baseline
2 | ECAPA-TDNN embeddings + classifier | Strong speaker/acoustic embedding baseline
3 | WavLM embeddings + classifier | Strong speech representation baseline
4 | XLS-R / Wav2Vec2 embeddings + classifier | Cross-lingual speech representation baseline
5 | Whisper encoder embeddings + classifier | Tests whether ASR encoders preserve accent cues
6 | Fine-tuned WavLM / XLS-R | Strong supervised classifier
7 | Audio-LLM prompted classifier | Tests instruction-following without a classifier head

CommonAccent provides a direct precedent for ECAPA-TDNN and Wav2Vec2/XLS-R accent-classification recipes. (arXiv)
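
A sketch of the embedding-plus-classifier levels, assuming the microsoft/wavlm-base-plus checkpoint and 16 kHz mono clips; the dummy clips and labels below are placeholders for your speaker-disjoint train split:

# Sketch: mean-pooled WavLM embeddings + logistic regression (Level 3).
# Swap in XLS-R or a Whisper encoder for Levels 4-5.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

def embed(waveform_16k):
    """One fixed-size vector per clip: mean-pool hidden states over time."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, hidden_dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Placeholder data; replace with real clips and labels from the train split.
train_wavs = [np.random.randn(16000).astype(np.float32) for _ in range(4)]
train_labels = ["Scottish", "Welsh", "Scottish", "Welsh"]

X_train = np.stack([embed(w) for w in train_wavs])
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)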

Report these metrics:

accuracy
macro-F1
balanced accuracy
per-accent recall
top-2 accuracy
confusion matrix
confidence calibration

Important: official results must use speaker-disjoint splits. Random clip splits can leak speaker identity and inflate accent-classification scores.


Track 3 — Accent-aware ASR

Question: Can an audio-LLM transcribe speech and identify the accent in one structured response?

Evaluate:

audio + instruction → transcript + accent_label + confidence + evidence

Example output:

{
  "transcript": "I went down to the station this morning.",
  "accent_label": "Northern England",
  "confidence": 0.72,
  "evidence": [
    "short BATH vowel",
    "non-rhotic pronunciation",
    "regional vowel quality"
  ]
}

Suggested prompt:

Listen to the audio and return only valid JSON.

Use this schema:
{
  "transcript": string,
  "accent_label": one of [
    "Southern England",
    "Midlands",
    "Northern England",
    "Welsh",
    "Scottish",
    "Irish",
    "Northern Irish",
    "uncertain"
  ],
  "confidence": number from 0 to 1,
  "evidence": list of at most 3 short strings
}

If the accent is unclear, use "uncertain".
Do not invent labels.

Metrics:

Metric | Why
Transcript WER | ASR quality
Accent accuracy | Accent-label correctness
Joint score | Transcript acceptable + accent correct
Valid JSON rate | Instruction-following reliability
Allowed-label rate | Whether the model obeys the label set
Confidence calibration | Whether confidence matches correctness
Evidence quality | Whether explanations are plausible
Uncertainty behavior | Whether “uncertain” is used appropriately
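
A minimal sketch of the valid-JSON and allowed-label checks, applied to raw model output strings (the example responses are illustrative):

# Sketch: scoring instruction-following reliability for Track 3 outputs.
import json

ALLOWED_LABELS = {
    "Southern England", "Midlands", "Northern England", "Welsh",
    "Scottish", "Irish", "Northern Irish", "uncertain",
}

def check_response(raw):
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "allowed_label": False}
    return {"valid_json": True,
            "allowed_label": obj.get("accent_label") in ALLOWED_LABELS}

responses = [
    '{"transcript": "hello", "accent_label": "Scottish", "confidence": 0.8, "evidence": []}',
    "Sorry, I cannot produce JSON.",
]
flags = [check_response(r) for r in responses]
valid_json_rate = sum(f["valid_json"] for f in flags) / len(flags)        # 0.5
allowed_label_rate = sum(f["allowed_label"] for f in flags) / len(flags)  # 0.5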

This is probably the cleanest LLM-focused task in your project.


Track 4 — Pronunciation probes

Question: Can models hear British-English pronunciation features?

ASR alone cannot capture this. Two speakers may both say “bath”, but one may use a short TRAP vowel and another a long BATH/PALM vowel. The transcript is identical; the pronunciation is not.

Useful probes:

Feature | Example words | Task
TRAP–BATH split | bath, path, grass, dance | short TRAP vs long BATH/PALM
Rhoticity | car, park, hard, farm | /r/ pronounced or not
Glottal /t/ | water, bottle, little | glottalized /t/ or not
STRUT vowel | cup, luck, strut | northern/southern-style vowel quality
Dialect words | nowt, owt, aye, wee | which word was spoken
Local pronouns | yous, wor | which form was spoken
Place names | local towns/stations | which place name was spoken

Example multiple-choice tasks:

In the word "bath", does the speaker use:
A. short TRAP vowel
B. long BATH/PALM vowel
C. unclear
In the word "car", is the /r/ pronounced?
A. yes
B. no
C. unclear
Which word was spoken?
A. nothing
B. nowt
C. not
D. unclear
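
One way to store these probes is one JSONL record per question; a hypothetical record, shown pretty-printed here (all field names are illustrative, and JSONL keeps each record on a single line):

{
  "probe_id": "bath_001_trap_bath",
  "audio_id": "clip_0412",
  "target_feature": "TRAP_BATH",
  "question": "In the word 'bath', does the speaker use:",
  "choices": ["short TRAP vowel", "long BATH/PALM vowel", "unclear"],
  "answer_index": 1
}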

Metrics:

overall probe accuracy
per-feature accuracy
per-accent accuracy
abstention rate
human agreement

This track is where your dataset becomes more than an ASR benchmark. It becomes a British accent understanding benchmark.


Track 5 — Fairness and group robustness

Question: Which accent/gender groups are underserved?

Use gender mainly as evaluation metadata, not necessarily as a prediction target.

Recommended metrics:

Metric | Meaning
Overall WER | Average transcription quality
Macro-accent WER | Treats accents equally
Worst-accent WER | Most underserved accent
Best-accent WER | Easiest accent
Accent WER gap | Worst minus best
Accent WER ratio | Worst divided by best
Gender WER gap | Difference by gender
Accent × gender WER gap | Intersectional difference
Worst-accent recall | Accent classifier’s weakest group
Calibration by group | Whether confidence is equally reliable

This matters because average WER can hide harm:

Model | Overall WER | Worst-accent WER | Accent gap
Model A | 7.8 | 18.5 | 13.2
Model B | 8.4 | 13.1 | 7.4

Model A looks better on average; Model B may be more inclusive.

ASR-FAIRBENCH gives a useful precedent for combining accuracy and equity in ASR evaluation. (arXiv)


4. The key LLM experiment: audio-only vs transcript-only

This is the experiment I would emphasize most.

Core question

Do audio-LLMs actually hear accent cues, or do they infer accent from words in the transcript?

Run four conditions:

Condition | Input | What it tests
Audio only | audio | Can the model hear accent directly?
Gold transcript only | human transcript | Can text alone reveal accent from dialect words?
ASR transcript only | model transcript | How strong is a standard ASR → LLM pipeline?
Audio + transcript | both | Does multimodal input help?

Why this matters:

A text-only LLM may guess Scottish from words like:

wee
aye
ken

or Northern England from:

nowt
owt
yous

But that does not prove it heard the accent. The audio-only condition tests whether the model uses acoustic information. This ablation would make your LLM contribution much stronger.
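
A sketch of generating the four conditions from a single evaluation item, reusing the Track 3 prompt; the item fields are placeholders:

# Sketch: building the four ablation conditions for one item.
def build_conditions(item, instruction):
    """Return model inputs for the four ablation conditions of one item."""
    def with_transcript(t):
        return instruction + "\nTranscript: " + t
    return {
        "audio_only": {"audio": item["audio"], "text": instruction},
        "gold_transcript_only": {"audio": None, "text": with_transcript(item["transcript_raw"])},
        "asr_transcript_only": {"audio": None, "text": with_transcript(item["asr_transcript"])},
        "audio_plus_transcript": {"audio": item["audio"], "text": with_transcript(item["transcript_raw"])},
    }

item = {
    "audio": "clip_0412.wav",
    "transcript_raw": "Aye, I've got nowt to do with it.",
    "asr_transcript": "I have got not to do with it.",
}
conditions = build_conditions(item, "Listen to the audio and return only valid JSON.")

Scoring all four conditions with the same Track 3 metrics isolates how much of the accent prediction comes from acoustics rather than lexical cues.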


5. Model families worth testing

ASR models

Model family | Why include
Qwen3-ASR | Current open ASR family; Qwen3-ASR supports language identification and ASR for 52 languages and dialects. (Hugging Face)
Whisper / Whisper turbo | Standard ASR baseline family.
Distil-Whisper | Efficient English ASR baseline.
NVIDIA Parakeet / Canary | Non-Whisper ASR baselines; useful for speed/accuracy diversity.
Voxtral transcription mode | Audio-LLM with dedicated transcription mode and long-context audio support. (Hugging Face)

Audio-LLMs

Model | Why include
Voxtral Mini / Small | Audio model with transcription, Q&A, summarization, and long-context audio support. (Hugging Face)
Qwen2.5-Omni | General multimodal audio/video/text model.
Phi-4-multimodal-instruct | Compact multimodal model with audio input.
Kimi-Audio-7B-Instruct | Open audio foundation model for understanding, generation, and conversation.
Ultravox | Direct speech-to-LLM style model.

Accent classifiers / encoders

Model type | Why include
MFCC + SVM/logistic regression | Simple baseline
ECAPA-TDNN embeddings | Strong speaker/acoustic embedding baseline
WavLM embeddings | Strong speech representation baseline
XLS-R / Wav2Vec2 embeddings | Cross-lingual speech representation baseline
Whisper encoder embeddings | Tests whether ASR features contain accent information
Fine-tuned WavLM / XLS-R | Likely strong supervised accent classifiers

WavLM is a good classification base because its model card describes it as an English pretrained speech model intended for downstream tasks including speech recognition and audio classification. (Hugging Face)


6. Dataset design essentials

Use speaker-disjoint splits

Bad:

Speaker 001 clip 1 → train
Speaker 001 clip 2 → test

Good:

Speaker 001 → train only
Speaker 024 → validation only
Speaker 047 → test only

This is non-negotiable for accent classification. Otherwise the model may learn speaker identity, microphone, room acoustics, or session artifacts.
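
A minimal sketch using scikit-learn's GroupShuffleSplit, assuming one row per clip and a parallel list of speaker IDs (both lists here are illustrative):

# Sketch: speaker-disjoint train/test indices; no speaker crosses the split.
from sklearn.model_selection import GroupShuffleSplit

clips = ["clip_0001.wav", "clip_0002.wav", "clip_0003.wav", "clip_0004.wav"]
speaker_ids = ["spk_001", "spk_001", "spk_024", "spk_047"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(clips, groups=speaker_ids))
# Every clip from a given speaker lands entirely in train or entirely in test.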

Keep raw and normalized transcripts

Example:

Raw:        "Aye, I've got nowt to do with it."
Normalized: "aye ive got nowt to do with it"

Do not normalize dialect words into standard English:

nowt → nothing
aye → yes
wee → small

That would erase the phenomenon you want to study.
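
A sketch of a normalizer that lowercases and strips punctuation for WER scoring while leaving dialect vocabulary untouched; it reproduces the example above:

# Sketch: WER normalization that deliberately preserves dialect words.
import re

def normalize_for_wer(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation and apostrophes
    return re.sub(r"\s+", " ", text).strip()

assert normalize_for_wer("Aye, I've got nowt to do with it.") == \
    "aye ive got nowt to do with it"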

Store broad + fine accent labels

accent_label_broad = "Northern England"
accent_label_fine = "Newcastle / Tyneside"
accent_self_reported = "Geordie"
accent_label_source = "self-reported"

Add prompt-level metadata

Field | Example
prompt_id | bath_001
target_words | bath,path,grass
target_feature | TRAP_BATH
contains_dialect_word | true
contains_place_name | true

Release in a no-script Hugging Face format

Use Parquet or AudioFolder with metadata CSV/JSONL. Hugging Face’s AudioFolder is designed to load audio datasets with thousands of files without requiring custom code. Dataset cards are also recommended to document contents, creation process, use context, and potential biases. (Hugging Face)

Suggested fields:

Field | Purpose
audio_id | Unique clip ID
speaker_id_hash | Pseudonymous speaker ID
split | Train/validation/test
audio | Audio file or audio column
duration_sec | Clip duration
transcript_raw | Original transcript
transcript_normalized | Normalized transcript used for WER scoring
accent_label_broad | Main benchmark label
accent_label_fine | Optional detailed label
accent_self_reported | Original self-description
accent_label_source | Label provenance
gender_self_described | Fairness slicing, if consented
target_feature | Pronunciation feature
recording_condition | Clean/noisy/phone/unknown
consent_scope | Usage permission summary
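
A sketch of an AudioFolder-style layout and loading call; the paths and the exact metadata columns are illustrative:

# Sketch: AudioFolder layout with per-clip metadata.
#
#   data/
#     train/
#       clip_0001.wav
#       clip_0002.wav
#       metadata.csv   # must have a file_name column plus your metadata fields
#
from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="data")
row = ds["train"][0]
print(row["audio"]["sampling_rate"], row["accent_label_broad"])  # assuming that column exists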

7. Suggested result tables

ASR summary

Model | Overall WER | Macro-accent WER | Worst-accent WER | Accent gap | RTFx
Qwen3-ASR-1.7B | | | | |
Parakeet | | | | |
Canary-Qwen | | | | |
Whisper large-v3 | | | | |
Whisper large-v3-turbo | | | | |
Distil-Whisper | | | | |
Voxtral Mini | | | | |

Accent classification

Model | Accuracy | Macro-F1 | Balanced accuracy | Worst-accent recall | Top-2 accuracy
Majority baseline | | | | |
MFCC + SVM | | | | |
ECAPA + classifier | | | | |
WavLM + classifier | | | | |
XLS-R + classifier | | | | |
Whisper encoder + classifier | | | | |
Audio-LLM prompt | | | | |

Audio-LLM ablation

Model | Audio only | Gold transcript only | ASR transcript only | Audio + transcript
Voxtral Mini | | | |
Qwen2.5-Omni | | | |
Phi-4-multimodal | | | |
Kimi-Audio | | | |

This last table is the most important one for your LLM-focused contribution.


8. Contribution statements you can use

Dataset contribution

We release a speaker-disjoint British regional accent speech dataset with audio, transcripts, gender metadata, accent labels, raw transcripts, normalized transcripts, and benchmark splits.

Benchmark contribution

We introduce a benchmark for British accent audio understanding, covering ASR robustness, accent identification, accent-aware ASR, pronunciation probes, and accent × gender fairness.

Audio-LLM contribution

We test whether audio-LLMs identify accents from acoustic cues or infer them from transcript/dialect words by comparing audio-only, gold-transcript-only, ASR-transcript-only, and audio+transcript settings.

Fairness contribution

We show that average WER is insufficient for British-accent evaluation and report macro-accent WER, worst-accent WER, accent gap, and accent × gender gaps.

Linguistic contribution

We connect model errors to British English pronunciation and dialect features, including TRAP–BATH variation, rhoticity, glottal /t/, dialect words, local pronouns, and place names.


9. Good title options

Title | Emphasis
British Accent Audio Understanding Benchmark | General benchmark framing
Benchmarking Open ASR Models and Audio-LLMs on British Regional Accents | Model evaluation
British Regional Accent Robustness for ASR and Audio-Language Models | Robustness/fairness
Accent-Aware Speech Recognition and Audio Understanding for British English | ASR + audio-LLM
Do Audio-LLMs Hear British Accents? | LLM-focused
Beyond WER: Evaluating British Accent Understanding in Open Speech Models | Pronunciation/fairness

Final compact recommendation

The strongest version of your project is:

A speaker-disjoint British Accent Audio Understanding Benchmark with ASR, accent ID, accent-aware ASR, pronunciation probes, audio-only vs transcript-only LLM ablations, and accent × gender fairness metrics.

This is stronger than a simple ASR dataset because it asks:

  1. Can the model transcribe?
  2. Can the model detect accent?
  3. Can the model identify pronunciation features?
  4. Can the model reason over audio with structured prompts?
  5. Can the model perform fairly across accent and gender groups?
  6. Can audio-LLMs really use audio, or are they relying on transcript clues?

Short summary

  • Do ASR, but do not stop at ASR.
  • Use audio classifiers for accent ID.
  • Use audio-LLMs for structured accent-aware tasks.
  • Add pronunciation probes to make the dataset linguistically meaningful.
  • Use speaker-disjoint splits.
  • Report per-accent, per-gender, and accent × gender metrics.
  • Keep raw + normalized transcripts.
  • Release in Parquet or AudioFolder with a strong dataset card.
  • Delay full audio-LLM fine-tuning until zero-shot results show a clear failure mode.