Open-source LLMs for Audio Understanding - British Accents

Hello

I’m working on a dataset that includes audio, transcripts, gender, and accent labels from 120 volunteers with different British accents.
First, I want to use ASR to check the models’ capabilities.
Then I plan to use audio classification models for accent detection, and also to fine-tune models.

But I’m also looking for directions with some focus on LLMs. If there is a benchmark of open-source LLMs for audio understanding, I would appreciate a pointer to it, and I would also welcome any other suggestions for contributions based on this dataset.

Thanks!

If you plan to use the dataset for multiple purposes, such as ASR, TTS, or variations like multimodal LLMs, creating rich metadata (labels) during dataset creation will give you more options later regarding which model to train and for what purpose. (Rather than the dataset having a fixed structure, you select exactly which data to use, and how, via settings or scripts immediately before fine-tuning; the more clues the dataset provides, the more versatile it becomes.)

It’s a trade-off with the effort required for labeling, though…


Core recommendation

Your dataset should be framed as more than an ASR dataset. The strongest contribution is:

A British Accent Audio Understanding Benchmark for open ASR models, speech encoders, accent classifiers, and audio-LLMs.

That framing lets you evaluate five things separately:

Capability | Core question | Model types
ASR | What words were spoken? | Whisper, Qwen3-ASR, Parakeet, Canary, Voxtral transcription mode
Accent ID | Which British accent group is this? | WavLM, XLS-R, ECAPA-TDNN, Whisper encoder + classifier
Audio-LLM understanding | Can the model listen and answer structured questions about transcript, accent, pronunciation, and uncertainty? | Voxtral, Qwen2.5-Omni, Phi-4-multimodal, Kimi-Audio, Ultravox
Pronunciation probing | Can the model hear British-specific phonetic features? | Audio-LLMs, speech encoders, probe classifiers
Fairness / robustness | Which accent/gender groups are underserved? | Any ASR or audio model

Your dataset is valuable because it combines audio + transcript + accent + gender. That combination supports transcription, classification, instruction-following audio tasks, and group-level fairness analysis in one benchmark.

A closely related British Isles accent corpus contains 31+ hours from 120 volunteers self-identifying as speakers of Southern England, Midlands, Northern England, Welsh, Scottish, and Irish English varieties, showing that this scale is credible for British-accent speech research. (ACL Anthology)


1. Why this is timely

English ASR is not the same as British-accent ASR

General English ASR benchmarks often underrepresent accent diversity. EdAcc was created to better represent global English variation and contains almost 40 hours of dyadic English conversations from speakers with diverse accents and linguistic backgrounds. (Edinburgh DataShare)

So your research question should not be:

Can the model transcribe English?

It should be:

Can open ASR and audio-LLM systems handle British regional accents fairly, robustly, and transparently?

Accent errors can be linguistically meaningful

A Newcastle English ASR error-analysis paper links ASR errors to regional dialectal features, including phonological, lexical, and morphosyntactic variation. That supports going beyond WER into error categories such as vowel differences, dialect words, local pronouns, and place-name recognition. (ISCA Archive)

Useful error categories for your work:

Error type | Example
Phonological | TRAP–BATH, rhoticity, glottal /t/
Lexical | “nowt” → “not” or “nothing”
Morphosyntactic | “yous” → “you”
Named entity | local town/station name mistranscribed
Over-normalization | dialect form converted to standard English
Hallucination | extra phrase inserted after silence
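
Several of these categories can be scored automatically. As a minimal sketch, dialect-word recall can be computed by checking whether listed dialect words in the reference survive into the hypothesis; the word list below is an illustrative placeholder, not a curated lexicon:

# Sketch: dialect-word recall over (reference, hypothesis) pairs.
DIALECT_WORDS = {"nowt", "owt", "aye", "wee", "yous", "ken"}

def dialect_word_recall(refs, hyps):
    hits, total = 0, 0
    for ref, hyp in zip(refs, hyps):
        hyp_tokens = set(hyp.lower().split())
        for token in ref.lower().split():
            if token in DIALECT_WORDS:
                total += 1
                hits += int(token in hyp_tokens)
    return hits / total if total else None  # None if no dialect words occurred

refs = ["aye i have got nowt to do with it"]
hyps = ["i have got not to do with it"]
print(dialect_word_recall(refs, hyps))  # 0.0: both "aye" and "nowt" were lost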

2. Existing benchmarks to connect to

There are relevant open audio/speech benchmarks, but none fully covers British regional accent understanding. That gap is your opportunity.

Benchmark / resource | What it gives you | How to use it
AudioBench | AudioLLM benchmark with 8 tasks and 26 datasets across speech, audio scenes, and paralinguistic/voice understanding. (arXiv) | Use as the main precedent for audio-LLM evaluation beyond ASR.
AIR-Bench | Large audio-language model benchmark with 19 tasks / ~19k multiple-choice questions plus ~2k open-ended QA examples. (ACL Anthology) | Use for multiple-choice accent and pronunciation probes.
Dynamic-SUPERB Phase 2 | Instruction-based speech/audio benchmark expanded to 180 tasks. (arXiv) | Use as a template for “listen and classify / transcribe / answer” tasks.
Open ASR Leaderboard | Reproducible ASR benchmark comparing 60+ systems across 11 datasets, with WER and RTFx. (Hugging Face) | Use for ASR-style reporting: WER + speed + standardized normalization.
CommonAccent | Accent-classification recipe using ECAPA-TDNN and Wav2Vec2/XLS-R on Common Voice. (arXiv) | Use for accent-ID baselines and classifier design.
ASR-FAIRBENCH | Fairness-aware ASR benchmark combining WER with a fairness score. (arXiv) | Use for accent/gender fairness framing.
Vox-Profile | Speaker/speech trait benchmark including accent, sex, age, voice quality, emotion, and speech flow. (arXiv) | Use as precedent for treating accent/gender as speech traits, with careful ethics.

3. Recommended benchmark tracks

Track 1 — ASR robustness

Question: How well do open ASR models transcribe different British accents?

Evaluate:

audio → transcript

Use metrics beyond average WER:

Metric | Why
Overall WER / CER | Basic transcription accuracy
WER by accent | Core robustness measure
WER by gender | Broad gender-related gap
WER by accent × gender | Intersectional gap
Macro-accent WER | Treats accents equally
Worst-accent WER | Finds most underserved group
Accent WER gap | Worst minus best
Substitution / deletion / insertion rates | Separates wrong words, missed words, hallucinations
Dialect-word recall | Tests words like “nowt,” “aye,” “wee,” etc.
Place-name recall | Tests local named entities
RTFx / latency | Practical deployability

This track answers:

Which open models transcribe British-accented speech best, and which accent groups remain hard?
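
A minimal sketch of the group metrics, assuming jiwer for WER and per-clip records with illustrative field names; the same grouping pattern extends to gender and accent × gender slices:

# Sketch: per-accent, macro-accent, and worst-accent WER plus the accent gap.
# Requires jiwer (pip install jiwer); the records here are illustrative.
from collections import defaultdict
import jiwer

records = [
    {"accent": "Scottish", "ref": "aye i ken the place", "hyp": "i can the place"},
    {"accent": "Welsh", "ref": "down to the station", "hyp": "down to the station"},
]

groups = defaultdict(lambda: ([], []))
for r in records:
    groups[r["accent"]][0].append(r["ref"])
    groups[r["accent"]][1].append(r["hyp"])

# jiwer.wer accepts lists and pools errors across all pairs in a group.
wer_by_accent = {a: jiwer.wer(refs, hyps) for a, (refs, hyps) in groups.items()}
macro_accent_wer = sum(wer_by_accent.values()) / len(wer_by_accent)
worst_accent_wer = max(wer_by_accent.values())
accent_wer_gap = worst_accent_wer - min(wer_by_accent.values())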


Track 2 — Accent identification

Question: Can models identify British regional accent groups from raw audio?

Evaluate:

audio → accent_label

Use broad labels for the official benchmark:

Southern England
Midlands
Northern England
Welsh
Scottish
Irish
Northern Irish
uncertain / other

Keep fine labels as metadata:

accent_label_broad = "Northern England"
accent_label_fine = "Newcastle / Tyneside"
accent_self_reported = "Geordie"

Recommended baseline ladder:

Level | Model | Why
0 | Majority-class baseline | Sanity check
1 | MFCC + logistic regression / SVM | Classical baseline
2 | ECAPA-TDNN embeddings + classifier | Strong speaker/acoustic embedding baseline
3 | WavLM embeddings + classifier | Strong speech representation baseline
4 | XLS-R / Wav2Vec2 embeddings + classifier | Cross-lingual speech representation baseline
5 | Whisper encoder embeddings + classifier | Tests whether ASR encoders preserve accent cues
6 | Fine-tuned WavLM / XLS-R | Strong supervised classifier
7 | Audio-LLM prompted classifier | Tests instruction-following without a classifier head

CommonAccent provides a direct precedent for ECAPA-TDNN and Wav2Vec2/XLS-R accent-classification recipes. (arXiv)
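
A sketch of the embedding-plus-classifier levels, assuming the microsoft/wavlm-base-plus checkpoint and 16 kHz mono clips; the dummy clips and labels below are placeholders for your speaker-disjoint train split:

# Sketch: mean-pooled WavLM embeddings + logistic regression (Level 3).
# Swap in XLS-R or a Whisper encoder for Levels 4-5.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

def embed(waveform_16k):
    """One fixed-size vector per clip: mean-pool hidden states over time."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, hidden_dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Placeholder data; replace with real clips and labels from the train split.
train_wavs = [np.random.randn(16000).astype(np.float32) for _ in range(4)]
train_labels = ["Scottish", "Welsh", "Scottish", "Welsh"]

X_train = np.stack([embed(w) for w in train_wavs])
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)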

Report these metrics:

accuracy
macro-F1
balanced accuracy
per-accent recall
top-2 accuracy
confusion matrix
confidence calibration

Important: official results must use speaker-disjoint splits. Random clip splits can leak speaker identity and inflate accent-classification scores.


Track 3 — Accent-aware ASR

Question: Can an audio-LLM transcribe speech and identify the accent in one structured response?

Evaluate:

audio + instruction → transcript + accent_label + confidence + evidence

Example output:

{
  "transcript": "I went down to the station this morning.",
  "accent_label": "Northern England",
  "confidence": 0.72,
  "evidence": [
    "short BATH vowel",
    "non-rhotic pronunciation",
    "regional vowel quality"
  ]
}

Suggested prompt:

Listen to the audio and return only valid JSON.

Use this schema:
{
  "transcript": string,
  "accent_label": one of [
    "Southern England",
    "Midlands",
    "Northern England",
    "Welsh",
    "Scottish",
    "Irish",
    "Northern Irish",
    "uncertain"
  ],
  "confidence": number from 0 to 1,
  "evidence": list of at most 3 short strings
}

If the accent is unclear, use "uncertain".
Do not invent labels.

Metrics:

Metric | Why
Transcript WER | ASR quality
Accent accuracy | Accent-label correctness
Joint score | Transcript acceptable + accent correct
Valid JSON rate | Instruction-following reliability
Allowed-label rate | Whether the model obeys the label set
Confidence calibration | Whether confidence matches correctness
Evidence quality | Whether explanations are plausible
Uncertainty behavior | Whether “uncertain” is used appropriately
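
A minimal sketch of the valid-JSON and allowed-label checks, applied to raw model output strings (the example responses are illustrative):

# Sketch: scoring instruction-following reliability for Track 3 outputs.
import json

ALLOWED_LABELS = {
    "Southern England", "Midlands", "Northern England", "Welsh",
    "Scottish", "Irish", "Northern Irish", "uncertain",
}

def check_response(raw):
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "allowed_label": False}
    return {"valid_json": True,
            "allowed_label": obj.get("accent_label") in ALLOWED_LABELS}

responses = [
    '{"transcript": "hello", "accent_label": "Scottish", "confidence": 0.8, "evidence": []}',
    "Sorry, I cannot produce JSON.",
]
flags = [check_response(r) for r in responses]
valid_json_rate = sum(f["valid_json"] for f in flags) / len(flags)        # 0.5
allowed_label_rate = sum(f["allowed_label"] for f in flags) / len(flags)  # 0.5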

This is probably the cleanest LLM-focused task in your project.


Track 4 — Pronunciation probes

Question: Can models hear British-English pronunciation features?

ASR alone cannot capture this. Two speakers may both say “bath”, but one may use a short TRAP vowel and another a long BATH/PALM vowel. The transcript is identical; the pronunciation is not.

Useful probes:

Feature | Example words | Task
TRAP–BATH split | bath, path, grass, dance | short TRAP vs long BATH/PALM
Rhoticity | car, park, hard, farm | /r/ pronounced or not
Glottal /t/ | water, bottle, little | glottalized /t/ or not
STRUT vowel | cup, luck, strut | northern/southern-style vowel quality
Dialect words | nowt, owt, aye, wee | which word was spoken
Local pronouns | yous, wor | which form was spoken
Place names | local towns/stations | which place name was spoken

Example multiple-choice tasks:

In the word "bath", does the speaker use:
A. short TRAP vowel
B. long BATH/PALM vowel
C. unclear
In the word "car", is the /r/ pronounced?
A. yes
B. no
C. unclear
Which word was spoken?
A. nothing
B. nowt
C. not
D. unclear
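
One way to store these probes is one JSONL record per question; a hypothetical record, shown pretty-printed here (all field names are illustrative, and JSONL keeps each record on a single line):

{
  "probe_id": "bath_001_trap_bath",
  "audio_id": "clip_0412",
  "target_feature": "TRAP_BATH",
  "question": "In the word 'bath', does the speaker use:",
  "choices": ["short TRAP vowel", "long BATH/PALM vowel", "unclear"],
  "answer_index": 1
}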

Metrics:

overall probe accuracy
per-feature accuracy
per-accent accuracy
abstention rate
human agreement

This track is where your dataset becomes more than an ASR benchmark. It becomes a British accent understanding benchmark.


Track 5 — Fairness and group robustness

Question: Which accent/gender groups are underserved?

Use gender mainly as evaluation metadata, not necessarily as a prediction target.

Recommended metrics:

Metric | Meaning
Overall WER | Average transcription quality
Macro-accent WER | Treats accents equally
Worst-accent WER | Most underserved accent
Best-accent WER | Easiest accent
Accent WER gap | Worst minus best
Accent WER ratio | Worst divided by best
Gender WER gap | Difference by gender
Accent × gender WER gap | Intersectional difference
Worst-accent recall | Accent classifier’s weakest group
Calibration by group | Whether confidence is equally reliable

This matters because average WER can hide harm:

Model | Overall WER | Worst-accent WER | Accent gap
Model A | 7.8 | 18.5 | 13.2
Model B | 8.4 | 13.1 | 7.4

Model A looks better on average; Model B may be more inclusive.

ASR-FAIRBENCH gives a useful precedent for combining accuracy and equity in ASR evaluation. (arXiv)


4. The key LLM experiment: audio-only vs transcript-only

This is the experiment I would emphasize most.

Core question

Do audio-LLMs actually hear accent cues, or do they infer accent from words in the transcript?

Run four conditions:

Condition | Input | What it tests
Audio only | audio | Can the model hear accent directly?
Gold transcript only | human transcript | Can text alone reveal accent from dialect words?
ASR transcript only | model transcript | How strong is a standard ASR → LLM pipeline?
Audio + transcript | both | Does multimodal input help?

Why this matters:

A text-only LLM may guess Scottish from words like:

wee
aye
ken

or Northern England from:

nowt
owt
yous

But that does not prove it heard the accent. The audio-only condition tests whether the model uses acoustic information. This ablation would make your LLM contribution much stronger.
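
A sketch of generating the four conditions from a single evaluation item, reusing the Track 3 prompt; the item fields are placeholders:

# Sketch: building the four ablation conditions for one item.
def build_conditions(item, instruction):
    """Return model inputs for the four ablation conditions of one item."""
    def with_transcript(t):
        return instruction + "\nTranscript: " + t
    return {
        "audio_only": {"audio": item["audio"], "text": instruction},
        "gold_transcript_only": {"audio": None, "text": with_transcript(item["transcript_raw"])},
        "asr_transcript_only": {"audio": None, "text": with_transcript(item["asr_transcript"])},
        "audio_plus_transcript": {"audio": item["audio"], "text": with_transcript(item["transcript_raw"])},
    }

item = {
    "audio": "clip_0412.wav",
    "transcript_raw": "Aye, I've got nowt to do with it.",
    "asr_transcript": "I have got not to do with it.",
}
conditions = build_conditions(item, "Listen to the audio and return only valid JSON.")

Scoring all four conditions with the same Track 3 metrics isolates how much of the accent prediction comes from acoustics rather than lexical cues.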


5. Model families worth testing

ASR models

Model family | Why include
Qwen3-ASR | Current open ASR family; Qwen3-ASR supports language identification and ASR for 52 languages and dialects. (Hugging Face)
Whisper / Whisper turbo | Standard ASR baseline family.
Distil-Whisper | Efficient English ASR baseline.
NVIDIA Parakeet / Canary | Non-Whisper ASR baselines; useful for speed/accuracy diversity.
Voxtral transcription mode | Audio-LLM with dedicated transcription mode and long-context audio support. (Hugging Face)

Audio-LLMs

Model | Why include
Voxtral Mini / Small | Audio model with transcription, Q&A, summarization, and long-context audio support. (Hugging Face)
Qwen2.5-Omni | General multimodal audio/video/text model.
Phi-4-multimodal-instruct | Compact multimodal model with audio input.
Kimi-Audio-7B-Instruct | Open audio foundation model for understanding, generation, and conversation.
Ultravox | Direct speech-to-LLM style model.

Accent classifiers / encoders

Model type | Why include
MFCC + SVM/logistic regression | Simple baseline
ECAPA-TDNN embeddings | Strong speaker/acoustic embedding baseline
WavLM embeddings | Strong speech representation baseline
XLS-R / Wav2Vec2 embeddings | Cross-lingual speech representation baseline
Whisper encoder embeddings | Tests whether ASR features contain accent information
Fine-tuned WavLM / XLS-R | Likely strong supervised accent classifiers

WavLM is a good classification base because its model card describes it as an English pretrained speech model intended for downstream tasks including speech recognition and audio classification. (Hugging Face)


6. Dataset design essentials

Use speaker-disjoint splits

Bad:

Speaker 001 clip 1 → train
Speaker 001 clip 2 → test

Good:

Speaker 001 → train only
Speaker 024 → validation only
Speaker 047 → test only

This is non-negotiable for accent classification. Otherwise the model may learn speaker identity, microphone, room acoustics, or session artifacts.
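
A minimal sketch using scikit-learn's GroupShuffleSplit, assuming one row per clip and a parallel list of speaker IDs (both lists here are illustrative):

# Sketch: speaker-disjoint train/test indices; no speaker crosses the split.
from sklearn.model_selection import GroupShuffleSplit

clips = ["clip_0001.wav", "clip_0002.wav", "clip_0003.wav", "clip_0004.wav"]
speaker_ids = ["spk_001", "spk_001", "spk_024", "spk_047"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(clips, groups=speaker_ids))
# Every clip from a given speaker lands entirely in train or entirely in test.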

Keep raw and normalized transcripts

Example:

Raw:        "Aye, I've got nowt to do with it."
Normalized: "aye ive got nowt to do with it"

Do not normalize dialect words into standard English:

nowt → nothing
aye → yes
wee → small

That would erase the phenomenon you want to study.
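
A sketch of a normalizer that lowercases and strips punctuation for WER scoring while leaving dialect vocabulary untouched; it reproduces the example above:

# Sketch: WER normalization that deliberately preserves dialect words.
import re

def normalize_for_wer(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation and apostrophes
    return re.sub(r"\s+", " ", text).strip()

assert normalize_for_wer("Aye, I've got nowt to do with it.") == \
    "aye ive got nowt to do with it"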

Store broad + fine accent labels

accent_label_broad = "Northern England"
accent_label_fine = "Newcastle / Tyneside"
accent_self_reported = "Geordie"
accent_label_source = "self-reported"

Add prompt-level metadata

Field | Example
prompt_id | bath_001
target_words | bath,path,grass
target_feature | TRAP_BATH
contains_dialect_word | true
contains_place_name | true

Release in a no-script Hugging Face format

Use Parquet or AudioFolder with metadata CSV/JSONL. Hugging Face’s AudioFolder is designed to load audio datasets with thousands of files without requiring custom code. Dataset cards are also recommended to document contents, creation process, use context, and potential biases. (Hugging Face)

Suggested fields:

Field | Purpose
audio_id | Unique clip ID
speaker_id_hash | Pseudonymous speaker ID
split | Train/validation/test
audio | Audio file or audio column
duration_sec | Clip duration
transcript_raw | Original transcript
transcript_normalized | Normalized transcript used for WER scoring
accent_label_broad | Main benchmark label
accent_label_fine | Optional detailed label
accent_self_reported | Original self-description
accent_label_source | Label provenance
gender_self_described | Fairness slicing, if consented
target_feature | Pronunciation feature
recording_condition | Clean/noisy/phone/unknown
consent_scope | Usage permission summary
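
A sketch of an AudioFolder-style layout and loading call; the paths and the exact metadata columns are illustrative:

# Sketch: AudioFolder layout with per-clip metadata.
#
#   data/
#     train/
#       clip_0001.wav
#       clip_0002.wav
#       metadata.csv   # must have a file_name column plus your metadata fields
#
from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="data")
row = ds["train"][0]
print(row["audio"]["sampling_rate"], row["accent_label_broad"])  # assuming that column exists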

7. Suggested result tables

ASR summary

Model | Overall WER | Macro-accent WER | Worst-accent WER | Accent gap | RTFx
Qwen3-ASR-1.7B | | | | |
Parakeet | | | | |
Canary-Qwen | | | | |
Whisper large-v3 | | | | |
Whisper large-v3-turbo | | | | |
Distil-Whisper | | | | |
Voxtral Mini | | | | |

Accent classification

Model | Accuracy | Macro-F1 | Balanced accuracy | Worst-accent recall | Top-2 accuracy
Majority baseline | | | | |
MFCC + SVM | | | | |
ECAPA + classifier | | | | |
WavLM + classifier | | | | |
XLS-R + classifier | | | | |
Whisper encoder + classifier | | | | |
Audio-LLM prompt | | | | |

Audio-LLM ablation

Model | Audio only | Gold transcript only | ASR transcript only | Audio + transcript
Voxtral Mini | | | |
Qwen2.5-Omni | | | |
Phi-4-multimodal | | | |
Kimi-Audio | | | |

This last table is the most important one for your LLM-focused contribution.


8. Contribution statements you can use

Dataset contribution

We release a speaker-disjoint British regional accent speech dataset with audio, transcripts, gender metadata, accent labels, raw transcripts, normalized transcripts, and benchmark splits.

Benchmark contribution

We introduce a benchmark for British accent audio understanding, covering ASR robustness, accent identification, accent-aware ASR, pronunciation probes, and accent × gender fairness.

Audio-LLM contribution

We test whether audio-LLMs identify accents from acoustic cues or infer them from transcript/dialect words by comparing audio-only, gold-transcript-only, ASR-transcript-only, and audio+transcript settings.

Fairness contribution

We show that average WER is insufficient for British-accent evaluation and report macro-accent WER, worst-accent WER, accent gap, and accent × gender gaps.

Linguistic contribution

We connect model errors to British English pronunciation and dialect features, including TRAP–BATH variation, rhoticity, glottal /t/, dialect words, local pronouns, and place names.


9. Good title options

Title | Emphasis
British Accent Audio Understanding Benchmark | General benchmark framing
Benchmarking Open ASR Models and Audio-LLMs on British Regional Accents | Model evaluation
British Regional Accent Robustness for ASR and Audio-Language Models | Robustness/fairness
Accent-Aware Speech Recognition and Audio Understanding for British English | ASR + audio-LLM
Do Audio-LLMs Hear British Accents? | LLM-focused
Beyond WER: Evaluating British Accent Understanding in Open Speech Models | Pronunciation/fairness

Final compact recommendation

The strongest version of your project is:

A speaker-disjoint British Accent Audio Understanding Benchmark with ASR, accent ID, accent-aware ASR, pronunciation probes, audio-only vs transcript-only LLM ablations, and accent × gender fairness metrics.

This is stronger than a simple ASR dataset because it asks:

  1. Can the model transcribe?
  2. Can the model detect accent?
  3. Can the model identify pronunciation features?
  4. Can the model reason over audio with structured prompts?
  5. Can the model perform fairly across accent and gender groups?
  6. Can audio-LLMs really use audio, or are they relying on transcript clues?

Short summary

  • Do ASR, but do not stop at ASR.
  • Use audio classifiers for accent ID.
  • Use audio-LLMs for structured accent-aware tasks.
  • Add pronunciation probes to make the dataset linguistically meaningful.
  • Use speaker-disjoint splits.
  • Report per-accent, per-gender, and accent × gender metrics.
  • Keep raw + normalized transcripts.
  • Release in Parquet or AudioFolder with a strong dataset card.
  • Delay full audio-LLM fine-tuning until zero-shot results show a clear failure mode.