Gemma 4 E4B latency optimisation notes for a banking assistant pipeline

Gemma 4 is an excellent model, but it isn’t well-suited for GPUs with older architectures (this is generally true for models starting with Gemma 3). When it comes to latency, the issues fall broadly into two categories: those that can be resolved by changing the backend, and those that require rethinking the pipeline itself.

You are probably looking at a compound latency problem, not one missing flag.
Current setup:
- Model: google/gemma-4-E4B-it (or similar)
- Hardware:
  - NVIDIA L4: ~6 s end-to-end
  - NVIDIA H200: ~1.5 s end-to-end
- Pipeline:
  - ASR
  - text normalization
  - fuzzy / phonetic name correction
  - intent extraction
  - entity extraction
  - QnA
  - async FastAPI serving
- Target: ideally <500 ms
My main conclusion:
<500 ms is realistic for the common banking-command path only if the pipeline is decomposed.
<500 ms is unlikely for one all-in-one Gemma 4 E4B call that does audio → ASR → normalization → fuzzy matching → extraction → QnA on an L4.
The best path is not “just add FlashAttention” or “just use vLLM”. The best path is:
streaming ASR
-> deterministic normalization
-> external fuzzy / phonetic candidate lookup
-> fast intent/entity path
-> Gemma 4 E4B only for ambiguity, fallback, and QnA
1. Main likely causes
1.1 Gemma 4 E4B has an attention-backend constraint
Gemma 4 is not just a normal small dense decoder from a serving point of view. The important detail is its mixed attention layout:
- Sliding/local attention layers: head_dim = 256
- Global/full attention layers: global_head_dim = 512
That matters because the usual FlashAttention-2 path supports head dimensions up to 256, while Gemma 4 global attention layers need 512.
This is the key trap:
L4 can generally run FlashAttention-2.
But Gemma 4 E4B cannot be assumed to use FlashAttention-2 end-to-end,
because its global attention layers use global_head_dim=512.
Check logs for lines like:
Gemma4 model has heterogeneous head dimensions
Forcing TRITON_ATTN backend
Using AttentionBackendEnum.TRITON_ATTN
Using AttentionBackendEnum.FLASH_ATTN
Using AttentionBackendEnum.FLASHINFER
If your L4 run is forced onto TRITON_ATTN, that can explain a large part of the latency gap.
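If you want to confirm the layout from the checkpoint itself rather than from serving logs, a minimal config-inspection sketch like the one below can help. The attribute names (head_dim, global_head_dim, layer_types) are assumptions; adjust them to whatever the config of your exact checkpoint actually exposes.

```python
# Sketch: inspect the checkpoint config for the mixed attention layout.
# Attribute names below are assumptions -- check them against the config of
# your exact checkpoint (multimodal configs often nest a text_config).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-4-E4B-it")
text_config = getattr(config, "text_config", config)

for name in ("head_dim", "global_head_dim", "sliding_window", "layer_types"):
    value = getattr(text_config, name, None)
    if value is not None:
        print(f"{name} = {value}")
```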
1.2 L4 vs H200 is a huge memory-bandwidth mismatch
The L4/H200 latency gap is plausible even before considering software. LLM inference, especially decode at small batch sizes, is often memory-bandwidth sensitive.
Relevant hardware context:
Approximate memory bandwidth:
L4: ~300 GB/s
H200: ~4.8 TB/s
H200 / L4 bandwidth ratio:
4800 / 300 ≈ 16x
So a big L4/H200 gap does not necessarily mean your code is broken. It may mean:
lower memory bandwidth
+ Gemma 4 attention fallback
+ long prefill
+ audio encoder cost
+ structured-output overhead
= multi-second L4 latency
1.3 Audio inside Gemma is convenient, but probably not the lowest-latency ASR path
Gemma 4 E2B/E4B supports audio input.
That is useful for prototyping and multimodal reasoning, but for a sub-500 ms banking assistant, I would not use Gemma as the default ASR engine.
In vLLM, Gemma 4’s multimodal path is not necessarily optimized the same way as the language-model path; the vLLM Gemma 4 guide and model implementation notes are worth reading.
For low-latency voice systems, use a dedicated streaming ASR path where possible.
Architecture-wise:
Bad for latency:
full audio -> Gemma 4 -> ASR + extraction + QnA
Better:
streaming ASR -> text normalization -> fuzzy lookup -> extraction/QnA
1.4 Your workload is probably prefill-bound, not decode-bound
Intent/entity extraction usually emits a tiny JSON object:
{
"intent": "transfer_money",
"amount_minor": 500000,
"currency": "JPY",
"recipient_candidate_id": "p_001",
"needs_confirmation": true
}
That output may be only 30-80 tokens.
For short outputs, latency is often dominated by prefill: the model reading the prompt before producing the first token.
Your prompt may include:
- banking policy
- intent definitions
- entity schema
- tool descriptions
- JSON schema
- examples
- normalization instructions
- fuzzy-name candidate lists
- retrieved QnA context
If that becomes 1K-4K+ tokens, your hot path is likely dominated by input processing, not generation.
Prompt layout matters.
Good layout:
fixed system prompt
fixed banking rules
fixed schema instructions
fixed examples
variable transcript
variable recipient candidates
variable account context
Bad layout:
timestamp
request id
variable user data
fixed system prompt
fixed schema
fixed examples
If variable content is at the top, prefix caching is much less useful.
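As a concrete illustration, here is a minimal sketch of a prompt builder that keeps all fixed content in a byte-identical prefix and appends only per-request data at the end; the placeholder strings and section labels are assumptions.

```python
# Sketch: keep the static prefix byte-identical across requests so prefix
# caching can reuse its KV cache; only the tail varies per request.
SYSTEM_PROMPT = "You are a banking voice assistant..."       # fixed, placeholder
BANKING_RULES = "Never execute a transfer without..."         # fixed, placeholder
SCHEMA_INSTRUCTIONS = "Reply with JSON matching the schema."  # fixed, placeholder
FEW_SHOT_EXAMPLES = "Example 1: ..."                          # fixed, placeholder

STATIC_PREFIX = "\n".join(
    [SYSTEM_PROMPT, BANKING_RULES, SCHEMA_INSTRUCTIONS, FEW_SHOT_EXAMPLES]
)

def build_extraction_prompt(transcript: str, candidates_json: str, account_context: str) -> str:
    # Variable content goes last; timestamps and request IDs stay out of the prompt.
    return "\n".join([
        STATIC_PREFIX,
        f"Recipient candidates:\n{candidates_json}",
        f"Account context:\n{account_context}",
        f"Transcript:\n{transcript}",
    ])
```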
1.5 Structured output is necessary, but it has latency and correctness traps
Structured output is the right choice for a banking assistant. But it is not free.
For Gemma 4 specifically, also watch for thinking/parser issues.
Important point:
A faster JSON result is not necessarily an optimized constrained-JSON result.
It may be unconstrained text that happens to look like JSON.
For banking, verify:
Does every output parse?
Are required fields impossible to omit?
Are invalid enum values impossible?
Are extra fields blocked?
Can the model output prose before JSON?
Can it invent recipient IDs not in the candidate list?
A grammar can enforce JSON shape. Your application still needs to enforce banking semantics.
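A minimal sketch of that application-level layer, using Pydantic to reject outputs that parse as JSON but violate banking semantics; the field names and intent values are assumptions taken from the example schema above.

```python
# Sketch: the grammar guarantees JSON shape; this layer enforces banking semantics.
from enum import Enum
from pydantic import BaseModel, ConfigDict, ValidationError

class Intent(str, Enum):
    transfer_money = "transfer_money"
    balance_inquiry = "balance_inquiry"
    card_lock = "card_lock"

class Extraction(BaseModel):
    model_config = ConfigDict(extra="forbid")   # extra fields are rejected
    intent: Intent                              # invalid enum values are rejected
    amount_minor: int | None = None
    currency: str | None = None
    recipient_candidate_id: str | None = None
    needs_confirmation: bool

def validate_output(raw_json: str, allowed_candidate_ids: set[str]) -> Extraction | None:
    try:
        parsed = Extraction.model_validate_json(raw_json)
    except ValidationError:
        return None  # route to fallback / re-ask
    # The model must not invent recipient IDs outside the candidate list.
    if parsed.recipient_candidate_id and parsed.recipient_candidate_id not in allowed_candidate_ids:
        return None
    return parsed
```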
2. Best production optimizations
2.1 Split the system into multiple paths
Recommended architecture:
audio stream
-> VAD / endpointing
-> streaming ASR
-> transcript partials
-> deterministic normalization
-> fuzzy / phonetic candidate retrieval
-> fast intent/entity path
-> if high confidence:
policy validation + confirmation/tool call
-> if ambiguous:
Gemma 4 E4B short structured extraction
-> if open-ended:
Gemma 4 E4B QnA endpoint
Latency targets:
| Path | Target | Notes |
| --- | --- | --- |
| Simple command path | <500 ms | Realistic with streaming ASR + non-LLM preprocessing |
| Ambiguous Gemma extraction | 500 ms-2 s | More realistic on L4; faster on H200 |
| Full audio → Gemma → extraction → QnA | <500 ms | Unlikely on L4 |
| QnA | streaming | Optimize TTFT, not full completion latency |
Common banking commands are usually limited enough for a fast path:
- balance inquiry
- recent transactions
- transfer money
- card lock/unlock
- bill payment
- recipient lookup
- human handoff
- branch/ATM/product/policy QnA
Do not use the full generative path for every deterministic command.
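A minimal sketch of that fast path, handling unambiguous commands with no LLM call and falling through to Gemma only when needed; the patterns and intent names are illustrative, not a complete grammar.

```python
# Sketch: deterministic fast path for common, unambiguous banking commands.
# Anything that does not match falls through to the Gemma extraction endpoint.
import re

FAST_PATTERNS = [
    (re.compile(r"\b(balance|how much (money )?do i have)\b", re.I), "balance_inquiry"),
    (re.compile(r"\b(recent|last) transactions?\b", re.I), "recent_transactions"),
    (re.compile(r"\b(lock|freeze) (my )?card\b", re.I), "card_lock"),
    (re.compile(r"\bunlock (my )?card\b", re.I), "card_unlock"),
]

def route(transcript: str) -> dict:
    for pattern, intent in FAST_PATTERNS:
        if pattern.search(transcript):
            return {"path": "fast", "intent": intent}
    # Ambiguous or open-ended: go to the Gemma 4 E4B extraction / QnA endpoints.
    return {"path": "llm"}
```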
2.2 Move text normalization to code
Normalization should mostly be deterministic.
Examples:
"five thousand yen" -> 5000 JPY
"tomorrow morning" -> normalized date/time
"one two three four" -> account number fragment
"oh" vs "zero" -> digit correction
full-width / half-width -> normalized Japanese text
kana / romaji variants -> canonical search forms
This is faster and more auditable than relying on the LLM.
For banking, auditability matters. You want logs like:
{
"surface": "five thousand yen",
"normalized_amount_minor": 500000,
"currency": "JPY",
"rule": "currency_parser_v3"
}
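For illustration, a minimal sketch of a deterministic amount normalizer that emits an audit record in the shape shown above; the word list, rule name, and the ×100 minor-unit convention mirror the example and are assumptions, not a complete parser.

```python
# Sketch: deterministic amount normalization with an audit trail.
# Word list, rule name, and the *100 minor-unit convention are assumptions.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                  "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
MULTIPLIERS = {"hundred": 100, "thousand": 1000, "million": 1_000_000}

def normalize_amount(surface: str) -> dict | None:
    tokens = surface.lower().replace("yen", "").split()
    value = 0
    for token in tokens:
        if token in WORD_TO_NUMBER:
            value += WORD_TO_NUMBER[token]
        elif token in MULTIPLIERS:
            value = (value or 1) * MULTIPLIERS[token]
        else:
            return None  # not a pure amount phrase; leave it to other rules
    return {
        "surface": surface,
        "normalized_amount_minor": value * 100,  # same convention as the log above
        "currency": "JPY",
        "rule": "currency_parser_sketch",
    }

# normalize_amount("five thousand yen")
# -> {"surface": "five thousand yen", "normalized_amount_minor": 500000, ...}
```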
2.3 Move fuzzy / phonetic name correction outside the LLM
Do candidate generation outside the model:
ASR transcript span
-> text normalization
-> phonetic expansion
-> kana / romaji / kanji variants
-> edit distance / token similarity
-> account/contact/payee database lookup
-> top-k candidates
Pass only the top candidates to the model:
{
"heard_name": "sato ken",
"candidates": [
{
"candidate_id": "p_001",
"display_name": "佐藤 健",
"relationship": "recent_payee",
"score": 0.94
},
{
"candidate_id": "p_002",
"display_name": "斉藤 健",
"relationship": "saved_contact",
"score": 0.78
}
]
}
Then Gemma decides:
Is the intent clear?
Is the recipient unambiguous?
Should the assistant ask for confirmation?
Do not pass hundreds of names into the prompt.
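One way to produce that candidate payload outside the model is a fuzzy-matching library such as rapidfuzz (shown here as one option, not the only one); phonetic expansion and kana/romaji handling would sit in front of this step, and the payee data is a placeholder.

```python
# Sketch: top-k fuzzy candidate retrieval outside the LLM using rapidfuzz.
# Phonetic / kana / romaji expansion would normally run before this lookup.
from rapidfuzz import fuzz, process

PAYEES = {
    "p_001": "sato ken",     # 佐藤 健, stored in a searchable romanized form
    "p_002": "saito ken",    # 斉藤 健
    "p_003": "tanaka yumi",  # 田中 由美
}

def top_candidates(heard_name: str, limit: int = 3, min_score: float = 60.0) -> list[dict]:
    matches = process.extract(heard_name, PAYEES, scorer=fuzz.WRatio, limit=limit)
    return [
        {"candidate_id": key, "display_name": name, "score": round(score / 100, 2)}
        for name, score, key in matches
        if score >= min_score
    ]

# top_candidates("sato ken") -> candidates p_001 and p_002 with scores
```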
2.4 Separate extraction and QnA endpoints
Use different configs.
Extraction endpoint:
input: text only
output: shallow JSON
max_tokens: 32-96
temperature: 0
max_model_len: 1024-2048 initially
thinking: off only if schema enforcement is verified
MTP: off initially
prefix cache: on in production
schema: shallow
QnA endpoint:
input: text + compact retrieved/tool context
output: streamed natural language
max_tokens: 128-512+
temperature: low
MTP: test on/off
thinking: optional
structured output: off unless tool call needed
Reason:
Extraction is often prefill/schema-bound.
QnA is more decode-bound.
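A minimal sketch of the two call profiles against an OpenAI-compatible server. The guided_json extra parameter is a vLLM-specific structured-output option and should be verified against your installed version; the URL, schema, and intent values are assumptions.

```python
# Sketch: separate extraction and QnA call profiles against an
# OpenAI-compatible endpoint. URLs and schema are placeholders; the
# guided_json extra body is vLLM-specific -- verify it for your version.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["transfer_money", "balance_inquiry", "card_lock"]},
        "amount_minor": {"type": ["integer", "null"]},
        "recipient_candidate_id": {"type": ["string", "null"]},
        "needs_confirmation": {"type": "boolean"},
    },
    "required": ["intent", "needs_confirmation"],
    "additionalProperties": False,
}

def extract(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="google/gemma-4-E4B-it",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=96,
        temperature=0,
        extra_body={"guided_json": EXTRACTION_SCHEMA},  # structured output
    )
    return resp.choices[0].message.content

def answer(prompt: str):
    # QnA path: stream tokens and optimize time-to-first-token.
    return client.chat.completions.create(
        model="google/gemma-4-E4B-it",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.2,
        stream=True,
    )
```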
2.5 Keep the extraction schema shallow
Good hot-path schema:
{
"intent": "transfer_money",
"amount_minor": 500000,
"currency": "JPY",
"recipient_candidate_id": "p_001",
"needs_confirmation": true
}
Avoid hot-path schemas like:
{
"normalization_trace": [],
"policy_analysis": {},
"candidate_ranking_explanation": "",
"tool_plan": [],
"assistant_response": "",
"debug_reasoning": ""
}
For hot extraction, use:
intent
entities
needs_confirmation
candidate_id
confidence or ambiguity_reason
fallback_code
Avoid:
- long explanations
- model reasoning traces
- policy analysis
- candidate ranking explanation
- natural-language answer in the same extraction output
3. vLLM vs SGLang vs TensorRT-LLM vs FlashAttention
3.1 vLLM
Use vLLM as the first baseline.
Baseline command for text-only extraction testing:
vllm serve google/gemma-4-E4B-it \
--max-model-len 2048 \
--gpu-memory-utilization 0.90 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}'
Why text-only first?
Because you need to know whether the model path itself is fast
before adding audio, schema, fuzzy matching, and FastAPI orchestration.
Caveat: vLLM may force TRITON_ATTN for Gemma 4 E4B because of mixed head dimensions. If so, vLLM may be stable but not as fast as you expect.
3.2 SGLang
SGLang is worth testing for short structured extraction.
Where SGLang may help:
short JSON extraction
stable repeated prompt prefixes
agentic / multi-step language programs
structured-output-heavy workloads
Caveat: do not reach for aggressive defaults such as FP8 KV cache on the first run. Start conservatively:
python -m sglang.launch_server \
--model-path google/gemma-4-E4B-it \
--mem-fraction-static 0.90
Avoid FP8 KV initially. Start with BF16/auto KV.
My recommendation:
vLLM baseline first.
SGLang A/B test for text-only structured extraction.
Start SGLang with BF16/auto KV, not FP8 KV.
3.3 TensorRT-LLM
TensorRT-LLM is worth testing, especially on H200, but not as the first fix.
TensorRT-LLM is most attractive when:
hardware is H100/H200-class
deployment is NVIDIA-native
workload is stable
shapes are controlled
quantization path is validated
structured-output requirements are supported
Before committing, validate:
Gemma 4 E4B exact checkpoint
audio path
guided decoding / structured output
MTP / speculative decoding
KV-cache reuse
quantization format
L4 behavior
H200 behavior
p50/p95/p99 latency
3.4 FlashAttention
FlashAttention is not a simple fix here.
Accurate summary:
L4 can generally run FlashAttention-2.
Gemma 4 E4B cannot be assumed to use FlashAttention-2 end-to-end,
because Gemma 4 global attention layers use global_head_dim=512.
Do not force FlashAttention unless your exact engine version validates that it supports Gemma 4’s mixed layout.
4. Attention backend guidance
For Gemma 4 E4B, always log the actual backend.
On L4
| Backend | View |
| --- | --- |
| auto | Best first baseline |
| TRITON_ATTN | Likely safe fallback, possibly slower |
| FLASH_ATTN | Do not assume valid, because the global head dim is 512 |
| FLASHINFER | Test only if the engine accepts it |
| FA3 | Not the L4 answer |
| SDPA | Debug/correctness fallback |
On H200
| Backend | View |
| --- | --- |
| auto | Best first baseline |
| advanced Hopper paths | Worth testing if the engine supports Gemma 4 |
| TRITON_ATTN | Safe fallback |
| FLASH_ATTN | Still blocked if the 512 global head dim is unsupported |
| FLASHINFER | Worth testing only if accepted |
| SDPA | Debug/correctness fallback |
Main rule:
Do not choose a backend from generic benchmarks.
Choose based on what your engine actually uses for Gemma 4 E4B.
5. Batching recommendations
For real-time banking, do not optimize only for throughput. Optimize p95/p99 latency.
Use microbatching:
small max_num_seqs
small max_num_batched_tokens
minimal queue delay
short max_tokens
short prompt
prefix caching
Benchmark:
concurrency: 1, 2, 4, 8, 16
prompt tokens: 256, 512, 1024, 2048
output tokens: 16, 32, 64, 128
schema: off, shallow, production
audio: off, on
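A minimal sketch of that sweep, measuring per-request latency percentiles at several concurrency levels against an OpenAI-compatible endpoint; the URL, model name, and prompt stand-in are placeholders.

```python
# Sketch: concurrency sweep against an OpenAI-compatible endpoint,
# reporting p50/p95/p99 per-request latency. URL and model are placeholders.
import asyncio
import statistics
import time

import httpx

URL = "http://localhost:8000/v1/chat/completions"
BODY = {
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "balance please " * 128}],  # rough prompt-size stand-in
    "max_tokens": 32,
    "temperature": 0,
}

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(URL, json=BODY, timeout=60)
    resp.raise_for_status()
    return time.perf_counter() - start

async def sweep(concurrency: int, rounds: int = 10) -> None:
    latencies: list[float] = []
    async with httpx.AsyncClient() as client:
        for _ in range(rounds):
            latencies += await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    q = statistics.quantiles(latencies, n=100)
    print(f"concurrency={concurrency} p50={q[49]:.3f}s p95={q[94]:.3f}s p99={q[98]:.3f}s")

if __name__ == "__main__":
    for c in (1, 2, 4, 8, 16):
        asyncio.run(sweep(c))
```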
Extraction endpoint:
max_tokens: 32-96
temperature: 0
top_p: 1
schema: shallow
prompt: compact
QnA endpoint:
max_tokens: 128-512
streaming: on
MTP: test
retrieved context: capped
6. Quantization and KV cache
Start with BF16 / auto KV as the correctness baseline.
Then test separately:
BF16 weights + BF16/auto KV
quantized weights + BF16/auto KV
BF16 weights + FP8 KV
quantized weights + FP8 KV
Do not assume quantization improves latency. Sometimes it improves memory footprint but hurts latency if the kernel or dequantization path is poor.
For KV cache:
- use prefix caching for stable prompts
- avoid FP8 KV as the first SGLang Gemma 4 E4B test
- validate entity accuracy and JSON correctness after quantization
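To make the last point concrete, a minimal regression-check sketch that runs the same extraction prompts against a baseline and a quantized endpoint and compares parse rate and the fields that matter; the endpoint URLs and eval prompts are placeholders, and it assumes both endpoints are configured to return JSON.

```python
# Sketch: compare JSON validity and field agreement between a baseline and a
# quantized endpoint on the same prompts. Endpoints and eval set are placeholders.
import json
import httpx

ENDPOINTS = {
    "baseline": "http://localhost:8000/v1/chat/completions",
    "quantized": "http://localhost:8001/v1/chat/completions",
}
EVAL_PROMPTS = ["send five thousand yen to sato ken", "lock my card"]  # placeholder set

def run(url: str, prompt: str) -> dict | None:
    body = {
        "model": "google/gemma-4-E4B-it",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 96,
        "temperature": 0,
    }
    text = httpx.post(url, json=body, timeout=60).json()["choices"][0]["message"]["content"]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

parse_ok = {name: 0 for name in ENDPOINTS}
field_match = 0
for prompt in EVAL_PROMPTS:
    outputs = {name: run(url, prompt) for name, url in ENDPOINTS.items()}
    for name, out in outputs.items():
        parse_ok[name] += out is not None
    if all(outputs.values()):
        # Compare the fields that matter for banking correctness.
        keys = ("intent", "amount_minor", "recipient_candidate_id")
        field_match += all(outputs["baseline"].get(k) == outputs["quantized"].get(k) for k in keys)

print(f"parse rate: {parse_ok}, field agreement: {field_match}/{len(EVAL_PROMPTS)}")
```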
7. MTP / speculative decoding
Gemma 4 supports MTP-style acceleration.
But MTP mostly helps decode-heavy workloads.
For this pipeline:
| Workload | MTP usefulness |
| --- | --- |
| 30-80 token JSON extraction | probably limited |
| short answer, 100-200 tokens | worth testing |
| longer QnA, 256-512+ tokens | more likely useful |
| audio preprocessing | no direct help |
| prompt prefill | no direct help |
| fuzzy name correction | no help |
Use MTP for QnA experiments first, not the tiny extraction JSON path.
Example:
vllm serve google/gemma-4-E4B-it \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}' \
--speculative-config '{"method":"mtp","model":"google/gemma-4-E4B-it-assistant","num_speculative_tokens":4}'
Verify the exact syntax against your installed vLLM version.
8. FastAPI / async pipeline improvements
FastAPI is probably not the main cause of 6s latency, but it can become visible after model latency drops.
Avoid:
loading tokenizer/model/client per request
building huge schemas per request
CPU-bound fuzzy matching in the event loop
audio resampling in the event loop
blocking HTTP calls inside async handlers
unbounded request queues
serial execution of independent steps
Use:
persistent model client
connection pooling
uvloop
orjson
bounded queues
timeouts
request cancellation
process pool for CPU-bound fuzzy matching
streaming responses
early ASR partials
parallel normalization and candidate lookup
Instrument stages:
request_received
audio_upload_done
asr_start
asr_partial
asr_final
normalization_start
normalization_end
fuzzy_lookup_start
fuzzy_lookup_end
llm_request_start
llm_request_end
tool_validation_start
tool_validation_end
response_sent
Separate:
model latency
pipeline latency
queueing latency
network latency
CPU preprocessing latency
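A minimal sketch combining several of those points: a persistent client, a process pool for CPU-bound fuzzy matching, parallel normalization and candidate lookup, and per-stage timing. Handler names, helper functions, and timing fields are assumptions.

```python
# Sketch: async pipeline with a persistent client, a process pool for
# CPU-bound fuzzy matching, parallel stages, and per-stage timing.
# Helper functions and field names here are placeholders.
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

def fuzzy_lookup(name: str) -> list[dict]:          # CPU-bound; runs in the pool
    return [{"candidate_id": "p_001", "score": 0.94}]

def normalize(text: str) -> str:                    # deterministic, cheap
    return text.lower()

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.http = httpx.AsyncClient(base_url="http://localhost:8000")  # persistent client
    app.state.pool = ProcessPoolExecutor(max_workers=2)
    yield
    await app.state.http.aclose()
    app.state.pool.shutdown()

app = FastAPI(lifespan=lifespan)

@app.post("/command")
async def command(payload: dict):
    timings, t0 = {}, time.perf_counter()
    transcript = payload["transcript"]

    # Run normalization and fuzzy candidate lookup in parallel; the fuzzy
    # matching goes to the process pool so it never blocks the event loop.
    loop = asyncio.get_running_loop()
    normalized, candidates = await asyncio.gather(
        asyncio.to_thread(normalize, transcript),
        loop.run_in_executor(app.state.pool, fuzzy_lookup, transcript),
    )
    timings["preprocess_ms"] = (time.perf_counter() - t0) * 1000

    # LLM extraction / QnA would follow here, after cheap preprocessing.
    return {"normalized": normalized, "candidates": candidates, "timings": timings}
```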
9. Benchmark plan
Phase 1: text-only LLM baseline
Run Gemma 4 E4B without audio and without structured output.
vllm serve google/gemma-4-E4B-it \
--max-model-len 2048 \
--gpu-memory-utilization 0.90 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}'
Test:
| Prompt tokens | Output tokens |
| --- | --- |
| 128 | 32 |
| 512 | 32 |
| 1024 | 32 |
| 2048 | 32 |
Record:
attention backend
TTFT
TPOT / ITL
total latency
GPU utilization
GPU memory
p50 / p95 / p99
If this is already slow on L4, focus on model/backend/hardware.
Phase 2: structured-output overhead
Run the same text-only prompt in three modes:
free text
JSON instruction only
constrained JSON schema
Measure:
latency delta
JSON parse rate
schema validity
invalid enum prevention
required field enforcement
Phase 3: real banking extraction prompt
Add:
intent definitions
entity definitions
confirmation rules
recipient candidates
small policy summary
Sweep prompt sizes:
256
512
1024
2048
4096
If TTFT grows sharply, you are prefill-bound.
Phase 4: audio breakdown
Measure audio separately:
audio read/upload
decode/resample
feature extraction
audio encoder
transcription
normalization
LLM extraction
structured output
Do not diagnose only from end-to-end audio latency.
Phase 5: vLLM vs SGLang
Compare only the text structured extraction path first:
same model
same prompt
same schema
same max_tokens
same GPU
same concurrency
same KV dtype
same warmup
Test:
vLLM BF16/auto KV
SGLang BF16/auto KV
vLLM structured output
SGLang structured output
vLLM prefix caching
SGLang prefix-reuse behavior
Phase 6: QnA with MTP
After extraction is understood, test:
QnA max_tokens=128
QnA max_tokens=256
QnA max_tokens=512
Compare:
MTP off
MTP on
MTP is more likely to help here than in tiny JSON extraction.
10. Concrete next-step checklist
- Run text-only Gemma 4 E4B with no schema and no audio.
- Check attention backend logs.
- Measure TTFT and TPOT separately.
- Add shallow JSON schema and measure overhead.
- Add real banking prompt and run prompt-length sweep.
- Measure audio separately.
- Move normalization and fuzzy matching outside Gemma.
- Use prefix caching with stable prompt layout.
- A/B test vLLM vs SGLang for text-only structured extraction.
- Test MTP only for QnA-length outputs.
- Report p50/p95/p99, not only average latency.
- Treat sub-500 ms as the target for the simple command path, not the all-in-one Gemma path.
Final practical answer
For this specific setup:
Production default:
vLLM first
Structured-extraction challenger:
SGLang
H200 optimization candidate:
TensorRT-LLM
Attention backend:
auto first, log actual backend, expect TRITON_ATTN risk
FlashAttention:
not a simple fix because Gemma 4 global layers use global_head_dim=512
ASR:
move out of Gemma for the hot path
Normalization:
move to deterministic code
Fuzzy name correction:
move to candidate-generation service
Gemma 4 E4B:
use for ambiguity, fallback extraction, and QnA
500 ms:
realistic for decomposed simple command path
unlikely for all-in-one audio -> Gemma -> extraction -> QnA on L4
Short version:
- The current all-in-one Gemma 4 E4B flow is unlikely to hit <500 ms on L4.
- The most important model-specific issue is Gemma 4’s mixed attention layout: head_dim=256 local layers and global_head_dim=512 global layers.
- vLLM may force TRITON_ATTN, which can explain poor L4 latency.
- FlashAttention is not a simple fix because of Gemma 4’s 512-dimensional global attention heads.
- H200 is much faster largely because it has far more memory bandwidth and better high-end inference headroom.
- Move ASR, normalization, and fuzzy name correction outside Gemma.
- Use Gemma 4 E4B for ambiguity, fallback extraction, and QnA.
- Use short prompts, shallow schemas, prefix caching, and separate extraction/QnA endpoints.
- Benchmark vLLM first, SGLang for structured extraction, and TensorRT-LLM for H200 production experiments.