Gemma 4 E4B latency optimisations

Working on a banking assistant pipeline that uses Gemma 4 for ASR + normalization + intent extraction + entity extraction + QnA in a single flow. Current end-to-end latency is ~6 s on an NVIDIA L4 and ~1.5 s on an H200.

Pipeline includes:

  • ASR

  • text normalization

  • fuzzy/phonetic name correction

  • single-pass intent + entity extraction

  • async FastAPI serving

I’m trying to reduce latency further, ideally to under 500 ms.

Questions:

  1. What are the best optimizations for Gemma 4 inference in production?

  2. Would vLLM/TensorRT-LLM/Flash Attention significantly help for this workload?

  3. Any recommendations around batching, quantization, KV cache, or async pipeline improvements?

  4. Has anyone optimized small structured-output workloads like this on L4 specifically?

Would love suggestions from people deploying Gemma/Qwen/Llama models in real-time systems.

Gemma 4 is an excellent model, but it is not well suited to GPUs with older architectures (this has generally been true since Gemma 3).

When it comes to latency, the issues divide broadly into two categories: those that can be resolved by changing the backend, and those that require rethinking the pipeline itself:


Gemma 4 E4B latency optimisation notes for a banking assistant pipeline

You are probably looking at a compound latency problem, not one missing flag.

Current setup:

  • Model: google/gemma-4-E4B-it or similar
  • Hardware:
    • NVIDIA L4: ~6s end-to-end
    • NVIDIA H200: ~1.5s end-to-end
  • Pipeline:
    • ASR
    • text normalization
    • fuzzy / phonetic name correction
    • intent extraction
    • entity extraction
    • QnA
    • async FastAPI serving
  • Target: ideally <500 ms

My main conclusion:

<500 ms is realistic for the common banking-command path only if the pipeline is decomposed.
<500 ms is unlikely for one all-in-one Gemma 4 E4B call that does audio → ASR → normalization → fuzzy matching → extraction → QnA on an L4.

The best path is not “just add FlashAttention” or “just use vLLM”. The best path is:

streaming ASR
  -> deterministic normalization
  -> external fuzzy / phonetic candidate lookup
  -> fast intent/entity path
  -> Gemma 4 E4B only for ambiguity, fallback, and QnA

1. Main likely causes

1.1 Gemma 4 E4B has an attention-backend constraint

Gemma 4 is not just a normal small dense decoder from a serving point of view. The important detail is its mixed attention layout:

Sliding/local attention layers:
  head_dim = 256

Global/full attention layers:
  global_head_dim = 512

That matters because the usual FlashAttention-2 path supports head dimensions up to 256, while Gemma 4’s global attention layers need 512.

This is the key trap:

L4 can generally run FlashAttention-2.
But Gemma 4 E4B cannot be assumed to use FlashAttention-2 end-to-end,
because its global attention layers use global_head_dim=512.

Check logs for lines like:

Gemma4 model has heterogeneous head dimensions
Forcing TRITON_ATTN backend
Using AttentionBackendEnum.TRITON_ATTN
Using AttentionBackendEnum.FLASH_ATTN
Using AttentionBackendEnum.FLASHINFER

If your L4 run is forced onto TRITON_ATTN, that can explain a large part of the latency gap.


1.2 L4 vs H200 is a huge memory-bandwidth mismatch

The L4/H200 latency gap is plausible even before considering software. LLM inference, especially decode at small batch sizes, is often memory-bandwidth sensitive.

Relevant hardware context:

Approximate memory bandwidth:

L4:   ~300 GB/s
H200: ~4.8 TB/s

H200 / L4 bandwidth ratio:
  4800 / 300 ≈ 16x

So a big L4/H200 gap does not necessarily mean your code is broken. It may mean:

lower memory bandwidth
+ Gemma 4 attention fallback
+ long prefill
+ audio encoder cost
+ structured-output overhead
= multi-second L4 latency
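
A back-of-envelope check makes the bandwidth argument concrete. This assumes roughly 4B active parameters in BF16 — an illustrative figure for E4B, not a measured one:

```python
# Rough decode-latency model: at batch size 1, each generated token streams
# (approximately) the full active weight set through GPU memory.
def ms_per_token(active_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    weight_gb = active_params * bytes_per_param / 1e9
    return weight_gb / bandwidth_gb_s * 1000.0

# Illustrative: ~4e9 active params in BF16 (2 bytes/param).
l4_ms = ms_per_token(4e9, 2, 300)     # roughly 27 ms/token on L4
h200_ms = ms_per_token(4e9, 2, 4800)  # roughly 1.7 ms/token on H200
```

The ~16x gap mirrors the bandwidth ratio above; real figures also depend on KV-cache reads, kernel efficiency, and the true active-parameter count.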

1.3 Audio inside Gemma is convenient, but probably not the lowest-latency ASR path

Gemma 4 E2B/E4B supports audio input.

That is useful for prototyping and multimodal reasoning, but for a sub-500 ms banking assistant, I would not use Gemma as the default ASR engine.

In vLLM, Gemma 4’s multimodal path is not necessarily optimized the same way as the language-model path, so the vLLM Gemma 4 guide and model implementation notes are worth reading.

For low-latency voice systems, use a dedicated streaming ASR path where possible.

Architecture-wise:

Bad for latency:
  full audio -> Gemma 4 -> ASR + extraction + QnA

Better:
  streaming ASR -> text normalization -> fuzzy lookup -> extraction/QnA

1.4 Your workload is probably prefill-bound, not decode-bound

Intent/entity extraction usually emits a tiny JSON object:

{
  "intent": "transfer_money",
  "amount_minor": 500000,
  "currency": "JPY",
  "recipient_candidate_id": "p_001",
  "needs_confirmation": true
}

That output may be only 30-80 tokens.

For short outputs, latency is often dominated by prefill: the model reading the prompt before producing the first token.

Your prompt may include:

  • banking policy
  • intent definitions
  • entity schema
  • tool descriptions
  • JSON schema
  • examples
  • normalization instructions
  • fuzzy-name candidate lists
  • retrieved QnA context

If that becomes 1K-4K+ tokens, your hot path is likely dominated by input processing, not generation.

Prompt layout matters.

Good layout:

fixed system prompt
fixed banking rules
fixed schema instructions
fixed examples
variable transcript
variable recipient candidates
variable account context

Bad layout:

timestamp
request id
variable user data
fixed system prompt
fixed schema
fixed examples

If variable content is at the top, prefix caching is much less useful.
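
As a sketch, prompt assembly can enforce the good layout mechanically. All names and strings here are illustrative:

```python
# Keep all fixed content in one stable prefix so the engine's prefix cache
# can skip re-prefilling it on every request.
SYSTEM_PROMPT = "You are a banking assistant."
BANKING_RULES = "Never execute a transfer without explicit confirmation."
SCHEMA_INSTRUCTIONS = "Respond only with JSON matching the extraction schema."
EXAMPLES = "Example: ..."

# Fixed blocks first, in a deterministic order: identical prefix every request.
FIXED_PREFIX = "\n\n".join([SYSTEM_PROMPT, BANKING_RULES, SCHEMA_INSTRUCTIONS, EXAMPLES])

def build_prompt(account_context: str, candidates: str, transcript: str) -> str:
    # Variable content goes strictly after the fixed prefix; never put
    # timestamps or request IDs near the top of the prompt.
    return "\n\n".join([FIXED_PREFIX, account_context, candidates, transcript])

p1 = build_prompt("ctx: ...", "candidates: ...", "send five thousand yen to sato")
p2 = build_prompt("ctx: ...", "candidates: ...", "what is my balance")
```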


1.5 Structured output is necessary, but it has latency and correctness traps

Structured output is the right choice for a banking assistant. But it is not free.

For Gemma 4 specifically, also watch for thinking-mode and output-parser interaction issues.

Important point:

A faster JSON result is not necessarily an optimized constrained-JSON result.
It may be unconstrained text that happens to look like JSON.

For banking, verify:

Does every output parse?
Are required fields impossible to omit?
Are invalid enum values impossible?
Are extra fields blocked?
Can the model output prose before JSON?
Can it invent recipient IDs not in the candidate list?

A grammar can enforce JSON shape. Your application still needs to enforce banking semantics.
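
A minimal sketch of that semantic layer, assuming the shallow schema shown earlier (the intent list and helper name are illustrative):

```python
import json

# Application-level semantic checks on top of grammar-constrained JSON.
ALLOWED_INTENTS = {"transfer_money", "balance_inquiry", "card_lock_unlock", "fallback"}
REQUIRED_FIELDS = {"intent", "needs_confirmation"}

def validate_extraction(raw: str, candidate_ids: set) -> dict:
    obj = json.loads(raw)  # the grammar should guarantee this parses; verify anyway
    missing = REQUIRED_FIELDS - obj.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if obj["intent"] not in ALLOWED_INTENTS:
        raise ValueError(f"invalid intent: {obj['intent']!r}")
    rid = obj.get("recipient_candidate_id")
    if rid is not None and rid not in candidate_ids:
        # the model must not invent recipient IDs outside the candidate list
        raise ValueError(f"unknown recipient_candidate_id: {rid!r}")
    return obj
```

The grammar engine enforces shape; this layer enforces the banking semantics (valid intents, known recipient IDs).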


2. Best production optimizations

2.1 Split the system into multiple paths

Recommended architecture:

audio stream
  -> VAD / endpointing
  -> streaming ASR
  -> transcript partials
  -> deterministic normalization
  -> fuzzy / phonetic candidate retrieval
  -> fast intent/entity path
      -> if high confidence:
            policy validation + confirmation/tool call
      -> if ambiguous:
            Gemma 4 E4B short structured extraction
      -> if open-ended:
            Gemma 4 E4B QnA endpoint

Latency targets:

| Path | Target | Notes |
|---|---|---|
| Simple command path | <500 ms | Realistic with streaming ASR + non-LLM preprocessing |
| Ambiguous Gemma extraction | 500 ms-2 s | More realistic on L4; faster on H200 |
| Full audio → Gemma → extraction → QnA | <500 ms | Unlikely on L4 |
| QnA streaming | n/a | Optimize TTFT, not full completion latency |

Common banking commands are usually limited enough for a fast path:

  • balance inquiry
  • recent transactions
  • transfer money
  • card lock/unlock
  • bill payment
  • recipient lookup
  • human handoff
  • branch/ATM/product/policy QnA

Do not use the full generative path for every deterministic command.
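
The fast path for those commands can start as simple rules. A minimal sketch; the patterns and intent labels are illustrative, and a real system would use a richer rule set or a lightweight trained classifier with confidence scores:

```python
import re

# Non-LLM fast path for clearly recognizable banking commands.
FAST_PATTERNS = [
    (re.compile(r"\b(balance|how much.*(account|have))\b", re.I), "balance_inquiry"),
    (re.compile(r"\b(recent|last)\s+transactions?\b", re.I), "recent_transactions"),
    (re.compile(r"\b(transfer|send)\b.*\b(yen|dollars?|to)\b", re.I), "transfer_money"),
    (re.compile(r"\b(lock|freeze|unlock)\b.*\bcard\b", re.I), "card_lock_unlock"),
]

def fast_intent(text: str):
    """Return an intent for unambiguous commands, else None
    (meaning: escalate to the Gemma structured-extraction path)."""
    for pattern, intent in FAST_PATTERNS:
        if pattern.search(text):
            return intent
    return None
```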


2.2 Move text normalization to code

Normalization should mostly be deterministic.

Examples:

"five thousand yen"       -> 5000 JPY
"tomorrow morning"        -> normalized date/time
"one two three four"      -> account number fragment
"oh" vs "zero"            -> digit correction
full-width / half-width   -> normalized Japanese text
kana / romaji variants    -> canonical search forms

This is faster and more auditable than relying on the LLM.

For banking, auditability matters. You want logs like:

{
  "surface": "five thousand yen",
  "normalized_amount_minor": 500000,
  "currency": "JPY",
  "rule": "currency_parser_v3"
}
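
A minimal sketch of such a rule, supporting only a tiny number-word vocabulary. The word list and rule name are illustrative; note that JPY has no minor unit, so minor units equal yen here:

```python
# Deterministic, auditable amount normalization (toy vocabulary).
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SCALES = {"hundred": 100, "thousand": 1000, "million": 1_000_000}

def normalize_amount(surface: str) -> dict:
    total, current, currency = 0, 0, None
    for tok in surface.lower().split():
        if tok in WORDS:
            current += WORDS[tok]
        elif tok in SCALES:
            current = max(current, 1) * SCALES[tok]
            if SCALES[tok] >= 1000:  # flush completed thousand/million group
                total += current
                current = 0
        elif tok in ("yen", "jpy"):
            currency = "JPY"
    total += current
    return {"surface": surface, "normalized_amount_minor": total,
            "currency": currency, "rule": "currency_parser_sketch_v0"}
```

Every output carries the rule name, so each normalization decision is traceable in logs.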

2.3 Move fuzzy / phonetic name correction outside the LLM

Do candidate generation outside the model:

ASR transcript span
  -> text normalization
  -> phonetic expansion
  -> kana / romaji / kanji variants
  -> edit distance / token similarity
  -> account/contact/payee database lookup
  -> top-k candidates

Pass only the top candidates to the model:

{
  "heard_name": "sato ken",
  "candidates": [
    {
      "candidate_id": "p_001",
      "display_name": "佐藤 健",
      "relationship": "recent_payee",
      "score": 0.94
    },
    {
      "candidate_id": "p_002",
      "display_name": "斉藤 健",
      "relationship": "saved_contact",
      "score": 0.78
    }
  ]
}

Then Gemma decides:

Is the intent clear?
Is the recipient unambiguous?
Should the assistant ask for confirmation?

Do not pass hundreds of names into the prompt.
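
A stdlib-only sketch of the candidate lookup. A real system would add kana/romaji phonetic expansion and query a payee database; the names and scores here are illustrative:

```python
import difflib

# Tiny in-memory "payee database" for illustration.
PAYEES = {
    "p_001": "sato ken",
    "p_002": "saito ken",
    "p_003": "tanaka yuki",
}

def top_candidates(heard_name: str, k: int = 2) -> list:
    # Score every payee by string similarity, keep only the top-k.
    scored = [
        {"candidate_id": cid,
         "score": round(difflib.SequenceMatcher(None, heard_name, name).ratio(), 2)}
        for cid, name in PAYEES.items()
    ]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:k]  # pass only these to the model, never the full list
```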


2.4 Separate extraction and QnA endpoints

Use different configs.

Extraction endpoint:

input: text only
output: shallow JSON
max_tokens: 32-96
temperature: 0
max_model_len: 1024-2048 initially
thinking: off only if schema enforcement is verified
MTP: off initially
prefix cache: on in production
schema: shallow

QnA endpoint:

input: text + compact retrieved/tool context
output: streamed natural language
max_tokens: 128-512+
temperature: low
MTP: test on/off
thinking: optional
structured output: off unless tool call needed

Reason:

Extraction is often prefill/schema-bound.
QnA is more decode-bound.
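
As a sketch, the two request configurations might look like this against an OpenAI-compatible endpoint. The values are starting points from the ranges above, not tuned settings:

```python
# Per-endpoint sampling configs; parameter names follow the common
# chat-completions API.
EXTRACTION_PARAMS = {
    "max_tokens": 64,     # shallow JSON fits in 32-96 tokens
    "temperature": 0,     # deterministic extraction
    # schema enforcement (response_format / guided decoding) goes here,
    # using whatever your serving engine supports
}

QNA_PARAMS = {
    "max_tokens": 384,
    "temperature": 0.3,   # low, but not fully greedy, for natural answers
    "stream": True,       # optimize TTFT, not full completion latency
}
```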

2.5 Keep the extraction schema shallow

Good hot-path schema:

{
  "intent": "transfer_money",
  "amount_minor": 500000,
  "currency": "JPY",
  "recipient_candidate_id": "p_001",
  "needs_confirmation": true
}

Avoid hot-path schemas like:

{
  "normalization_trace": [],
  "policy_analysis": {},
  "candidate_ranking_explanation": "",
  "tool_plan": [],
  "assistant_response": "",
  "debug_reasoning": ""
}

For hot extraction, use:

  • intent
  • entities
  • needs_confirmation
  • candidate_id
  • confidence or ambiguity_reason
  • fallback_code

Avoid:

  • long explanations
  • model reasoning traces
  • policy analysis
  • candidate ranking explanation
  • natural-language answer in the same extraction output

3. vLLM vs SGLang vs TensorRT-LLM vs FlashAttention

3.1 vLLM

Use vLLM as the first baseline.

Baseline command for text-only extraction testing:

vllm serve google/gemma-4-E4B-it \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'

Why text-only first?

Because you need to know whether the model path itself is fast
before adding audio, schema, fuzzy matching, and FastAPI orchestration.

Caveat: vLLM may force TRITON_ATTN for Gemma 4 E4B because of mixed head dimensions. If so, vLLM may be stable but not as fast as you expect.


3.2 SGLang

SGLang is worth testing for short structured extraction.

Where SGLang may help:

short JSON extraction
stable repeated prompt prefixes
agentic / multi-step language programs
structured-output-heavy workloads

Caveat: defer KV-cache quantization until the baseline is validated. So start conservatively:

python -m sglang.launch_server \
  --model-path google/gemma-4-E4B-it \
  --mem-fraction-static 0.90

Avoid FP8 KV initially. Start with BF16/auto KV.

My recommendation:

vLLM baseline first.
SGLang A/B test for text-only structured extraction.
Start SGLang with BF16/auto KV, not FP8 KV.

3.3 TensorRT-LLM

TensorRT-LLM is worth testing, especially on H200, but not as the first fix.

TensorRT-LLM is most attractive when:

hardware is H100/H200-class
deployment is NVIDIA-native
workload is stable
shapes are controlled
quantization path is validated
structured-output requirements are supported

Before committing, validate:

Gemma 4 E4B exact checkpoint
audio path
guided decoding / structured output
MTP / speculative decoding
KV-cache reuse
quantization format
L4 behavior
H200 behavior
p50/p95/p99 latency

3.4 FlashAttention

FlashAttention is not a simple fix here.

Accurate summary:

L4 can generally run FlashAttention-2.
Gemma 4 E4B cannot be assumed to use FlashAttention-2 end-to-end,
because Gemma 4 global attention layers use global_head_dim=512.

Do not force FlashAttention unless your exact engine version validates that it supports Gemma 4’s mixed layout.


4. Attention backend guidance

For Gemma 4 E4B, always log the actual backend.

On L4

| Backend | View |
|---|---|
| auto | Best first baseline |
| TRITON_ATTN | Likely safe fallback, possibly slower |
| FLASH_ATTN | Do not assume it is valid; global head dim is 512 |
| FLASHINFER | Test only if the engine accepts it |
| FA3 | Not the L4 answer |
| SDPA | Debug/correctness fallback |

On H200

| Backend | View |
|---|---|
| auto | Best first baseline |
| advanced Hopper paths | Worth testing if the engine supports Gemma 4 |
| TRITON_ATTN | Safe fallback |
| FLASH_ATTN | Still blocked if the 512 global head dim is unsupported |
| FLASHINFER | Worth testing only if accepted |
| SDPA | Debug/correctness fallback |

Main rule:

Do not choose a backend from generic benchmarks.
Choose based on what your engine actually uses for Gemma 4 E4B.

5. Batching recommendations

For real-time banking, do not optimize only for throughput. Optimize p95/p99 latency.

Use microbatching:

small max_num_seqs
small max_num_batched_tokens
minimal queue delay
short max_tokens
short prompt
prefix caching

Benchmark:

concurrency: 1, 2, 4, 8, 16
prompt tokens: 256, 512, 1024, 2048
output tokens: 16, 32, 64, 128
schema: off, shallow, production
audio: off, on

Extraction endpoint:

max_tokens: 32-96
temperature: 0
top_p: 1
schema: shallow
prompt: compact

QnA endpoint:

max_tokens: 128-512
streaming: on
MTP: test
retrieved context: capped

6. Quantization and KV cache

Start with BF16 / auto KV as the correctness baseline.

Then test separately:

BF16 weights + BF16/auto KV
quantized weights + BF16/auto KV
BF16 weights + FP8 KV
quantized weights + FP8 KV

Do not assume quantization improves latency. Sometimes it improves memory footprint but hurts latency if the kernel or dequantization path is poor.

For KV cache:

  • use prefix caching for stable prompts
  • avoid FP8 KV as the first SGLang Gemma 4 E4B test
  • validate entity accuracy and JSON correctness after quantization
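
The four weight/KV combinations above can be enumerated mechanically so no cell of the test matrix gets skipped. A trivial sketch with illustrative labels:

```python
import itertools

WEIGHT_OPTIONS = ["bf16", "quantized"]
KV_OPTIONS = ["bf16_auto", "fp8"]

# One entry per (weights, kv) pair, each carrying the checks to run.
test_matrix = [
    {"weights": w, "kv_cache": kv,
     "checks": ["latency", "entity_accuracy", "json_validity"]}
    for w, kv in itertools.product(WEIGHT_OPTIONS, KV_OPTIONS)
]
```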


7. MTP / speculative decoding

Gemma 4 supports MTP-style acceleration.

But MTP mostly helps decode-heavy workloads.

For this pipeline:

| Workload | MTP usefulness |
|---|---|
| 30-80 token JSON extraction | probably limited |
| short answer, 100-200 tokens | worth testing |
| longer QnA, 256-512+ tokens | more likely useful |
| audio preprocessing | no direct help |
| prompt prefill | no direct help |
| fuzzy name correction | no help |

Use MTP for QnA experiments first, not the tiny extraction JSON path.

Example:

vllm serve google/gemma-4-E4B-it \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}' \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-E4B-it-assistant","num_speculative_tokens":4}'

Verify the exact syntax against your installed vLLM version.


8. FastAPI / async pipeline improvements

FastAPI is probably not the main cause of 6s latency, but it can become visible after model latency drops.

Avoid:

loading tokenizer/model/client per request
building huge schemas per request
CPU-bound fuzzy matching in the event loop
audio resampling in the event loop
blocking HTTP calls inside async handlers
unbounded request queues
serial execution of independent steps

Use:

persistent model client
connection pooling
uvloop
orjson
bounded queues
timeouts
request cancellation
process pool for CPU-bound fuzzy matching
streaming responses
early ASR partials
parallel normalization and candidate lookup
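
A sketch combining the last three items: normalization and fuzzy lookup overlapped, with the CPU-bound lookup kept off the event loop. A thread pool keeps the sketch self-contained; use a ProcessPoolExecutor in production. The function bodies are placeholders:

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor

def fuzzy_lookup(transcript: str) -> list:
    # CPU-bound: phonetic expansion + edit distance over the payee database
    return ["p_001", "p_002"]

async def normalize(text: str) -> str:
    # cheap deterministic normalization
    return text.lower()

async def preprocess(transcript: str, pool: Executor):
    loop = asyncio.get_running_loop()
    # Run both independent steps concurrently instead of serially.
    normalized, candidates = await asyncio.gather(
        normalize(transcript),
        loop.run_in_executor(pool, fuzzy_lookup, transcript),
    )
    return normalized, candidates

with ThreadPoolExecutor(max_workers=2) as pool:
    normalized, candidates = asyncio.run(preprocess("Send money to Sato", pool))
```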

Instrument stages:

request_received
audio_upload_done
asr_start
asr_partial
asr_final
normalization_start
normalization_end
fuzzy_lookup_start
fuzzy_lookup_end
llm_request_start
llm_request_end
tool_validation_start
tool_validation_end
response_sent
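
A minimal sketch of that instrumentation; the stage names mirror the list above:

```python
import time

# Per-stage timestamping so model, pipeline, and queueing latency can be
# separated in logs.
class StageTimer:
    def __init__(self):
        self.marks = {}

    def mark(self, stage: str) -> None:
        self.marks[stage] = time.perf_counter()

    def delta_ms(self, start: str, end: str) -> float:
        return (self.marks[end] - self.marks[start]) * 1000.0

t = StageTimer()
t.mark("request_received")
t.mark("llm_request_start")
t.mark("llm_request_end")
t.mark("response_sent")
model_ms = t.delta_ms("llm_request_start", "llm_request_end")
total_ms = t.delta_ms("request_received", "response_sent")
```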

Separate:

model latency
pipeline latency
queueing latency
network latency
CPU preprocessing latency

9. Benchmark plan

Phase 1: text-only LLM baseline

Run Gemma 4 E4B without audio and without structured output.

vllm serve google/gemma-4-E4B-it \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'

Test:

| Prompt tokens | Output tokens |
|---|---|
| 128 | 32 |
| 512 | 32 |
| 1024 | 32 |
| 2048 | 32 |

Record:

attention backend
TTFT
TPOT / ITL
total latency
GPU utilization
GPU memory
p50 / p95 / p99

If this is already slow on L4, focus on model/backend/hardware.
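
Given per-token arrival timestamps from a streaming client (how you collect them is client/engine specific and omitted here), TTFT and TPOT fall out directly. A sketch:

```python
def summarize(request_start: float, token_times: list) -> dict:
    # TTFT: delay until the first streamed token arrives.
    ttft = token_times[0] - request_start
    # TPOT: time per output token, measured over the decode phase only.
    tpot = ((token_times[-1] - token_times[0]) / (len(token_times) - 1)
            if len(token_times) > 1 else 0.0)
    return {"ttft_s": ttft, "tpot_s": tpot,
            "total_s": token_times[-1] - request_start}

stats = summarize(0.0, [0.8, 0.83, 0.86, 0.89])  # illustrative timestamps
```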


Phase 2: structured-output overhead

Run the same text-only prompt in three modes:

free text
JSON instruction only
constrained JSON schema

Measure:

latency delta
JSON parse rate
schema validity
invalid enum prevention
required field enforcement

Phase 3: real banking extraction prompt

Add:

intent definitions
entity definitions
confirmation rules
recipient candidates
small policy summary

Sweep prompt sizes:

256
512
1024
2048
4096

If TTFT grows sharply, you are prefill-bound.


Phase 4: audio breakdown

Measure audio separately:

audio read/upload
decode/resample
feature extraction
audio encoder
transcription
normalization
LLM extraction
structured output

Do not diagnose only from end-to-end audio latency.


Phase 5: vLLM vs SGLang

Compare only the text structured extraction path first:

same model
same prompt
same schema
same max_tokens
same GPU
same concurrency
same KV dtype
same warmup

Test:

vLLM BF16/auto KV
SGLang BF16/auto KV
vLLM structured output
SGLang structured output
vLLM prefix caching
SGLang prefix-reuse behavior

Phase 6: QnA with MTP

After extraction is understood, test:

QnA max_tokens=128
QnA max_tokens=256
QnA max_tokens=512

Compare:

MTP off
MTP on

MTP is more likely to help here than in tiny JSON extraction.


10. Concrete next-step checklist

  1. Run text-only Gemma 4 E4B with no schema and no audio.
  2. Check attention backend logs.
  3. Measure TTFT and TPOT separately.
  4. Add shallow JSON schema and measure overhead.
  5. Add real banking prompt and run prompt-length sweep.
  6. Measure audio separately.
  7. Move normalization and fuzzy matching outside Gemma.
  8. Use prefix caching with stable prompt layout.
  9. A/B test vLLM vs SGLang for text-only structured extraction.
  10. Test MTP only for QnA-length outputs.
  11. Report p50/p95/p99, not only average latency.
  12. Treat sub-500 ms as the target for the simple command path, not the all-in-one Gemma path.

Final practical answer

For this specific setup:

Production default:
  vLLM first

Structured-extraction challenger:
  SGLang

H200 optimization candidate:
  TensorRT-LLM

Attention backend:
  auto first, log actual backend, expect TRITON_ATTN risk

FlashAttention:
  not a simple fix because Gemma 4 global layers use head_dim=512

ASR:
  move out of Gemma for the hot path

Normalization:
  move to deterministic code

Fuzzy name correction:
  move to candidate-generation service

Gemma 4 E4B:
  use for ambiguity, fallback extraction, and QnA

500 ms:
  realistic for decomposed simple command path
  unlikely for all-in-one audio -> Gemma -> extraction -> QnA on L4

Short version:

  • The current all-in-one Gemma 4 E4B flow is unlikely to hit <500 ms on L4.
  • The most important model-specific issue is Gemma 4’s mixed attention layout: head_dim=256 local layers and global_head_dim=512 global layers.
  • vLLM may force TRITON_ATTN, which can explain poor L4 latency.
  • FlashAttention is not a simple fix because of Gemma 4’s 512-dimensional global attention heads.
  • H200 is much faster largely because it has far more memory bandwidth and better high-end inference headroom.
  • Move ASR, normalization, and fuzzy name correction outside Gemma.
  • Use Gemma 4 E4B for ambiguity, fallback extraction, and QnA.
  • Use short prompts, shallow schemas, prefix caching, and separate extraction/QnA endpoints.
  • Benchmark vLLM first, SGLang for structured extraction, and TensorRT-LLM for H200 production experiments.