Fine-tuning microsoft/harrier-oss-v1-270m with SentenceTransformerTrainer — is it supported?

I recently fine-tuned BAAI/bge-m3 for a Portuguese QA retrieval task using SentenceTransformerTrainer with MultipleNegativesRankingLoss, and it works well.

I’d now like to try microsoft/harrier-oss-v1-270m as the base model, since it achieves better results on Multilingual MTEB v2. The model card confirms it is compatible with SentenceTransformers, so that part is clear.

However, I have some questions specific to fine-tuning this model:

  1. The model card states that queries should include a task instruction (e.g. Instruct: ... Query: ...) but documents should not. When fine-tuning with MultipleNegativesRankingLoss, should the instruction prefix be applied to the anchor texts during training, or only at inference?
  2. Are there any known challenges or recommended adaptations when fine-tuning decoder-only embedding models with SentenceTransformers, compared to encoder-based models like BGE-M3?
  3. Any recommended starting hyperparameters (learning rate, batch size) for this architecture?

Any guidance or pointers to examples would be appreciated.

Is this practically supported?


Fine-tuning microsoft/harrier-oss-v1-270m with SentenceTransformerTrainer

Yes — this should be a supported and reasonable setup, with one important caveat: I have not found a public example that exactly combines:

microsoft/harrier-oss-v1-270m
+ SentenceTransformerTrainer
+ MultipleNegativesRankingLoss
+ Portuguese QA retrieval

However, the evidence strongly points to this being a valid path:

  • microsoft/harrier-oss-v1-270m is packaged as a sentence-transformers model and can be loaded with SentenceTransformer("microsoft/harrier-oss-v1-270m").
  • The model card explicitly shows Sentence Transformers usage and encodes queries with a prompt while encoding documents without a prompt.
  • The model card says Harrier uses a decoder-only architecture, last-token pooling, and L2-normalized embeddings.
  • The model card also says query instructions are how the model is trained and that omitting them can degrade performance; document-side instructions are not needed.
  • SentenceTransformerTrainingArguments supports training-time prompts, including column-specific prompt mappings.
  • MultipleNegativesRankingLoss is the standard Sentence Transformers loss for positive (query, document) / (anchor, positive) retrieval pairs.
  • Nearby public examples exist, especially a Harrier-family Vietnamese legal retrieval model using Sentence Transformers + MNRL, and SkillRet-style decoder-embedding fine-tuning using query instructions and unprompted documents.

My recommendation is:

training query / anchor:      Instruct: ...\nQuery: <Portuguese question>
training document / positive: <raw Portuguese passage>

inference query:              Instruct: ...\nQuery: <Portuguese question>
inference document:           <raw Portuguese passage>

So: apply the instruction prefix to the query/anchor side during training, not only at inference. Keep documents/passages unprompted.


1. Should the instruction be applied during training?

Yes. For Harrier, the instruction is not just an inference-time decoration. It is part of the expected query-side input format.

The Harrier-270M model card shows this Sentence Transformers pattern:

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

The same model card also shows the raw Transformers pattern:

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

# Each query must come with a one-sentence instruction that describes the task.
# No need to add instruction for retrieval documents.

The FAQ is especially relevant: it says query instructions are how the model is trained, omitting them causes degradation, and document-side instructions are not needed.

So for Portuguese QA retrieval, I would train with:

query / anchor:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: <question>

document / positive:
<passage>

and infer with exactly the same policy:

query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: <question>

document:
<passage>

This avoids a train/inference mismatch.
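The prompt policy above is easy to centralize so that training and inference can never drift apart. A minimal sketch; `build_query` and `build_document` are hypothetical helper names, and the instruction text is the one proposed in this answer:

```python
# Hypothetical helpers enforcing the policy above: prompt the query side
# identically at training and inference time, leave documents raw.
PT_QA_INSTRUCTION = (
    "Given a Portuguese question, retrieve relevant Portuguese passages "
    "that answer the question"
)

def build_query(question: str, task: str = PT_QA_INSTRUCTION) -> str:
    """Format a query in the Instruct/Query layout shown on the model card."""
    return f"Instruct: {task}\nQuery: {question}"

def build_document(passage: str) -> str:
    """Documents stay raw: no instruction prefix."""
    return passage
```

Using the same two functions in the training data builder and the inference service removes the mismatch by construction.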

Bad pattern

training query:
Qual é o prazo para interpor recurso administrativo?

inference query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: Qual é o prazo para interpor recurso administrativo?

This fine-tunes the model on raw questions but deploys it on prompted questions. For Harrier, that is probably the wrong distribution.

Better pattern

training query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: Qual é o prazo para interpor recurso administrativo?

inference query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: Qual é o prazo para interpor recurso administrativo?

Documents should remain raw in both training and inference:

training document:
O prazo para interposição de recurso administrativo é de 10 dias úteis...

indexed document:
O prazo para interposição de recurso administrativo é de 10 dias úteis...

2. How to apply query-only prompts in SentenceTransformerTrainer

If your dataset columns are named query and document, use column-specific prompts:

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer(
    "microsoft/harrier-oss-v1-270m",
    model_kwargs={"dtype": "auto"},
)

query_prompt = (
    "Instruct: Given a Portuguese question, "
    "retrieve relevant Portuguese passages that answer the question\n"
    "Query: "
)

args = SentenceTransformerTrainingArguments(
    output_dir="harrier-270m-pt-qa-mnrl",

    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # effective batch size 128 on 1 GPU

    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.10,
    lr_scheduler_type="cosine",

    bf16=True,
    gradient_checkpointing=True,

    batch_sampler=BatchSamplers.NO_DUPLICATES,

    prompts={
        "query": query_prompt,
        "document": "",
    },

    logging_steps=50,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
)

loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # columns: query, document
    loss=loss,
)

trainer.train()
trainer.save_model("harrier-270m-pt-qa-mnrl/final")

If your dataset columns are named anchor and positive, change only the prompt mapping:

prompts={
    "anchor": query_prompt,
    "positive": "",
}

The important rule is simple:

query-like column:    prompt
document-like column: no prompt
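That rule can be expressed as a small helper that derives the prompts mapping from whatever the dataset columns happen to be called. A sketch under the assumption that query-like columns are named query, anchor, or question; the helper name is mine, not a Sentence Transformers API:

```python
# Sketch: map query-like columns to the instruction prompt and
# document-like columns to no prompt. The set of column names is
# an assumption; adjust it to your dataset.
QUERY_PROMPT = (
    "Instruct: Given a Portuguese question, "
    "retrieve relevant Portuguese passages that answer the question\n"
    "Query: "
)

QUERY_LIKE = {"query", "anchor", "question"}

def make_prompts(columns):
    """Return a prompts mapping; an empty string means no prompt."""
    return {c: (QUERY_PROMPT if c in QUERY_LIKE else "") for c in columns}
```

The result plugs directly into `SentenceTransformerTrainingArguments(prompts=make_prompts(train_dataset.column_names), ...)`.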

3. Should the prompt be in English or Portuguese?

I would start with the English instruction format because it matches Harrier’s public prompt style:

Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query:

The query and document texts themselves should remain Portuguese.

After you have a baseline, test a Portuguese instruction as an ablation:

Instruct: Dada uma pergunta em português, recupere passagens em português relevantes que respondam à pergunta
Query:

I would not start by translating the structural markers to Instrução: and Consulta:. Keep Instruct: and Query: first, because that matches the Harrier format shown in the model card and config_sentence_transformers.json.

Recommended first prompt:

query_prompt = (
    "Instruct: Given a Portuguese question, "
    "retrieve relevant Portuguese passages that answer the question\n"
    "Query: "
)

4. Is MultipleNegativesRankingLoss appropriate?

Yes. For QA retrieval, your data usually has the form:

query:    Portuguese question
positive: passage that answers the question

That is a natural fit for MultipleNegativesRankingLoss, which is designed for positive pairs such as (query, response) or (anchor, positive).

A basic version is:

loss = losses.MultipleNegativesRankingLoss(model)

Note that in current Sentence Transformers releases the loss takes no direction argument: the first dataset column is always treated as the anchor/query and ranked against the in-batch documents, so the plain form above is already query-to-document. The main tunable is the similarity scale (temperature):

loss = losses.MultipleNegativesRankingLoss(
    model,
    scale=20.0,  # default
)

This trains the model so that each query is closer to its matching passage than to other passages in the batch.
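The mechanics are worth seeing once. The following is a toy re-implementation of in-batch MNRL on plain Python lists, not the library code; it assumes L2-normalized embeddings so the dot product is the cosine similarity:

```python
import math

def mnrl_loss(query_embs, doc_embs, scale=20.0):
    """Toy in-batch MultipleNegativesRankingLoss: each query's matching
    document (same index) is the positive; every other document in the
    batch is a negative. Returns mean cross-entropy over the batch."""
    losses = []
    for i, q in enumerate(query_embs):
        # Scaled cosine similarity of this query to every in-batch document.
        sims = [scale * sum(a * b for a, b in zip(q, d)) for d in doc_embs]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        losses.append(log_denom - sims[i])  # -log softmax of the positive
    return sum(losses) / len(losses)
```

With two orthogonal unit queries, matched documents give a near-zero loss while swapped (mismatched) documents give a large one, which is exactly the signal that pulls each query toward its own passage.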


5. Main MNRL caveat: false negatives

MNRL uses other positives in the batch as negatives. That is efficient, but it can be harmful if some of those “negatives” are actually relevant.

Example:

query:
Como solicitar a segunda via da fatura?

positive A:
A segunda via da fatura pode ser solicitada no portal do cliente.

positive B for another query:
Para emitir uma cópia da fatura, acesse Minha Conta e clique em Segunda Via.

For the first query, positive B is not really negative. It probably answers the same question. If it appears in the same batch, MNRL may incorrectly push it away.

This is common in QA retrieval, FAQ retrieval, legal retrieval, policy retrieval, support retrieval, and any corpus with repeated answer templates.

Use:

batch_sampler=BatchSamplers.NO_DUPLICATES

The Sentence Transformers training overview specifically notes that losses using in-batch negatives benefit from no duplicate samples in a batch. The loss docs also discuss cached / larger-batch variants of MNRL.

Also deduplicate aggressively before training:

  • exact duplicate passages;
  • near-duplicate chunks;
  • boilerplate-heavy passages;
  • repeated FAQ answers;
  • multiple chunks from the same source document;
  • multiple positives that answer the same query.
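For the exact-duplicate part of that list, a first pass can be as simple as the sketch below; the normalization chosen here (NFC + casefold + whitespace collapse) is a minimal assumption, and near-duplicate detection (e.g. MinHash or embedding similarity) is a separate, later step:

```python
import unicodedata

def dedup_pairs(pairs):
    """Drop (query, document) training pairs whose normalized passage
    text was already seen, keeping the first occurrence."""
    seen = set()
    kept = []
    for query, document in pairs:
        # Normalize Unicode form, collapse whitespace, ignore case.
        key = unicodedata.normalize("NFC", " ".join(document.split())).casefold()
        if key not in seen:
            seen.add(key)
            kept.append((query, document))
    return kept
```

Run this before building the training dataset so NO_DUPLICATES has less to clean up per batch.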

6. Decoder-only Harrier vs encoder-style BGE-M3

Harrier and BGE-M3 should not be treated as interchangeable SBERT-style encoders.

Harrier-specific considerations

microsoft/harrier-oss-v1-270m is:

  • decoder-only;
  • multilingual;
  • 270M parameters;
  • 640-dimensional embeddings;
  • up to 32,768 tokens;
  • last-token pooled;
  • L2-normalized;
  • instruction-sensitive on the query side.

When used through SentenceTransformer, last-token pooling and normalization are handled automatically.

If using raw AutoModel, you must reproduce the model-card pooling behavior yourself:

import torch

def last_token_pool(last_hidden_states, attention_mask):
    # With left padding, the final position of every sequence is a real token.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, select each sequence's last non-padding position.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size), sequence_lengths]

For this use case, I would stay with SentenceTransformer unless there is a strong reason not to.
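The pooling rule itself can be sanity-checked without the model. Below is a pure-Python rendering of the right-padded branch, with toy nested lists standing in for tensors; the function name is mine:

```python
def last_token_pool_py(hidden, mask):
    """Pure-Python rendering of last-token pooling for right-padded
    batches: pick each sequence's last non-padding position."""
    pooled = []
    for states, m in zip(hidden, mask):
        last = sum(m) - 1  # index of the last real (non-padding) token
        pooled.append(states[last])
    return pooled
```

For a sequence padded to length 3 with two real tokens, this selects position 1, not position 2, which is the mistake mean- or CLS-pooling habits tend to introduce.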

BGE-M3-specific considerations

BAAI/bge-m3 is not just a dense embedding model. Its model card describes it as multi-functional, multilingual, and multi-granular:

  • dense retrieval;
  • sparse retrieval;
  • multi-vector retrieval;
  • more than 100 languages;
  • up to 8192 tokens.

This matters for a fair comparison. Do not compare:

BGE-M3 hybrid/sparse/multi-vector system
vs
Harrier dense-only system

and call that a model-only comparison.

Fairer comparisons are:

BGE-M3 dense vs Harrier dense
BGE-M3 hybrid vs Harrier dense + BM25
BGE-M3 + reranker vs Harrier + reranker

7. Recommended starting hyperparameters

For a first full-model Harrier-270M MNRL run, I would start conservatively.

Parameter Recommended first value
Base model microsoft/harrier-oss-v1-270m
Loss MultipleNegativesRankingLoss
Direction query → document (MNRL default)
Query prompt yes
Document prompt no
Learning rate 5e-6
LR candidates 3e-6, 5e-6, 1e-5
Epochs 1
Warmup ratio 0.10
Scheduler cosine
Precision bf16 if supported
Physical batch size 4–16, depending on GPU
Effective batch size 128–256
Batch sampler BatchSamplers.NO_DUPLICATES
Gradient checkpointing yes if memory-bound
Max sequence length 512 or 1024 first

I would not start with 5e-5 for full-model MNRL fine-tuning. Harrier is already a strong embedding model; the goal is adaptation, not overwriting its embedding geometry.

A useful nearby reference is mainguyen9/vietlegal-harrier-0.6b, a Harrier-family Vietnamese legal retrieval model that reports Sentence Transformers training, MNRL, hard-negative mining, LR 3e-6, batch size 256, one epoch, warmup 10%, cosine scheduler, and bf16. It is not the same model size or language, but it is a closer reference than generic BERT/SBERT defaults.


8. Should you use CachedMultipleNegativesRankingLoss?

Use it if your GPU memory prevents a useful effective batch size.

MNRL benefits from larger batches because larger batches provide more in-batch negatives. If normal MNRL is memory-bound, try:

loss = losses.CachedMultipleNegativesRankingLoss(
    model,
    mini_batch_size=32,
)

Then test effective batch sizes like:

256
512
1024

But I would not make cached MNRL the first experiment. First establish that the simple MNRL setup works.


9. Suggested experiment matrix

Run these in order.

Run  Model         Training             Query prompt  Doc prompt  LR         Effective batch  Purpose
A    BGE-M3        existing fine-tune   current       current     current    current          incumbent baseline
B    Harrier-270M  none (zero-shot)     yes           no          —          —                zero-shot baseline
C    Harrier-270M  MNRL                 yes           no          5e-6       128              main first run
D    Harrier-270M  MNRL                 yes           no          3e-6       128–256          lower-LR check
E    Harrier-270M  MNRL                 yes           no          1e-5       128–256          upper-LR check
F    Harrier-270M  MNRL                 no            no          5e-6       128              prompt ablation
G    Harrier-270M  Cached MNRL          yes           no          5e-6       256–1024         batch-size check
H    Harrier-270M  hard-negative stage  yes           no          3e-6–5e-6  task-dependent   ranking refinement

The most important comparison is:

fine-tuned BGE-M3
vs
zero-shot Harrier with query instruction
vs
fine-tuned Harrier with query instruction
vs
fine-tuned Harrier without query instruction

Leaderboard scores are useful for model shortlisting, but the final decision should be based on your own Portuguese QA retrieval benchmark.

For broader benchmark context, see MTEB, MMTEB, and the original MTEB paper. MTEB-style scores are useful, but they do not replace task-specific evaluation.


10. Evaluation metrics

Use the same evaluation pipeline for BGE-M3 and Harrier.

Minimum retrieval metrics:

nDCG@10
MRR@10
Recall@5
Recall@10
Recall@50
Recall@100

Why these metrics matter:

Metric What it tells you
Recall@50 / Recall@100 Whether the retriever can put the right passage somewhere in the candidate pool
Recall@5 / Recall@10 Whether the retriever is good enough for direct RAG context selection
MRR@10 Whether the first relevant passage appears early
nDCG@10 Ranking quality when there are multiple relevant passages
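The two simplest of these metrics are a few lines each. A sketch for a single query, assuming `ranked_ids` is the retriever's ranked output and `relevant_ids` the labeled relevant set; average over queries for the reported numbers:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k,
    or 0.0 if none appears there."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Computing these yourself (or via a library such as `ranx` or the Sentence Transformers InformationRetrievalEvaluator) matters less than computing them identically for BGE-M3 and Harrier.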

Also track operational metrics:

embedding throughput
query latency
index size
GPU memory
embedding dimension
chunk length
max sequence length

Portuguese-specific external retrieval benchmarks, where available, are also useful as sanity checks alongside your own evaluation set.


11. Hard negatives: useful, but second stage

Do not start with hard negatives. Start with clean query-positive MNRL.

After the first baseline is stable:

1. Embed the full corpus.
2. Retrieve top 100 candidates per training query.
3. Remove known positives.
4. Skip the top few candidates if they may be unlabeled positives.
5. Sample negatives from ranks 20–100 or 50–100.
6. Train a second stage with explicit negatives or a hard-negative-aware setup.

The reason to avoid the top retrieved “negative” is that it may actually be a valid answer that was not labeled.
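Steps 2–5 above can be sketched as a single sampling function. The rank band defaults (20–100) come from the text; the function name and signature are mine:

```python
import random

def sample_hard_negatives(ranked_ids, positive_ids, lo=20, hi=100, n=5, seed=0):
    """Sample hard negatives from the lo..hi rank band of a retrieval run,
    excluding known positives. Skipping ranks above `lo` avoids the top
    candidates that may be unlabeled positives. Ranks are 1-based."""
    band = [d for d in ranked_ids[lo - 1:hi] if d not in positive_ids]
    rng = random.Random(seed)  # fixed seed for reproducible mining
    return rng.sample(band, min(n, len(band)))
```

The sampled IDs then become the explicit negatives column for the second-stage training dataset.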

The SkillRet paper is a useful related reference. It fine-tunes decoder-style embedding models using MultipleNegativesRankingLoss, applies the same task-specific query instruction to anchor queries during training, uses no document prompt for Harrier/Qwen-style embedding models, and mines hard negatives for the reranker stage. It also reports that fine-tuning Harrier-OSS-0.6B and Qwen3-Embedding-0.6B gives nearly identical performance in that task, suggesting that the training recipe matters at least as much as the exact decoder-embedding base.


12. Common pitfalls

Pitfall 1: double prompting

Bad:

dataset query already contains:
Instruct: ...
Query: ...

and TrainingArguments also uses:
prompts={"query": "Instruct: ...\nQuery: "}

This produces:

Instruct: ...
Query: Instruct: ...
Query: <question>

Use one method:

Either store raw queries and use prompts=...
or store prompted queries and do not use prompts=...

I recommend storing raw queries and using prompts=....
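A cheap guard catches this pitfall before any GPU time is spent. A sketch; the function name is mine, and it assumes the stored queries should be raw when `prompts=...` is used:

```python
def assert_unprompted(texts):
    """Guard against double prompting: raise if stored training queries
    already carry the Instruct:/Query: prefix that prompts=... will add."""
    for t in texts:
        if t.lstrip().startswith("Instruct:"):
            raise ValueError(f"query already prompted: {t[:40]!r}")
    return True
```

Run it over the query column once when loading the dataset, before constructing the trainer.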


Pitfall 2: prompting documents

Bad:

document:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: <passage>

For Harrier retrieval, documents should be raw passages.


Pitfall 3: train/inference mismatch

Bad:

training query:  raw question
inference query: prompted question

Better:

training query:  prompted question
inference query: prompted question

Pitfall 4: comparing systems unfairly

Bad:

BGE-M3 hybrid vs Harrier dense-only

Better:

BGE-M3 dense vs Harrier dense
BGE-M3 hybrid vs Harrier dense + BM25
BGE-M3 + reranker vs Harrier + reranker

Pitfall 5: starting with too much context

Harrier supports long context, but that does not mean a first fine-tune should use 32k tokens.

Start with:

512 or 1024 tokens

Then test:

2048
4096
8192

only if your evaluation set shows that longer passages help.

In retrieval, better chunking is often more useful than simply increasing max length.


13. Inference after fine-tuning

Use the same query prompt and raw documents:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("harrier-270m-pt-qa-mnrl/final")

query_prompt = (
    "Instruct: Given a Portuguese question, "
    "retrieve relevant Portuguese passages that answer the question\n"
    "Query: "
)

queries = [
    "Qual é o prazo para interpor recurso administrativo?",
]

documents = [
    "O prazo para interposição de recurso administrativo é de 10 dias úteis...",
    "A segunda via da fatura pode ser solicitada no portal do cliente...",
]

query_embeddings = model.encode(
    queries,
    prompt=query_prompt,
    normalize_embeddings=True,
)

document_embeddings = model.encode(
    documents,
    normalize_embeddings=True,
)

scores = query_embeddings @ document_embeddings.T
print(scores)

Avoid passing both prompt and prompt_name unless you intentionally want one to override the other. A related Qwen3 Embedding discussion notes that explicit prompt takes priority over prompt_name in Sentence Transformers-style usage.


14. Bottom-line recommendation

For this Portuguese QA retrieval use case, I would proceed like this:

  1. Keep your fine-tuned BGE-M3 model as the incumbent baseline.
  2. Evaluate Harrier-270M zero-shot with the correct query instruction and raw documents.
  3. Fine-tune Harrier with MNRL using query-side instruction during training.
  4. Do not prompt documents.
  5. Start with lr=5e-6, one epoch, warmup 0.10, cosine scheduler, bf16, effective batch size around 128.
  6. Run LR ablations at 3e-6, 5e-6, and 1e-5.
  7. Use BatchSamplers.NO_DUPLICATES.
  8. Deduplicate query/document pairs aggressively.
  9. Try CachedMultipleNegativesRankingLoss if memory prevents larger effective batches.
  10. Add hard negatives only after the clean first-stage baseline works.
  11. Compare systems fairly: dense vs dense, hybrid vs hybrid, reranked vs reranked.
  12. Decide based on your own held-out Portuguese QA retrieval set, not only Multilingual MTEB v2.

Final concise answer

  • Supported? Yes, practically. Harrier-270M is a Sentence Transformers model and should work with SentenceTransformerTrainer.
  • Exact public recipe? I have not found an exact Harrier-270M + STTrainer + MNRL + Portuguese QA recipe.
  • Instruction during training? Yes. Apply it to the query/anchor side during training and inference.
  • Documents? Keep documents/passages unprompted.
  • Loss? MultipleNegativesRankingLoss is appropriate for (query, positive passage) pairs.
  • Main risks? Prompt mismatch, false negatives, duplicates, too-high LR, too-small effective batch, and incorrect pooling if using raw AutoModel.
  • Starting hyperparameters? lr=3e-6 to 1e-5, one epoch, warmup 0.10, cosine scheduler, bf16, effective batch 128–256, BatchSamplers.NO_DUPLICATES.
  • Best next experiment? Harrier zero-shot prompted vs Harrier MNRL prompted vs Harrier no-prompt ablation vs your fine-tuned BGE-M3 baseline.