Fine-tuning microsoft/harrier-oss-v1-270m with SentenceTransformerTrainer: is this practically supported?
Yes — this should be a supported and reasonable setup, with one important caveat: I have not found a public example that exactly combines:
microsoft/harrier-oss-v1-270m
+ SentenceTransformerTrainer
+ MultipleNegativesRankingLoss
+ Portuguese QA retrieval
However, the evidence strongly points to this being a valid path:
- microsoft/harrier-oss-v1-270m is packaged as a sentence-transformers model and can be loaded with SentenceTransformer("microsoft/harrier-oss-v1-270m").
  - The model card explicitly shows Sentence Transformers usage and encodes queries with a prompt while encoding documents without a prompt.
  - The model card says Harrier uses a decoder-only architecture, last-token pooling, and L2-normalized embeddings.
  - The model card also says query instructions are how the model is trained and that omitting them can degrade performance; document-side instructions are not needed.
- SentenceTransformerTrainingArguments supports training-time prompts, including column-specific prompt mappings.
- MultipleNegativesRankingLoss is the standard Sentence Transformers loss for positive (query, document) / (anchor, positive) retrieval pairs.
- Nearby public examples exist, especially a Harrier-family Vietnamese legal retrieval model using Sentence Transformers + MNRL, and SkillRet-style decoder-embedding fine-tuning using query instructions and unprompted documents.
My recommendation is:
- training query / anchor: Instruct: ...\nQuery: <Portuguese question>
- training document / positive: <raw Portuguese passage>
- inference query: Instruct: ...\nQuery: <Portuguese question>
- inference document: <raw Portuguese passage>
So: apply the instruction prefix to the query/anchor side during training, not only at inference. Keep documents/passages unprompted.
1. Should the instruction be applied during training?
Yes. For Harrier, the instruction is not just an inference-time decoration. It is part of the expected query-side input format.
The Harrier-270M model card shows this Sentence Transformers pattern:
query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)
The same model card also shows the raw Transformers pattern:
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

# Each query must come with a one-sentence instruction that describes the task.
# No need to add instruction for retrieval documents.
The FAQ is especially relevant: it says query instructions are how the model is trained, omitting them causes degradation, and document-side instructions are not needed.
So for Portuguese QA retrieval, I would train with:
query / anchor:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: <question>
document / positive:
<passage>
and infer with exactly the same policy:
query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: <question>
document:
<passage>
This avoids a train/inference mismatch.
Bad pattern
training query:
Qual é o prazo para interpor recurso administrativo?
inference query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: Qual é o prazo para interpor recurso administrativo?
This fine-tunes the model on raw questions but deploys it on prompted questions. For Harrier, that is probably the wrong distribution.
Better pattern
training query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: Qual é o prazo para interpor recurso administrativo?
inference query:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: Qual é o prazo para interpor recurso administrativo?
Documents should remain raw in both training and inference:
training document:
O prazo para interposição de recurso administrativo é de 10 dias úteis...
indexed document:
O prazo para interposição de recurso administrativo é de 10 dias úteis...
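For reference, Sentence Transformers simply prepends the prompt string to the input text, so the prompted query the model actually encodes is the plain concatenation. A minimal illustration (not library internals):

query_prompt = (
    "Instruct: Given a Portuguese question, "
    "retrieve relevant Portuguese passages that answer the question\n"
    "Query: "
)
question = "Qual é o prazo para interpor recurso administrativo?"

# What the model sees when the prompt is applied, in training or at inference:
full_input = query_prompt + question
# -> "Instruct: Given a Portuguese question, retrieve relevant Portuguese passages
#     that answer the question\nQuery: Qual é o prazo para interpor recurso administrativo?"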
2. How to apply query-only prompts in SentenceTransformerTrainer
If your dataset columns are named query and document, use column-specific prompts:
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer(
    "microsoft/harrier-oss-v1-270m",
    model_kwargs={"dtype": "auto"},
)

query_prompt = (
    "Instruct: Given a Portuguese question, "
    "retrieve relevant Portuguese passages that answer the question\n"
    "Query: "
)

args = SentenceTransformerTrainingArguments(
    output_dir="harrier-270m-pt-qa-mnrl",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # optimizer-level batch 128 on 1 GPU; in-batch negatives still span only the 8 per-device pairs
    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.10,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    prompts={
        "query": query_prompt,  # prepended to the "query" column only
        "document": "",         # documents stay raw
    },
    logging_steps=50,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
)

# MNRL treats the first column as the query/anchor and the second as its positive,
# so it already trains in the query -> document direction.
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # columns: query, document
    loss=loss,
)
trainer.train()
trainer.save_model("harrier-270m-pt-qa-mnrl/final")
If your dataset columns are named anchor and positive, change only the prompt mapping:
prompts={
"anchor": query_prompt,
"positive": "",
}
The important rule is simple:
query-like column: prompt
document-like column: no prompt
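For completeness, here is a minimal sketch of building such a train_dataset from in-memory lists with the datasets library; the column names query and document and the example pairs are placeholders matching the prompt mapping above:

from datasets import Dataset

# Hypothetical Portuguese QA pairs; replace with your own data loading.
train_dataset = Dataset.from_dict({
    "query": [
        "Qual é o prazo para interpor recurso administrativo?",
        "Como solicitar a segunda via da fatura?",
    ],
    "document": [
        "O prazo para interposição de recurso administrativo é de 10 dias úteis...",
        "A segunda via da fatura pode ser solicitada no portal do cliente.",
    ],
})
# Column order matters for MultipleNegativesRankingLoss:
# the first column is treated as the anchor/query, the second as the positive.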
3. Should the prompt be in English or Portuguese?
I would start with the English instruction format because it matches Harrier’s public prompt style:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query:
The query and document texts themselves should remain Portuguese.
After you have a baseline, test a Portuguese instruction as an ablation:
Instruct: Dada uma pergunta em português, recupere passagens em português relevantes que respondam à pergunta
Query:
I would not start by translating the structural markers to Instrução: and Consulta:. Keep Instruct: and Query: first, because that matches the Harrier format shown in the model card and config_sentence_transformers.json.
Recommended first prompt:
query_prompt = (
"Instruct: Given a Portuguese question, "
"retrieve relevant Portuguese passages that answer the question\n"
"Query: "
)
4. Is MultipleNegativesRankingLoss appropriate?
Yes. For QA retrieval, your data usually has the form:
query: Portuguese question
positive: passage that answers the question
That is a natural fit for MultipleNegativesRankingLoss, which is designed for positive pairs such as (query, response) or (anchor, positive).
A basic version is:
loss = losses.MultipleNegativesRankingLoss(model)
Note that MultipleNegativesRankingLoss already works in the query-to-document direction: the first dataset column is treated as the query/anchor, the second as its positive, and the other in-batch positives serve as negatives. (A symmetric variant, MultipleNegativesSymmetricRankingLoss, also trains the document-to-query direction, but it is not needed here.)
This trains the model so that each query is closer to its matching passage than to other passages in the batch.
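To make the mechanics concrete, this is roughly what MNRL computes for a batch of (query, positive) pairs; a conceptual sketch, not the library's implementation (the default is cosine similarity with scale 20):

import torch
import torch.nn.functional as F

q = F.normalize(torch.randn(4, 640), dim=-1)  # 4 query embeddings (640-dim, as for Harrier)
d = F.normalize(torch.randn(4, 640), dim=-1)  # their 4 positive passage embeddings

scores = (q @ d.T) * 20.0               # cosine similarity matrix, scaled
labels = torch.arange(4)                # each query's own positive sits on the diagonal
loss = F.cross_entropy(scores, labels)  # all other in-batch positives act as negatives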
5. Main MNRL caveat: false negatives
MNRL uses other positives in the batch as negatives. That is efficient, but it can be harmful if some of those “negatives” are actually relevant.
Example:
query:
Como solicitar a segunda via da fatura?
positive A:
A segunda via da fatura pode ser solicitada no portal do cliente.
positive B for another query:
Para emitir uma cópia da fatura, acesse Minha Conta e clique em Segunda Via.
For the first query, positive B is not really negative. It probably answers the same question. If it appears in the same batch, MNRL may incorrectly push it away.
This is common in QA retrieval, FAQ retrieval, legal retrieval, policy retrieval, support retrieval, and any corpus with repeated answer templates.
Use:
batch_sampler=BatchSamplers.NO_DUPLICATES
The Sentence Transformers training overview specifically notes that losses using in-batch negatives benefit from no duplicate samples in a batch. The loss docs also discuss cached / larger-batch variants of MNRL.
Also deduplicate aggressively before training:
- exact duplicate passages;
- near-duplicate chunks;
- boilerplate-heavy passages;
- repeated FAQ answers;
- multiple chunks from the same source document;
- multiple positives that answer the same query.
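A minimal deduplication sketch along those lines, assuming the training data sits in a plain Python list named pairs of (query, document) tuples; the normalization rule is illustrative only:

import re

def normalize(text: str) -> str:
    # Crude normalization for duplicate detection; adjust to your corpus.
    return re.sub(r"\s+", " ", text.strip().lower())

seen_docs, seen_queries, clean_pairs = set(), set(), []
for query, document in pairs:
    doc_key, query_key = normalize(document), normalize(query)
    # Drop exact/near-duplicate passages and repeated positives for the same query.
    if doc_key in seen_docs or query_key in seen_queries:
        continue
    seen_docs.add(doc_key)
    seen_queries.add(query_key)
    clean_pairs.append((query, document))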
6. Decoder-only Harrier vs encoder-style BGE-M3
Harrier and BGE-M3 should not be treated as interchangeable SBERT-style encoders.
Harrier-specific considerations
microsoft/harrier-oss-v1-270m is:
- decoder-only;
- multilingual;
- 270M parameters;
- 640-dimensional embeddings;
- up to 32,768 tokens;
- last-token pooled;
- L2-normalized;
- instruction-sensitive on the query side.
When used through SentenceTransformer, last-token pooling and normalization are handled automatically.
If using raw AutoModel, you must reproduce the model-card pooling behavior yourself:
import torch

def last_token_pool(last_hidden_states, attention_mask):
    # With left padding, the last position holds the final real token for every row.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    # Otherwise, index each sequence's last non-padding token.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
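If you do take the raw-Transformers route, the surrounding usage would look roughly like this; a hedged sketch, since the exact model-card code (tokenizer padding side, max length) may differ:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/harrier-oss-v1-270m")
hf_model = AutoModel.from_pretrained("microsoft/harrier-oss-v1-270m")  # raw backbone, no ST wrapper

texts = ["Instruct: ...\nQuery: Qual é o prazo para interpor recurso administrativo?"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = hf_model(**batch)
embeddings = last_token_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # L2-normalize, as the model card describes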
For this use case, I would stay with SentenceTransformer unless there is a strong reason not to.
BGE-M3-specific considerations
BAAI/bge-m3 is not just a dense embedding model. Its model card describes it as multi-functional, multilingual, and multi-granular:
- dense retrieval;
- sparse retrieval;
- multi-vector retrieval;
- more than 100 languages;
- up to 8192 tokens.
This matters for a fair comparison. Do not compare a BGE-M3 hybrid/sparse/multi-vector system against a Harrier dense-only system and call that a model-only comparison.
Fairer comparisons are:
BGE-M3 dense vs Harrier dense
BGE-M3 hybrid vs Harrier dense + BM25
BGE-M3 + reranker vs Harrier + reranker
7. Recommended starting hyperparameters
For a first full-model Harrier-270M MNRL run, I would start conservatively.
| Parameter | Recommended first value |
| --- | --- |
| Base model | microsoft/harrier-oss-v1-270m |
| Loss | MultipleNegativesRankingLoss |
| Direction | query → document (MNRL default) |
| Query prompt | yes |
| Document prompt | no |
| Learning rate | 5e-6 |
| LR candidates | 3e-6, 5e-6, 1e-5 |
| Epochs | 1 |
| Warmup ratio | 0.10 |
| Scheduler | cosine |
| Precision | bf16 if supported |
| Physical batch size | 4–16, depending on GPU |
| Effective batch size | 128–256 |
| Batch sampler | BatchSamplers.NO_DUPLICATES |
| Gradient checkpointing | yes if memory-bound |
| Max sequence length | 512 or 1024 first |
I would not start with 5e-5 for full-model MNRL fine-tuning. Harrier is already a strong embedding model; the goal is adaptation, not overwriting its embedding geometry.
A useful nearby reference is mainguyen9/vietlegal-harrier-0.6b, a Harrier-family Vietnamese legal retrieval model that reports Sentence Transformers training, MNRL, hard-negative mining, LR 3e-6, batch size 256, one epoch, warmup 10%, cosine scheduler, and bf16. It is not the same model size or language, but it is a closer reference than generic BERT/SBERT defaults.
8. Should you use CachedMultipleNegativesRankingLoss?
Use it if your GPU memory prevents a useful effective batch size.
MNRL benefits from larger batches because larger batches provide more in-batch negatives. If normal MNRL is memory-bound, try:
loss = losses.CachedMultipleNegativesRankingLoss(
model,
mini_batch_size=32,
)
Then test effective batch sizes like 256, 512, and 1024.
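The point of the cached variant is that the in-batch negatives come from the full per-device batch while only mini_batch_size examples sit in GPU memory at once, so you raise per_device_train_batch_size directly rather than relying on gradient accumulation. A sketch, reusing the imports and query_prompt from the training script above:

loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="harrier-270m-pt-qa-mnrl-cached",
    per_device_train_batch_size=256,  # all 256 pairs share one in-batch negative pool
    gradient_accumulation_steps=1,    # accumulation would NOT add in-batch negatives
    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.10,
    lr_scheduler_type="cosine",
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    prompts={"query": query_prompt, "document": ""},
)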
But I would not make cached MNRL the first experiment. First establish that the simple MNRL setup works.
9. Suggested experiment matrix
Run these in order.
| Run | Model | Training | Query prompt | Doc prompt | LR | Effective batch | Purpose |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | BGE-M3 | existing fine-tune | current | current | current | current | incumbent baseline |
| B | Harrier-270M | none | yes | no | — | — | zero-shot baseline |
| C | Harrier-270M | MNRL | yes | no | 5e-6 | 128 | main first run |
| D | Harrier-270M | MNRL | yes | no | 3e-6 | 128–256 | lower-LR check |
| E | Harrier-270M | MNRL | yes | no | 1e-5 | 128–256 | upper-LR check |
| F | Harrier-270M | MNRL | no | no | 5e-6 | 128 | prompt ablation |
| G | Harrier-270M | Cached MNRL | yes | no | 5e-6 | 256–1024 | batch-size check |
| H | Harrier-270M | hard-negative stage | yes | no | 3e-6–5e-6 | task-dependent | ranking refinement |
The most important comparison is:
- fine-tuned BGE-M3
- zero-shot Harrier with query instruction
- fine-tuned Harrier with query instruction
- fine-tuned Harrier without query instruction
Leaderboard scores are useful for model shortlisting, but the final decision should be based on your own Portuguese QA retrieval benchmark.
For broader benchmark context, see MTEB, MMTEB, and the original MTEB paper. MTEB-style scores are useful, but they do not replace task-specific evaluation.
10. Evaluation metrics
Use the same evaluation pipeline for BGE-M3 and Harrier.
Minimum retrieval metrics: nDCG@10, MRR@10, Recall@5, Recall@10, Recall@50, and Recall@100.
Why these metrics matter:
| Metric | What it tells you |
| --- | --- |
| Recall@50 / Recall@100 | Whether the retriever can put the right passage somewhere in the candidate pool |
| Recall@5 / Recall@10 | Whether the retriever is good enough for direct RAG context selection |
| MRR@10 | Whether the first relevant passage appears early |
| nDCG@10 | Ranking quality when there are multiple relevant passages |
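One way to compute these metrics consistently for both models is the InformationRetrievalEvaluator from Sentence Transformers. A minimal sketch, with the query instruction prepended manually so the evaluator sees exactly the deployed query format; the IDs, texts, and k values are illustrative, and model / query_prompt are the objects defined earlier:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# queries / corpus: id -> text; relevant_docs: query id -> set of relevant corpus ids
queries = {"q1": query_prompt + "Qual é o prazo para interpor recurso administrativo?"}
corpus = {"d1": "O prazo para interposição de recurso administrativo é de 10 dias úteis..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    mrr_at_k=[10],
    precision_recall_at_k=[5, 10, 50, 100],
    name="pt-qa-dev",
)
results = evaluator(model)  # dict of metric name -> score in recent versions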
Also track operational metrics: embedding throughput, query latency, index size, GPU memory, embedding dimension, chunk length, and max sequence length.
For Portuguese-specific external sanity checks, it is also worth running the same metric suite on a public Portuguese or multilingual QA retrieval dataset in addition to your internal benchmark.
11. Hard negatives: useful, but second stage
Do not start with hard negatives. Start with clean query-positive MNRL.
After the first baseline is stable:
1. Embed the full corpus.
2. Retrieve top 100 candidates per training query.
3. Remove known positives.
4. Skip the top few candidates if they may be unlabeled positives.
5. Sample negatives from ranks 20–100 or 50–100.
6. Train a second stage with explicit negatives or a hard-negative-aware setup.
The reason to avoid the top retrieved “negative” is that it may actually be a valid answer that was not labeled.
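A minimal mining sketch following those steps, using util.semantic_search from Sentence Transformers; corpus_texts, train_queries, and positives_per_query are placeholder names for your own data, and model / query_prompt are the objects defined earlier:

import random
from sentence_transformers import util

corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True, convert_to_tensor=True)
query_embeddings = model.encode(
    [query_prompt + q for q in train_queries],  # same query prompt as training/inference
    normalize_embeddings=True,
    convert_to_tensor=True,
)

hits_per_query = util.semantic_search(query_embeddings, corpus_embeddings, top_k=100)

hard_negatives = []
for query, hits in zip(train_queries, hits_per_query):
    candidates = [
        corpus_texts[hit["corpus_id"]]
        for rank, hit in enumerate(hits)
        # Skip ranks 0-19: they may be unlabeled positives; exclude known positives everywhere.
        if rank >= 20 and corpus_texts[hit["corpus_id"]] not in positives_per_query[query]
    ]
    hard_negatives.append(random.sample(candidates, k=min(5, len(candidates))))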
The SkillRet paper is a useful related reference. It fine-tunes decoder-style embedding models using MultipleNegativesRankingLoss, applies the same task-specific query instruction to anchor queries during training, uses no document prompt for Harrier/Qwen-style embedding models, and mines hard negatives for the reranker stage. It also reports that fine-tuning Harrier-OSS-0.6B and Qwen3-Embedding-0.6B gives nearly identical performance in that task, suggesting that the training recipe matters at least as much as the exact decoder-embedding base.
12. Common pitfalls
Pitfall 1: double prompting
Bad:
dataset query already contains:
Instruct: ...
Query: ...
and TrainingArguments also uses:
prompts={"query": "Instruct: ...\nQuery: "}
This produces:
Instruct: ...
Query: Instruct: ...
Query: <question>
Use one method:
Either store raw queries and use prompts=...
or store prompted queries and do not use prompts=...
I recommend storing raw queries and using prompts=....
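A cheap guard against double prompting is to check that the stored queries are still raw before handing the dataset to the trainer; a sketch assuming the column is named query:

# Fail fast if the instruction was already baked into the stored queries.
assert not any(
    q.startswith("Instruct:") for q in train_dataset["query"]
), "Queries already contain the instruction; either drop the prompts= mapping or store raw queries."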
Pitfall 2: prompting documents
Bad:
document:
Instruct: Given a Portuguese question, retrieve relevant Portuguese passages that answer the question
Query: <passage>
For Harrier retrieval, documents should be raw passages.
Pitfall 3: train/inference mismatch
Bad:
training query: raw question
inference query: prompted question
Better:
training query: prompted question
inference query: prompted question
Pitfall 4: comparing systems unfairly
Bad:
BGE-M3 hybrid vs Harrier dense-only
Better:
BGE-M3 dense vs Harrier dense
BGE-M3 hybrid vs Harrier dense + BM25
BGE-M3 + reranker vs Harrier + reranker
Pitfall 5: starting with too much context
Harrier supports long context, but that does not mean a first fine-tune should use 32k tokens.
Start with:
512 or 1024 tokens
Then test:
2048
4096
8192
only if your evaluation set shows that longer passages help.
In retrieval, better chunking is often more useful than simply increasing max length.
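Limiting the sequence length is a one-line change on the SentenceTransformer object before training; truncation then happens at tokenization time:

model.max_seq_length = 1024  # start at 512 or 1024; raise only if evaluation shows longer inputs help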
13. Inference after fine-tuning
Use the same query prompt and raw documents:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("harrier-270m-pt-qa-mnrl/final")
query_prompt = (
"Instruct: Given a Portuguese question, "
"retrieve relevant Portuguese passages that answer the question\n"
"Query: "
)
queries = [
"Qual é o prazo para interpor recurso administrativo?",
]
documents = [
"O prazo para interposição de recurso administrativo é de 10 dias úteis...",
"A segunda via da fatura pode ser solicitada no portal do cliente...",
]
query_embeddings = model.encode(
queries,
prompt=query_prompt,
normalize_embeddings=True,
)
document_embeddings = model.encode(
documents,
normalize_embeddings=True,
)
scores = query_embeddings @ document_embeddings.T
print(scores)
Avoid passing both prompt and prompt_name unless you intentionally want one to override the other. A related Qwen3 Embedding discussion notes that explicit prompt takes priority over prompt_name in Sentence Transformers-style usage.
14. Bottom-line recommendation
For this Portuguese QA retrieval use case, I would proceed like this:
- Keep your fine-tuned BGE-M3 model as the incumbent baseline.
- Evaluate Harrier-270M zero-shot with the correct query instruction and raw documents.
- Fine-tune Harrier with MNRL using query-side instruction during training.
- Do not prompt documents.
- Start with lr=5e-6, one epoch, warmup 0.10, cosine scheduler, bf16, effective batch size around 128.
- Run LR ablations at 3e-6, 5e-6, and 1e-5.
- Use BatchSamplers.NO_DUPLICATES.
- Deduplicate query/document pairs aggressively.
- Try CachedMultipleNegativesRankingLoss if memory prevents larger effective batches.
- Add hard negatives only after the clean first-stage baseline works.
- Compare systems fairly: dense vs dense, hybrid vs hybrid, reranked vs reranked.
- Decide based on your own held-out Portuguese QA retrieval set, not only Multilingual MTEB v2.
Final concise answer
- Supported? Yes, practically. Harrier-270M is a Sentence Transformers model and should work with SentenceTransformerTrainer.
- Exact public recipe? I have not found an exact Harrier-270M + STTrainer + MNRL + Portuguese QA recipe.
- Instruction during training? Yes. Apply it to the query/anchor side during training and inference.
- Documents? Keep documents/passages unprompted.
- Loss? MultipleNegativesRankingLoss is appropriate for (query, positive passage) pairs.
- Main risks? Prompt mismatch, false negatives, duplicates, too-high LR, too-small effective batch, and incorrect pooling if using raw AutoModel.
- Starting hyperparameters? lr=3e-6 to 1e-5, one epoch, warmup 0.10, cosine scheduler, bf16, effective batch 128–256, BatchSamplers.NO_DUPLICATES.
- Best next experiment? Harrier zero-shot prompted vs Harrier MNRL prompted vs Harrier no-prompt ablation vs your fine-tuned BGE-M3 baseline.