PEFT for Token Classification with Large Language Models

Hi folks. I am attempting to use a large language model (specifically Phi-3-mini) as a token classifier. The transformers library recently made this easy with its Phi3ForTokenClassification implementation. However, I am having difficulty training this model with Parameter-Efficient Fine-Tuning (PEFT), specifically LoRA.

I am creating an instance of Phi3ForTokenClassification from the pre-trained Phi-3-mini model as follows:

import torch
from transformers import Phi3ForTokenClassification

model = Phi3ForTokenClassification.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    attn_implementation="flash_attention_2",
    num_labels=len(labels_vocab),
    id2label=id2label,
    label2id=label2id,
    use_cache=False,
    torch_dtype=torch.bfloat16
)

As expected, since the head of this model is replaced with a linear layer for predicting the one-hot-encoded token labels, I get a warning that this layer has not been trained yet:

Some weights of Phi3ForTokenClassification were not initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

At this point, I am assuming that I need to fine-tune the core model layers (attention heads, MLP, etc.) and fully train the new classifier layer.

I am training on an RTX 4090 (24 GB of VRAM). As such, I need to leverage PEFT with LoRA, which I configure as follows:

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="TOKEN_CLS",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["classifier"],
    inference_mode=False
)
peft_model = get_peft_model(model, peft_config)

When checking the number of trainable parameters, I get trainable params: 18,000,953 || all params: 3,740,755,058 || trainable%: 0.4812, which seems right to me.
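
A quick way to sanity-check that number is to count the LoRA parameters by hand. The sketch below assumes Phi-3-mini's published dimensions (hidden size 3072, MLP intermediate size 8192, 32 decoder layers). It also assumes that the transformers Phi-3 implementation fuses the query/key/value projections into `qkv_proj` and the gate/up projections into `gate_up_proj`, so that only `o_proj` and `down_proj` from `target_modules` match a real module. Neither assumption is taken from the printed model in this post, so treat the result as a hint rather than a diagnosis.

```python
# Rough LoRA parameter count for Phi-3-mini under the assumptions stated above.
HIDDEN = 3072        # assumed hidden_size
INTERMEDIATE = 8192  # assumed MLP intermediate_size
LAYERS = 32          # assumed number of decoder layers
RANK = 32            # r from LoraConfig
NUM_LABELS = 57      # out_features of the classifier layer


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted Linear."""
    return r * (in_features + out_features)


# Only the two module names assumed to exist in the Phi-3 implementation.
o_proj = lora_params(HIDDEN, HIDDEN, RANK)           # attention output projection
down_proj = lora_params(INTERMEDIATE, HIDDEN, RANK)  # MLP down projection
adapters = LAYERS * (o_proj + down_proj)

# modules_to_save keeps a fully trainable copy of the classifier (weight + bias).
classifier = HIDDEN * NUM_LABELS + NUM_LABELS

total = adapters + classifier
print(f"{total:,}")  # 18,000,953
```

Under those assumptions the total matches the reported trainable count exactly, which would suggest that the `q_proj`, `k_proj`, `v_proj`, `gate_proj`, and `up_proj` entries are not matching any module. Printing the model's module names is the way to confirm this.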

Based on what I’ve read about modules_to_save, this seems like the right configuration and should result in full training of the model's classifier module. When I print the model details, this is the classifier layer: (classifier): Linear(in_features=3072, out_features=57, bias=True).

Since this is LoRA and we’re training new weights, I drafted my training configuration with a fairly aggressive learning rate, as follows:
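
One way to check that the classifier is really being trained is to list every parameter with `requires_grad` set. The helper below is a hypothetical sketch, not part of PEFT. The stub model and its parameter names are illustrative stand-ins so the snippet runs without loading Phi-3, and real PEFT parameter names may differ.

```python
from types import SimpleNamespace


def trainable_parameter_names(model):
    """Return the names of all parameters that will receive gradients."""
    return [name for name, param in model.named_parameters() if param.requires_grad]


class StubModel:
    """Stand-in for peft_model so the snippet runs without loading Phi-3."""

    def named_parameters(self):
        # Illustrative names only; real PEFT parameter names may differ.
        yield "layers.0.self_attn.o_proj.weight", SimpleNamespace(requires_grad=False)
        yield "layers.0.self_attn.o_proj.lora_A.default.weight", SimpleNamespace(requires_grad=True)
        yield "classifier.original_module.weight", SimpleNamespace(requires_grad=False)
        yield "classifier.modules_to_save.default.weight", SimpleNamespace(requires_grad=True)


names = trainable_parameter_names(StubModel())
classifier_trainable = any("classifier" in name for name in names)
print(names)
print(classifier_trainable)
```

With the real model, `trainable_parameter_names(peft_model)` should work the same way, since it relies only on `named_parameters()` and `requires_grad`. The output also shows which of the `target_modules` entries actually received LoRA weights.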

training_args = TrainingArguments(
    bf16=True,
    output_dir="outputs",
    learning_rate=(2e-4 * 4),
    gradient_accumulation_steps=4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    logging_steps=10,
    logging_strategy="steps",
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="wandb"
)
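
To put the learning rate in context, the arithmetic below spells out the effective batch size and the LoRA scaling factor implied by the two configurations. It assumes a single GPU and the standard LoRA scaling of `lora_alpha / r` (that is, `use_rslora` is not enabled); both are assumptions rather than something stated in the configuration.

```python
# Values copied from TrainingArguments and LoraConfig in this post.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # assumption: a single GPU

# Number of examples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
learning_rate = 2e-4 * 4

lora_alpha = 16
r = 32
lora_scale = lora_alpha / r  # standard LoRA scaling; assumes use_rslora is off

print(effective_batch_size)  # 16
print(learning_rate)         # 0.0008
print(lora_scale)            # 0.5
```

The adapter updates are scaled by 0.5, while the classifier in `modules_to_save` is a plain linear layer. With a single learning rate for all parameters, it is presumably trained at the full 8e-4 without that scaling.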

And I am training with:

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    # calculates precision, recall, accuracy, and f1
    compute_metrics=compute_metrics,
)

The training seems to run correctly. Every epoch, the precision, recall, accuracy, and f1 scores all seem reasonable and are improving. After the 1st epoch, my f1 score is ~0.66, improving to ~0.72 in the 2nd epoch.

Once my short training run is complete, I save the model as follows:

merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("model-name")

To test my model, I load it for inference as follows:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

token_classifier = pipeline("ner", model=model, tokenizer=tokenizer)

I get very poor results during inference. I am not completely convinced I am doing this correctly. I made a few assumptions above about how the training would work that I am not sure are correct.
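
One way to narrow this down is to check whether the save and reload step changes the predictions. This means comparing per-token predictions from the in-memory trainer.model (before merging) with predictions from the reloaded model on the same inputs. The helper below is a hypothetical sketch that works on plain lists of label ids. It assumes the labels use -100 for positions to ignore, which is the usual convention for token classification collators. The example ids are made up.

```python
def agreement_rate(preds_a, preds_b, labels, ignore_index=-100):
    """Fraction of labelled tokens on which two models predict the same class id."""
    same = 0
    total = 0
    for seq_a, seq_b, seq_labels in zip(preds_a, preds_b, labels):
        for a, b, gold in zip(seq_a, seq_b, seq_labels):
            if gold == ignore_index:
                continue  # skip padding and other ignored positions
            total += 1
            same += int(a == b)
    return same / total if total else 0.0


# Made-up example: one sequence of four tokens, the second one ignored.
before_merge = [[3, 0, 0, 5]]   # predictions from the in-memory model
after_reload = [[3, 0, 1, 5]]   # predictions from the reloaded model
gold_labels = [[3, -100, 0, 5]]

rate = agreement_rate(before_merge, after_reload, gold_labels)
print(rate)
```

If the two models disagree on a large share of tokens, the problem is more likely in merging, saving, or loading than in the training run itself.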

I rented an A6000 Ada to do a full (non-PEFT) training on the same dataset. After 2 epochs, the training had lower accuracy, precision, recall, and f1 scores, but the model performed significantly better in test inference.

Does anyone have any suggestions for how I can make this better? I am not afraid to dive deep into the material. I have a ton to learn and I’m here for it. Thanks in advance!

Does anyone have any insights? Sorry to bump this. Not sure where else to ask.

Hello, did you finally find out what was going on? I’m facing a similar issue. Cheers.