hi,
I fine-tuned BERT on an NER task, and Hugging Face adds a linear classifier on top of the model. I'd like to know more details about the classifier architecture, e.g. fully connected + softmax…
Hi! Can you be a little bit more specific about your query?
Just to give you a head start,
In general, NER is a sequence labeling (a.k.a. token classification) problem.
One additional thing you have to consider for NER: for a word that is split into multiple tokens by a BPE- or SentencePiece-style tokenizer, you use the first subword token as the reference token whose label you predict. Since all the tokens are connected via self-attention, you won't have a problem skipping the prediction for the remaining subword tokens of a word. In PyTorch, you can exclude those tokens from the loss (see the ignore_index argument of CrossEntropyLoss) by giving them the label -100 (life is so easy with PyTorch).
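A minimal sketch of that label-alignment step. Here word_ids is an assumed mapping from each token position to the index of the word it came from (None for special tokens); fast tokenizers expose a similar mapping, but the helper name and shapes below are illustrative, not a library API:

```python
import torch.nn as nn

def align_labels(word_ids, word_labels):
    """Give each word's label to its first subword; -100 everywhere else."""
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None:            # special tokens like [CLS] / [SEP]
            labels.append(-100)
        elif wid != previous:      # first subword of a new word
            labels.append(word_labels[wid])
        else:                      # continuation subword of the same word
            labels.append(-100)
        previous = wid
    return labels

# -100 is already the default ignore_index, shown explicitly here
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

# e.g. "[CLS] play ##ing ball [SEP]" with word labels [3, 5]
print(align_labels([None, 0, 0, 1, None], [3, 5]))  # [-100, 3, -100, 5, -100]
```

With this labeling, CrossEntropyLoss simply skips the -100 positions, so the continuation subwords and special tokens never contribute to the gradient.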
Apart from that, I didn't find any additional complexity in training an NER model.
Some other implementation details you need to check:
One important note: as far as I remember (please verify), the CoNLL German and Dutch datasets have 2-3 very long sentences in the test set. Sequence labeling doesn't work like sentiment analysis: you need to make sure your sentence is not cut off by the max_sequence_len argument of the language model's tokenizer, otherwise you will see a small discrepancy in your test F1 score. An easy hack for this problem is to split the sentence into smaller parts, predict them one by one, and finally merge the predictions.
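The split-and-merge hack can be sketched like this. predict_fn is a hypothetical function that returns one label per input token; the chunking below is the naive version (no overlap between chunks), which is usually fine for sequence labeling:

```python
def predict_long_sentence(tokens, predict_fn, max_len=128):
    """Chunk a long token list so nothing is truncated, predict each
    chunk independently, and concatenate the per-token labels."""
    labels = []
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        labels.extend(predict_fn(chunk))
    return labels

# toy usage with a dummy predictor that tags everything "O"
tokens = ["tok%d" % i for i in range(300)]
labels = predict_long_sentence(tokens, lambda c: ["O"] * len(c), max_len=128)
print(len(labels))  # 300 — one label per token, nothing dropped
```

In a real setup max_len should leave room for the special tokens the tokenizer adds, and splitting at a word boundary rather than mid-word avoids breaking the first-subword labeling described above.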
IMO self-attention and a CRF layer are theoretically different, but in practice some of the problems that a CRF solved in earlier models can also be solved by self-attention (because both create a fully connected graph over the tokens). So using a softmax is preferable to a CRF layer.
The scores that the original BERT paper reported are not reproducible, and not comparable with most papers, since they used document-level NER fine-tuning.
If you still have questions about the architecture you can follow this;
you only have to replace the hierarchical RNN with a transformer as the encoder.
You can check the following papers for more info:
hi,
can I think of self.classifier = nn.Linear(config.hidden_size, config.num_labels) as a fully-connected layer?
The input dimension is config.hidden_size and the output dimension is config.num_labels, as shown.
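Yes, that is exactly a fully-connected layer applied independently to every token's hidden state. A minimal shape check, using 768 and 9 purely as example values for hidden_size and num_labels:

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 9          # example values, not from the config
classifier = nn.Linear(hidden_size, num_labels)

batch, seq_len = 2, 16
hidden_states = torch.randn(batch, seq_len, hidden_size)  # stand-in encoder output
logits = classifier(hidden_states)        # linear layer acts on the last dim
print(logits.shape)  # torch.Size([2, 16, 9])
```

Each token gets its own vector of num_labels logits; the softmax (inside CrossEntropyLoss during training, or applied explicitly at inference) turns those into per-token label probabilities.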