I am using Hugging Face’s Trainer with the --bf16 flag and DeepSpeed enabled. However, I want to force float32 for a specific layer. How can I do this?
Hmm… Mixed precision training?
opened 02:19AM - 08 Feb 24 UTC
closed 08:04AM - 05 Apr 24 UTC
Normally when saving a model using the Trainer class, the `dtype` of the saved model is (and should be) the same as the original model. This is also true when using mixed precision, and when using DeepSpeed. However, when using mixed precision **together** with DeepSpeed, the output is in float16 no matter the model input `dtype`.
The Trainer class has custom handling for DeepSpeed, depending on the ZeRO stage:
https://github.com/huggingface/transformers/blob/5f9685576149fb45a61d0dcec9a260930df0a49a/src/transformers/trainer.py#L2914-L2928
as does Accelerate:
https://github.com/huggingface/accelerate/blob/06b138d84537ffb2d1d404f2f198a0446e8d7ec3/src/accelerate/accelerator.py#L3042-L3056
For ZeRO stage <=2 DeepSpeed holds the model weights in the `state_dict`. Using mixed precision training, these are always in float16. Using full precision training, they are the same dtype as the original model.
For ZeRO stage 3 the `state_dict` contains just placeholders since the model weights are partitioned. By setting `stage3_gather_16bit_weights_on_model_save=true`, DeepSpeed consolidates the weights. When training using mixed precision, float16 is always produced. When training in full precision, despite the name, it follows the dtype of the original model. If `stage3_gather_16bit_weights_on_model_save=false`, Trainer saves a full checkpoint instead, and the DeepSpeed `zero_to_fp32.py` script can be used to recover weights in float32.
Currently, the only way to save a model that is trained using the Trainer class that applies mixed precision along with DeepSpeed ZeRO stage <=2 in float32, is to manually save a checkpoint and then use some weight recovery method afterwards. Is this due to a limitation of the DeepSpeed API, or could this be handled in the Trainer class (or preferably in Accelerate)? At least, maybe a flag could be available to either save the float16 weights or a checkpoint at the end of training (kind of how stage 3 with `stage3_gather_16bit_weights_on_model_save=true` is handled)?
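As a rough illustration of the ZeRO stage 3 setting described above, the relevant part of a DeepSpeed config could be written as a Python dict and serialized to JSON. This is a minimal sketch, not a recommended configuration: `bf16.enabled`, `zero_optimization.stage`, and `stage3_gather_16bit_weights_on_model_save` are DeepSpeed config keys, and everything else a real run needs (batch sizes, optimizer, and so on) is left out on purpose.

```python
import json

# Minimal sketch of a DeepSpeed config fragment (assumption: a real config
# would also set batch sizes, optimizer, scheduler, etc.).
ds_config = {
    # Train in bfloat16 mixed precision.
    "bf16": {"enabled": True},
    "zero_optimization": {
        # ZeRO stage 3 partitions the model weights across ranks.
        "stage": 3,
        # Ask DeepSpeed to consolidate the partitioned weights when saving.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# Serialize so the same content could be stored as a JSON config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The dict (or a JSON file with the same content) can be passed to `TrainingArguments(deepspeed=...)`. With the gather flag set to `False`, the issue text says a full checkpoint is saved instead and `zero_to_fp32.py` is used to recover float32 weights.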
### Who can help
@pacman100, @muellerzr
Or similar to this issue…?
I saw these lines:
bnb_4bit_compute_dtype=torch.float16,
...
optim = "paged_adamw_32bit"
...
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
in the Falcon tutorial. These confuse me. According to the original QLoRA tutorial, the model is stored in 4-bit, and computation during training uses bfloat16 (brain float 16), not regular float16 or float32. See equation 5:
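$$\mathbf{Y}^{\text{BF16}} = \mathbf{X}^{\text{BF16}}\,\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, \mathbf{W}^{\text{NF4}}) + \mathbf{X}^{\text{BF16}} \mathbf{L}_1^{\text{BF16}} \mathbf{L}_2^{\text{BF16}}$$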
doubleDequa…
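Coming back to the original question, the name-matching loop from the Falcon tutorial snippet can be tried on a toy module in plain PyTorch. This is only a sketch under stated assumptions: `TinyBlock` and its layer names are hypothetical, the model is cast to bfloat16 by hand rather than by Trainer, and DeepSpeed is not involved at all.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical toy module: one linear layer followed by a LayerNorm."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)
        # The norm holds float32 parameters, so feed it float32 input
        # and cast the result back to the caller's dtype.
        return self.norm(h.float()).to(x.dtype)


# Cast the whole model to bfloat16 by hand (stand-in for bf16 training).
model = TinyBlock().to(torch.bfloat16)

# Same pattern as the tutorial: pick submodules by name and upcast them.
# nn.Module.to() modifies the module in place, so no reassignment is needed.
for name, module in model.named_modules():
    if "norm" in name:
        module.to(torch.float32)

dtypes = {name: p.dtype for name, p in model.named_parameters()}
print(dtypes)
```

Whether the float32 cast survives under Trainer with `--bf16` and DeepSpeed is not verified here. DeepSpeed may recast the module when its own bf16 mode is enabled, so it is worth printing the parameter dtypes again after the engine has been built.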