Model evaluation in DDP training uses only one GPU

Hello! :smiley:
I am training an HF model with torch DDP using the following command line:

```
python -m torch.distributed.launch --nproc_per_node 2 my_script.py --{arguments}
```

I noticed that while training used both available GPUs, the evaluation step ran on only a single GPU. After checking the source code, it seems that here the model is not wrapped in DDP when `training=False`.

Is it expected that only one GPU will be used during the evaluation step? If yes, could you explain why the DDP cannot be used for the evaluation as well?

Have you figured out how to use multiple GPUs for the eval loop during training? I am facing the same issue.