Related topics
| Topic | Replies | Views | Activity |
|---|---|---|---|
| How to generate with a single gpu when a model is loaded onto multiple gpus? | 0 | 917 | February 9, 2024 |
| Getting error when running inference in multiple GPUs | 0 | 695 | October 13, 2023 |
| [SOLVED] What's the right way to do GPU paralellism for inference (not training) on AutoModelForCausalLM? | 1 | 295 | August 26, 2024 |
| If I use llama 70b and 7b for speculative decoding, how should I put them on my multiple gpus in the code | 0 | 71 | October 11, 2024 |
| Using 3 GPUs for training with Trainer() of transformers | 2 | 2420 | October 18, 2023 |