Uploading a dataset that doesn't fit in memory to the HF hub

What is the recommended way to build and upload a large dataset that doesn’t fit in memory to the HF hub?

Is there any way of stringing together existing dataset builder / constructor methods and push_to_hub to do this straightforwardly?

Feel free to use load_dataset or Dataset.from_generator to get a Dataset object from your large data source. Both write the data to disk as Arrow files and memory-map them, so they can load datasets bigger than memory.

Then push_to_hub() uploads the dataset shard by shard (shards of 500MB each by default), so you can also upload a dataset that doesn’t fit in memory.

Thanks for the response!

What if the dataset is also larger than what memory-mapping can handle? (Assuming that's what you mean by writing to disk in the from_generator case.)

memory mapping can handle datasets as long as they fit on your disk :wink:

oh cool! thanks