I am using AutoTrain to fine-tune my Llama model on my custom data, but the model gives random responses that ignore my dataset. My dataset has 145 rows in JSONL format, and when I start the fine-tuning with this dataset and look at the logs, I can see these rows:

So the dataset is recognized with 145 rows, which tells me that it is well-structured and that every row is a valid JSON object.
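A quick way to confirm this locally before uploading is a small validation script. This is a hypothetical helper (the filename `train.jsonl` and the required `"text"` column are assumptions based on the dataset described here):

```python
import json

def validate_jsonl(path: str, required_key: str = "text") -> int:
    """Return the number of valid rows; raise on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)  # raises json.JSONDecodeError if malformed
            if required_key not in row or not str(row[required_key]).strip():
                raise ValueError(f"line {lineno}: missing or empty {required_key!r} field")
            count += 1
    return count
```

If this returns 145 without raising, the file itself is not the problem.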
But right after the model shards are uploaded, it gives me this log:
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 9 examples [00:00, ? examples/s]
So my question is: Why does it log Generating train split 0 examples and Generating train split 9 examples right below?
Is this a normal behaviour of AutoTrain?
Or is there something that I have to adjust in my training dataset?
After the model is fine-tuned, I can see it on my Hugging Face hub, and I can also see the training statistics on TensorBoard, but the graphs show only a single dot, with a training loss of about 5.4. So yeah, every time I ask the model something about my dataset, or anything else, it answers randomly.
What can I do to fine-tune a model the right way? Maybe I just have to expand my dataset because 145 rows are not enough, and those logs are normal?
Why does it log Generating train split 0 examples and Generating train split 9 examples right below?
This error seems to occur when Column Mapping is not set correctly.
My dataset is in JSONL format and has only one column, "text".
In AutoTrain I set the Column Mapping like this:
And the chat template parameter is set to None
It appears to be correct… Another possible factor is that packing is enabled with a small dataset.
Also, unless there is a specific reason, I think it's safer to leave Chat Template on automatic.
Following the general documentation on the Column Mapping in AutoTrain topic I tried to set the Column Mapping like this:
And it gives me the error KeyError: {"text": "text"} is invalid (even though I'm using SFT).
So now, looking at the discussion, they talk about disabling the packing parameter, but even if I enable full parameter mode there is no packing parameter. Anyway, I'm using basic parameter mode because otherwise I don't know what to tweak.
Maybe I have to write the parameters manually, by activating JSON parameters first, so I can set something like packing=false and experiment with other parameters?
Or maybe my dataset is just too small and I have to expand it?
There is no doubt that the dataset is too small, but I don't think it's absolutely impossible with that amount of data…
If there were a publicly available dataset that reproduced the symptoms, it would be possible to investigate…
If there is no setting for packing, SFT will be difficult with a small dataset…
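As an illustration of why packing could explain the "9 examples" log (this is an assumption about what AutoTrain is doing, not something confirmed from the logs): packing concatenates many short rows into fixed-length token blocks, so 145 short rows can collapse into a handful of training sequences. The token counts below are made-up placeholders:

```python
# Rough sketch of how sequence packing reduces the example count.
# All numbers here are illustrative assumptions, not values read from AutoTrain.

def packed_example_count(num_rows: int, avg_tokens_per_row: int, block_size: int) -> int:
    """Total tokens divided into fixed-size blocks (ceiling division)."""
    total_tokens = num_rows * avg_tokens_per_row
    return -(-total_tokens // block_size)  # ceiling division without math.ceil

# 145 short rows of ~120 tokens packed into 2048-token blocks:
print(packed_example_count(145, 120, 2048))  # prints 9 with these assumed numbers
```

So a "Generating train split: 9 examples" line after starting from 145 rows would be consistent with packing being enabled, if those rows are short.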
Ok, it was predictable that the dataset was too small for a real fine-tuning. I'll create a bigger one, launch another fine-tuning, and we'll see if I get the same problem, but I don't think so.
Last question: what do you think is the minimal number of examples a dataset should have for a really good and successful fine-tuning?
Ah, I forgot to say: maybe the issue is that the AutoTrain GUI doesn't let you set a value for the packing parameter because it's set by default behind the scenes and can't be changed, so if someone wants to train their own model, the dataset has to be large.
Hmm, I think you should ask someone who knows more about LLM fine-tuning than I do, but what I sometimes hear is that "500 to 1000 samples are sufficient for LoRA", "data diversity is more important than quantity", etc.
Since it is difficult to manually create a dataset from scratch, many people choose to use existing AI tools to create datasets. Online documents like this may also be useful references regarding formatting.
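For formatting, a common pattern is to flatten each instruction/response pair into the single "text" column the dataset above uses. The "### Instruction / ### Response" template below is a generic illustrative format, an assumption, and not necessarily the chat template your base model expects:

```python
import json

# Generic instruction template (an assumption; adapt to your model's chat template).
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

pairs = [
    {"instruction": "What is the capital of France?", "response": "Paris."},
    {"instruction": "Name a primary color.", "response": "Red."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        row = {"text": TEMPLATE.format(**pair)}
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Each output line is then one valid JSON object with a single "text" key, matching the one-column layout described earlier in the thread.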
There are people who know more about AI than I do who say things like, "Ask AI about AI." Commercial AI systems like Gemini and ChatGPT have been trained on a lot of AI-related information, so when you ask them about AI itself, they often provide fairly reliable answers. Since they have a solid foundation of knowledge, even just enabling search can help you gather reasonably up-to-date information.
Ok, I think the documentation you pinged me is enough to solve the dataset problem.
Thank you so much for your time and support!! 
Wow, didn't know that. Ok, will try it then! Ty!!
