I am using AutoTrain to fine-tune my Llama model on my custom data, but the model gives random responses that ignore my dataset. My dataset has 145 rows in JSONL format, and when I start the fine-tuning with this dataset and look at the logs, I can see these rows:

So the dataset is recognized with 145 rows, which tells me that it is well-structured and that every row is a valid JSON object.
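A quick way to confirm this locally before uploading is a small validation script. This is a hypothetical helper (the filename `train.jsonl` and the required `"text"` column are assumptions based on the dataset described here):

```python
import json

def validate_jsonl(path: str, required_key: str = "text") -> int:
    """Return the number of valid rows; raise on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)  # raises json.JSONDecodeError if malformed
            if required_key not in row or not str(row[required_key]).strip():
                raise ValueError(f"line {lineno}: missing or empty {required_key!r} field")
            count += 1
    return count
```

If this returns 145 without raising, the file itself is not the problem.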
But right after the model shards are uploaded, it gives me this log:
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 9 examples [00:00, ? examples/s]
So my question is: Why does it log Generating train split 0 examples and Generating train split 9 examples right below?
Is this a normal behaviour of AutoTrain?
Or is there something that I have to adjust in my training dataset?
After the model is fine-tuned, I can see it on my Hugging Face hub, and I can also see the training statistics on TensorBoard, but the graphs show only a single dot, with a training loss of about 5.4. So yeah, every time I ask the model something about my dataset, or anything else, it answers randomly.
What can I do to fine-tune a model the right way? Maybe I just have to expand my dataset because 145 rows are not enough, and those logs are normal?
Why does it log Generating train split 0 examples and Generating train split 9 examples right below?
This error seems to occur when Column Mapping is not set correctly.
My dataset is in JSONL format and has only one column, "text".
In AutoTrain I set the Column Mapping like this:
And the chat template parameter is set to None
It appears to be correct… Another possible factor is that packing is enabled with a small dataset.
Also, unless there is a specific reason, I think it's safer to leave Chat Template on automatic.
Following the general documentation on the Column Mapping in AutoTrain topic I tried to set the Column Mapping like this:
And it gives me the error KeyError: {"text": "text"} is invalid (even though I'm using SFT).
So now, looking at the discussion, they talk about disabling the packing parameter, but even if I enable full parameter mode there is no packing parameter. Anyway, I'm using basic parameter mode because otherwise I don't know what to tweak.
Maybe I have to write the parameters manually, by activating JSON parameters first, so I can set something like packing=false and experiment with other parameters?
Or maybe my dataset is just too small and I have to expand it?
There is no doubt that the dataset is too small, but I don't think it's absolutely impossible with that amount of data…
If there were a publicly available dataset that reproduced the symptoms, it would be possible to investigate…
If there is no setting for packing, SFT will be difficult with a small dataset…
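As an illustration of why packing could explain the "9 examples" log (this is an assumption about what AutoTrain is doing, not something confirmed from the logs): packing concatenates many short rows into fixed-length token blocks, so 145 short rows can collapse into a handful of training sequences. The token counts below are made-up placeholders:

```python
# Rough sketch of how sequence packing reduces the example count.
# All numbers here are illustrative assumptions, not values read from AutoTrain.

def packed_example_count(num_rows: int, avg_tokens_per_row: int, block_size: int) -> int:
    """Total tokens divided into fixed-size blocks (ceiling division)."""
    total_tokens = num_rows * avg_tokens_per_row
    return -(-total_tokens // block_size)  # ceiling division without math.ceil

# 145 short rows of ~120 tokens packed into 2048-token blocks:
print(packed_example_count(145, 120, 2048))  # prints 9 with these assumed numbers
```

So a "Generating train split: 9 examples" line after starting from 145 rows would be consistent with packing being enabled, if those rows are short.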
Ok, it was predictable that the dataset was too small for a real fine-tuning. I'll create a bigger one, launch another fine-tuning, and we'll see if I get the same problem, but I don't think so.
Last question: what do you think is the minimal number of examples a dataset should have for a really good and successful fine-tuning?
Ah, I forgot to say: maybe the issue is that the AutoTrain GUI doesn't let you set a value for the packing parameter because it's set by default behind the scenes and can't be changed, so if someone wants to train their own model, the dataset has to be large.
Hmm, I think you should ask someone who knows more about LLM fine-tuning than I do, but what I sometimes hear is that "500 to 1000 samples are sufficient for LoRA", "data diversity is more important than quantity", etc.
Since it is difficult to manually create a dataset from scratch, many people choose to use existing AI tools to create datasets. Online documents like this may also be useful references regarding formatting.
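For formatting, a common pattern is to flatten each instruction/response pair into the single "text" column the dataset above uses. The "### Instruction / ### Response" template below is a generic illustrative format, an assumption, and not necessarily the chat template your base model expects:

```python
import json

# Generic instruction template (an assumption; adapt to your model's chat template).
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

pairs = [
    {"instruction": "What is the capital of France?", "response": "Paris."},
    {"instruction": "Name a primary color.", "response": "Red."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        row = {"text": TEMPLATE.format(**pair)}
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Each output line is then one valid JSON object with a single "text" key, matching the one-column layout described earlier in the thread.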
There are people who know more about AI than I do who say things like, "Ask AI about AI." Commercial AI systems like Gemini and ChatGPT have been trained on a lot of AI-related information, so when you ask them about AI itself, they often provide fairly reliable answers. Since they have a solid foundation of knowledge, even just enabling search can help you gather reasonably up-to-date information.
Ok, I think the documentation you pinged me is enough to solve the dataset problem.
Thank you so much for your time and support!! 
Wow, didn't know that. Ok, will try it then! Ty!!
