Since PDFs come in a wide range of formats—from relatively well-structured documents to those that are almost entirely images—I believe OCR is the biggest challenge when creating a dataset based on PDFs. In any case, we can’t expect the same level of precision as with LaTeX. Furthermore, if we have an LLM generate the “Answer” part of Q&A pairs, the resulting data may be of questionable quality for a mathematics dataset.
There are likely to be many hurdles.
There are two related but different tasks here:
- extracting existing problems from your PDFs, and
- generating new synthetic problems inspired by those extracted examples.
I would keep those stages separate. For math data, a simple pipeline like this is probably unsafe:
PDF → LLM → many problem-answer pairs → fine-tuning
The risk is not just bad wording. The risk is that OCR may corrupt formulas, problem boundaries may be wrong, answers may be mathematically false, solutions may contain flawed reasoning, and the final dataset may become hard to audit.
A safer design is closer to this:
PDFs / scanned pages
↓
OCR + layout + formula extraction
↓
seed problem extraction
↓
topic / difficulty / problem-type classification
↓
synthetic generation
↓
answer generation
↓
symbolic / numeric / code-based verification
↓
human review for uncertain examples
↓
JSONL / Parquet / Hugging Face Dataset
In short: use LLMs as part of the pipeline, but do not use an LLM alone as the final source of mathematical truth.
1. First separate “extracted data” from “synthetic data”
I would maintain at least two dataset layers.
Extracted seed dataset
This contains problems actually extracted from the PDFs.
Example:
{
"id": "seed_doc001_p012_q03",
"source_file": "algebra_notes.pdf",
"source_page": 12,
"bbox": [120, 240, 980, 420],
"problem": "Solve 2x + 3 = 11.",
"raw_ocr": "...",
"topic": "algebra",
"subtopic": "linear_equations",
"difficulty": "easy",
"needs_review": true
}
Synthetic dataset
This contains new problems derived from the extracted seed patterns.
Example:
{
"id": "synthetic_linear_equation_000001",
"problem": "Solve 5x - 7 = 18.",
"answer": "x = 5",
"solution": "Add 7 to both sides to get 5x = 25. Divide by 5 to get x = 5.",
"topic": "algebra",
"subtopic": "linear_equations",
"difficulty": "easy",
"synthetic": true,
"source_basis": {
"source_file": "algebra_notes.pdf",
"source_page": 12,
"source_problem_id": "seed_doc001_p012_q03"
},
"generation": {
"method": "template_parameterized",
"generator_version": "linear_equation_v1"
},
"verification": {
"method": "python_symbolic_check",
"status": "passed"
},
"review": {
"status": "approved"
}
}
This separation matters because extracted examples, generated examples, verified examples, and training-format examples are not the same thing.
2. Approach A: template-based generation
This is usually the safest first approach for math.
Start from a seed problem:
Solve 2x + 3 = 11.
Infer a template:
Solve ax + b = c.
Then generate variants by changing parameters:
Solve 5x - 7 = 18.
Solve 3x + 4 = 22.
Solve 9x - 2 = 43.
The key advantage is that the answer can be produced by code instead of guessed by an LLM.
import random

def make_linear_equation():
    x = random.randint(-20, 20)
    a = random.choice([i for i in range(-10, 11) if i != 0])
    b = random.randint(-30, 30)
    c = a * x + b
    # Format the constant term so a negative b prints as "- 7" rather than "+ -7".
    b_term = f"+ {b}" if b >= 0 else f"- {-b}"
    # The solution wording should also respect the sign of b.
    first_step = (
        f"Subtract {b} from both sides to get {a}x = {c - b}. "
        if b >= 0
        else f"Add {-b} to both sides to get {a}x = {c - b}. "
    )
    return {
        "problem": f"Solve {a}x {b_term} = {c}.",
        "answer": f"x = {x}",
        "solution": first_step + f"Then divide by {a}, so x = {x}.",
        "topic": "algebra",
        "subtopic": "linear_equations",
        "synthetic": True,
        "verified": True,
        "generation_method": "template_parameterized"
    }
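The same pattern extends to other topics. As an illustration only, here is a sketch of a derivative generator in which SymPy computes the verified answer (the coefficient ranges and quadratic form are arbitrary choices):

import random
import sympy as sp

def make_derivative_problem():
    x = sp.symbols("x")
    # Sample a small random quadratic.
    a, b, c = (random.randint(1, 9) for _ in range(3))
    expr = a * x**2 + b * x + c
    derivative = sp.diff(expr, x)  # the verified answer, computed symbolically
    return {
        "problem": f"Find the derivative of {sp.sstr(expr)}.",
        "answer": sp.sstr(derivative),
        "topic": "calculus",
        "subtopic": "derivatives",
        "synthetic": True,
        "verified": True,
        "generation_method": "template_parameterized"
    }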
This works well for many controlled problem types:
- arithmetic
- linear equations
- polynomial expansion
- factoring
- derivatives
- matrix operations
- simple probability/statistics
- many linear algebra exercises
Relevant examples:
- TemplateMath GitHub
- TemplateGSM dataset on Hugging Face
- Training and Evaluating Language Models with Template-based Data Generation
- TemplateMath project page
- Hugging Face blog: TemplateGSM
If I were starting this project, I would begin with this approach.
3. Approach B: Evol-Instruct style generation
This approach uses an LLM to transform seed problems into more diverse, harder, or more natural variants.
Seed:
Solve 2x + 3 = 11.
Possible variants:
A number is doubled and then increased by 3. The result is 11. Find the number.
Solve 2(3x - 1) + 5 = 21.
A student tried to solve 2x + 3 = 11 but made a mistake. Identify and correct the mistake.
This is useful when you want:
- word problems
- natural language variation
- multi-step reasoning
- harder variants
- curriculum-like progression
- explanation-style examples
But this is riskier than template generation. The LLM may accidentally change the mathematical structure, introduce contradictions, or generate a wrong solution.
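A minimal sketch of what this could look like. The prompt wording and the call_llm helper are placeholders rather than any specific API, and the output is only a candidate that still needs verification:

EVOLVE_PROMPT = """You are given a math problem. Rewrite it as a {variant_type}.
Keep the underlying mathematics solvable and unambiguous.

Original problem:
{seed_problem}

Rewritten problem:"""

def evolve_problem(seed_problem: str, variant_type: str, call_llm) -> dict:
    # call_llm is a placeholder for whatever LLM client you use.
    prompt = EVOLVE_PROMPT.format(variant_type=variant_type, seed_problem=seed_problem)
    candidate = call_llm(prompt)
    # Mark the result as unverified: it must still pass answer generation,
    # programmatic checks, and sampled human review before entering the dataset.
    return {
        "problem": candidate,
        "seed_problem": seed_problem,
        "variant_type": variant_type,
        "verified": False
    }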
Relevant examples:
- WizardMath paper: Empowering Mathematical Reasoning via Reinforced Evol-Instruct
- WizardMath project page
- WizardMath on OpenReview
- WizardMath model on Hugging Face
- MetaMath paper: Bootstrap Your Own Mathematical Questions
- MetaMathQA dataset
I would use this only with strong filtering, verification, and sampling-based human review.
4. Approach C: code-based verified generation
This is the strongest approach if the goal is a serious math fine-tuning dataset.
Instead of asking an LLM to directly write many problem-answer pairs, use the PDFs to discover common problem patterns, then build code generators for those patterns.
Example idea:
def generate_eigenvalue_problem():
# 1. sample a matrix with controlled properties
# 2. compute eigenvalues using sympy
# 3. generate a problem statement
# 4. generate a verified answer and solution
# 5. optionally use an LLM to polish the wording
pass
This gives you synthetic problems whose answers are derived from code. The LLM can still help with wording and explanations, but the mathematical answer should be independently checked.
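A minimal concrete sketch of that idea, assuming SymPy is available (the matrix size and entry range are arbitrary choices):

import random
import sympy as sp

def generate_eigenvalue_problem():
    # 1. sample a small integer matrix with controlled entries
    entries = [random.randint(-5, 5) for _ in range(4)]
    M = sp.Matrix(2, 2, entries)
    # 2. compute eigenvalues exactly with SymPy, so the answer is verified by construction
    eigenvalues = list(M.eigenvals().keys())
    # 3. generate a problem statement
    problem = f"Find the eigenvalues of the matrix {sp.latex(M)}."
    # 4. the answer comes from the symbolic computation, not from an LLM
    answer = ", ".join(sp.latex(ev) for ev in eigenvalues)
    # 5. an LLM could polish the wording afterwards, with the answer kept fixed
    return {"problem": problem, "answer": answer, "verified": True}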
Relevant examples:
- Synthesizing Verified Mathematical Problems
- OpenMathInstruct-1 paper
- OpenMathInstruct-1 dataset
- NVIDIA OpenMath collection
- OpenMath-CodeLlama model
OpenMathInstruct-1 is especially relevant because it uses synthetic code-interpreter-style solutions for math instruction tuning. That is a good design signal: for math, it is often better to involve executable code than to rely on plain natural-language reasoning alone.
5. PDF / OCR / formula extraction
Before generating synthetic data, the PDF extraction step needs attention.
For normal text PDFs, basic extraction may be enough. For scanned pages, math-heavy documents, tables, diagrams, and formulas, ordinary OCR may fail.
Useful tools:
- Mathpix — strong option for STEM OCR, math, tables, LaTeX, Markdown, and PDF conversion.
- Mathpix API docs
- Mathpix PDF data extraction
- Mathpix PDF to Markdown
- Mathpix Markdown
- Docling — document conversion with layout, reading order, OCR, tables, formulas, Markdown/JSON output.
- Docling docs
- Docling GitHub
- Docling technical report
- Docling code/formula extraction example
- Marker GitHub
- Unstructured GitHub
- Unstructured partitioning docs
- Nougat paper
- Nougat GitHub
- GROBID docs
For scanned math worksheets or formula-heavy PDFs, I would test Mathpix first. For open-source or local processing, I would test Docling, Marker, Nougat, or Unstructured depending on the PDF type.
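As one example, a minimal sketch of local conversion with Docling, following its basic quickstart (the file name is a placeholder, and the API may differ slightly between versions):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("algebra_notes.pdf")  # placeholder file name
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])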
Do not store only the cleaned text. Preserve traceability:
{
"source_file": "linear_algebra_notes_03.pdf",
"page": 14,
"bbox": [120, 240, 980, 420],
"raw_ocr": "...",
"problem_text": "...",
"formula_latex": "...",
"extraction_confidence": 0.82,
"needs_review": true
}
This makes later debugging much easier.
6. Verification is the central problem
For math, the main question is not “Can an LLM generate answers?” The main question is “How do we reject wrong answers?”
A wrong answer in a math fine-tuning dataset is especially damaging because the model may learn incorrect reasoning patterns.
Use programmatic verification whenever possible.
Useful tools and techniques:
- SymPy
- SymPy docs
- Python numeric checks
- code execution
- program-of-thought / tool-integrated solving
- multiple independent solution methods
- LLM-as-critic as a secondary filter, not the only verifier
- human review for proofs, geometry, diagrams, and ambiguous word problems
Example check:
import sympy as sp

x = sp.symbols("x")
# Problem: 2x + 3 = 11
candidate_answer = 4
# Substitute the candidate into the left-hand side and compare with the right-hand side.
ok = sp.simplify((2 * x + 3).subs(x, candidate_answer) - 11) == 0
print(ok)  # True
Derivative check:
import sympy as sp
x = sp.symbols("x")
expr = x**2 + 3*x
candidate = 2*x + 3
ok = sp.simplify(sp.diff(expr, x) - candidate) == 0
print(ok) # True
Verification is easier for:
- arithmetic
- algebra
- equations
- symbolic simplification
- derivatives
- matrix operations
- many numerical problems
Verification is harder for:
- geometry with diagrams
- proof problems
- vague word problems
- open-ended explanations
- cases with many equivalent answer forms
For a first version, I would start with problem types that can be automatically verified.
7. Human review should be part of the design
I would not ask humans to write the whole dataset manually. Instead, let automation create candidates and ask humans to approve, reject, or fix uncertain cases.
Useful tools:
- Label Studio
- Label Studio PDF OCR template
- Label Studio OCR template
- Label Studio document processing integrations
- Argilla
- Argilla docs
- Argilla on Hugging Face Hub
- Hugging Face LLM Course: Argilla
Example review metadata:
{
"review_status": "needs_review",
"review_reason": "formula_ocr_low_confidence"
}
or:
{
"review_status": "approved",
"reviewer_role": "domain_expert"
}
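A rough sketch of triage logic that could produce metadata like the above. The field names follow the schemas shown earlier in this answer; the threshold value and the auto_accepted status are assumptions:

def triage(example: dict, ocr_confidence_threshold: float = 0.9) -> dict:
    # Route anything that failed (or skipped) programmatic verification to humans.
    if example.get("verification", {}).get("status") != "passed":
        return {"review_status": "needs_review", "review_reason": "verification_not_passed"}
    # Route low-confidence OCR extractions to humans as well.
    if example.get("extraction_confidence", 1.0) < ocr_confidence_threshold:
        return {"review_status": "needs_review", "review_reason": "formula_ocr_low_confidence"}
    # Everything else can wait for sampling-based spot checks.
    return {"review_status": "auto_accepted"}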
8. Existing math datasets worth studying
Before designing your own schema, inspect existing dataset cards and samples.
Useful examples:
- nvidia/OpenMathInstruct-1
- OpenMathInstruct-1 paper
- NVIDIA OpenMath collection
- math-ai/TemplateGSM
- TemplateMath GitHub
- meta-math/MetaMathQA
- MetaMath paper page
- TIGER-Lab/MathInstruct
- MAmmoTH paper
- MAmmoTH GitHub
- AI-MO/NuminaMath-CoT
- AI-MO/NuminaMath-TIR
- AI-MO NuminaMath collection
- Hugging Face open-r1 issue: datasets for math
These are useful not only as data sources, but also as examples of schema design, solution formatting, dataset cards, and metadata.
9. Synthetic-data pipeline tools
If you want a more general synthetic-data workflow, these may help:
- distilabel
- distilabel GitHub
- Hugging Face docs: distilabel datasets integration
- Hugging Face blog: Synthetic data with Llama 3 and distilabel
- NVIDIA NeMo Data Designer
- NVIDIA NeMo Data Designer GitHub
- NeMo Curator synthetic data generation docs
- Gretel docs
- Gretel Trainer docs
For math, I would still prioritize verified generation over generic synthetic text generation.
10. RAG evaluation is a different use case
If the goal is to evaluate a RAG system over PDFs, use a different pipeline.
RAG evaluation asks:
Can the system retrieve from the PDF and answer correctly?
Math fine-tuning asks:
Can the model learn to solve this kind of math problem?
These are different tasks.
For RAG evaluation from documents, see:
- Ragas testset generation docs
- Ragas test data generation concepts
- Ragas GitHub
- Unstructured + Ragas synthetic test data example
If you only want a PDF QA system, RAG may be easier than fine-tuning. If you want a model to internalize a domain-specific math problem style, then you need a stronger dataset-generation pipeline.
11. Hugging Face Datasets basics
Once the data is clean, Hugging Face Datasets is the relatively easy part.
Good starting docs:
- Create a dataset
- Load local and remote files
- Process a dataset
- Upload a dataset
- Uploading datasets to the Hub
- Dataset upload decision guide
- Dataset cards
- Create a dataset card
- Gated datasets
- Repository licenses
Common file choices:
- JSONL: easiest for prototyping and inspection
- Parquet: usually better for larger datasets
- CSV: fine for simple flat data, less convenient for nested metadata
- Arrow / save_to_disk: useful for local processed datasets
Example JSONL:
{"id":"ex_000001","problem":"Solve 2x + 3 = 11.","answer":"x = 4","solution":"Subtract 3 from both sides to get 2x = 8. Divide by 2 to get x = 4.","topic":"linear_equations","difficulty":"easy","synthetic":true,"verified":true}
{"id":"ex_000002","problem":"Find the derivative of x^2 + 3x.","answer":"2x + 3","solution":"Differentiate term by term: d/dx x^2 = 2x and d/dx 3x = 3.","topic":"calculus","difficulty":"easy","synthetic":true,"verified":true}
Load locally:
from datasets import load_dataset
ds = load_dataset(
"json",
data_files={
"train": "train.jsonl",
"validation": "validation.jsonl",
"test": "test.jsonl",
},
)
print(ds)
print(ds["train"][0])
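If you prefer Parquet for larger datasets (as noted above), a split can be exported directly with the standard to_parquet method and loaded back the same way:

from datasets import load_dataset

ds = load_dataset("json", data_files={"train": "train.jsonl"})
# Write the training split to Parquet, then reload it to confirm the round trip.
ds["train"].to_parquet("train.parquet")
ds_parquet = load_dataset("parquet", data_files={"train": "train.parquet"})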
Push to the Hub:
from datasets import load_dataset
ds = load_dataset("json", data_files={"train": "train.jsonl"})
ds.push_to_hub("<your-username>/<your-dataset-name>")
Push privately:
ds.push_to_hub("<your-username>/<your-private-dataset-name>", private=True)
For a first version, I would upload privately, inspect the Dataset Viewer, fix schema problems, and only then consider making it public.
12. Keep a rich source dataset, then export SFT views
I would not make the SFT format the only copy of the data.
Keep a rich source dataset first:
{
"id": "synthetic_linear_equation_000001",
"problem": "Solve 5x - 7 = 18.",
"answer": "x = 5",
"solution": "Add 7 to both sides to get 5x = 25. Divide by 5 to get x = 5.",
"topic": "algebra",
"subtopic": "linear_equations",
"difficulty": "easy",
"synthetic": true,
"source_basis": {
"source_file": "algebra_notes.pdf",
"source_page": 12,
"source_problem_id": "seed_doc001_p012_q03"
},
"generation": {
"method": "template_parameterized",
"generator_version": "linear_equation_v1",
"llm_used_for_wording": false
},
"verification": {
"method": "python_symbolic_check",
"status": "passed"
},
"review": {
"status": "approved"
}
}
Then export separate views:
raw_extracted_problems.jsonl
verified_synthetic_math.jsonl
train_sft_prompt_completion.jsonl
train_sft_chat.jsonl
eval_verified_only.jsonl
human_review_queue.jsonl
This makes the data easier to audit and repair.
13. Dataset format for SFT
If you plan to use TRL SFTTrainer, common formats include:
- plain text prompt/completion
- chat-style messages
Prompt-completion example:
{"prompt":"Solve 2x + 3 = 11.","completion":"The answer is x = 4. Subtract 3 from both sides to get 2x = 8, then divide by 2."}
Chat format example:
{"messages":[{"role":"user","content":"Solve 2x + 3 = 11."},{"role":"assistant","content":"The answer is x = 4. Subtract 3 from both sides to get 2x = 8, then divide by 2."}]}
Again, I would treat these as exported training views, not as the only source of truth.
14. Dataset card and licensing notes
If the dataset is uploaded to Hugging Face, write a real dataset card.
Include:
- source document description
- whether examples are extracted, synthetic, or both
- generation method
- verification method
- human review process
- license
- intended use
- out-of-scope use
- known limitations
- data fields
- train/validation/test split method
- benchmark contamination precautions
- copyright / rights statement
Important caution: if the PDFs are copyrighted textbooks, paid worksheets, scanned books, course materials, or proprietary documents, you may not have the right to publish extracted or derived datasets. Even synthetic variants can be risky if they are too close to the source problems.
Private or gated datasets help with access control, but they do not automatically solve licensing problems.
15. Current Hugging Face Datasets caution
For a new dataset, I would prefer simple data files such as JSONL or Parquet plus a clear README/dataset card.
Avoid relying on old examples that require custom remote dataset loading scripts.
Relevant links:
- Datasets loading docs
- Uploading datasets to the Hub
- GitHub issue: Dataset scripts are no longer supported
- Forum: Dataset scripts are no longer supported
For a new project, JSONL/Parquet + metadata + dataset card is usually the safer path.
16. Suggested proof of concept
I would not begin with all PDFs.
Start small:
20-50 pages
↓
extract 100-300 seed problems
↓
classify by topic and problem type
↓
choose 3-5 problem types that are easy to verify
↓
generate 1,000-5,000 synthetic variants
↓
verify with Python/SymPy
↓
manually review a sample
↓
upload a private HF dataset
Measure:
- OCR/formula error rate
- problem segmentation accuracy
- topic classification accuracy
- answer verification pass rate
- duplicate / near-duplicate rate
- human review time per accepted example
- final usable examples per source page
This will tell you whether full-scale generation is realistic.
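Two of these metrics can be computed directly from the record schema used earlier. A rough sketch (the duplicate check is exact-match after crude normalization, so it undercounts near-duplicates):

import json
import re

def normalize(problem: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return re.sub(r"\s+", " ", problem.lower()).strip()

def poc_metrics(path: str) -> dict:
    records = [json.loads(line) for line in open(path)]
    passed = sum(
        1 for r in records
        if r.get("verification", {}).get("status") == "passed"
    )
    unique = len({normalize(r["problem"]) for r in records})
    return {
        "answer_verification_pass_rate": passed / len(records),
        "duplicate_rate": 1 - unique / len(records),
    }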
Practical recommendation
If I were building this, I would do the following:
- Convert PDFs/scans to Markdown, LaTeX, or structured JSON using Mathpix, Docling, or Marker.
- Extract seed problems with page number, bounding box, and raw OCR preserved.
- Use an LLM to classify topic, subtopic, problem type, and difficulty.
- Start with a few high-frequency, easy-to-verify problem types.
- Build template or code generators for those types.
- Generate answers using code or SymPy when possible.
- Use an LLM for natural wording and explanation polishing, not as the final verifier.
- Run automatic checks.
- Send uncertain examples to Label Studio or Argilla.
- Store the rich dataset as JSONL or Parquet.
- Export a separate SFT-ready view.
- Upload first as a private Hugging Face dataset.
- Add a dataset card explaining source, generation, verification, review, license, and limitations.
The safest summary is:
Extract seed problems from the PDFs, learn the problem patterns, generate new variants with templates or code, verify answers programmatically, then package the result with Hugging Face Datasets.
That is much safer than directly asking an LLM to generate thousands of math problem-answer pairs from PDFs.