Since PDFs come in a wide range of formats—from relatively well-structured documents to those that are almost entirely images—I believe OCR is the biggest challenge when creating a dataset based on PDFs. In any case, we can’t expect the same level of precision as with LaTeX. Furthermore, if we have an LLM generate the “Answer” part of Q&A pairs, the resulting data may be of questionable quality for a mathematics dataset.
There are likely to be many hurdles.
There are two related but different tasks here:
- extracting existing problems from your PDFs, and
- generating new synthetic problems inspired by those extracted examples.
I would keep those stages separate. For math data, a simple pipeline like this is probably unsafe:
PDF → LLM → many problem-answer pairs → fine-tuning
The risk is not just bad wording. The risk is that OCR may corrupt formulas, problem boundaries may be wrong, answers may be mathematically false, solutions may contain flawed reasoning, and the final dataset may become hard to audit.
A safer design is closer to this:
PDFs / scanned pages
↓
OCR + layout + formula extraction
↓
seed problem extraction
↓
topic / difficulty / problem-type classification
↓
synthetic generation
↓
answer generation
↓
symbolic / numeric / code-based verification
↓
human review for uncertain examples
↓
JSONL / Parquet / Hugging Face Dataset
In short: use LLMs as part of the pipeline, but do not use an LLM alone as the final source of mathematical truth.
1. First separate “extracted data” from “synthetic data”
I would maintain at least two dataset layers.
Extracted seed dataset
This contains problems actually extracted from the PDFs.
Example:
{
"id": "seed_doc001_p012_q03",
"source_file": "algebra_notes.pdf",
"source_page": 12,
"bbox": [120, 240, 980, 420],
"problem": "Solve 2x + 3 = 11.",
"raw_ocr": "...",
"topic": "algebra",
"subtopic": "linear_equations",
"difficulty": "easy",
"needs_review": true
}
Synthetic dataset
This contains new problems derived from the extracted seed patterns.
Example:
{
"id": "synthetic_linear_equation_000001",
"problem": "Solve 5x - 7 = 18.",
"answer": "x = 5",
"solution": "Add 7 to both sides to get 5x = 25. Divide by 5 to get x = 5.",
"topic": "algebra",
"subtopic": "linear_equations",
"difficulty": "easy",
"synthetic": true,
"source_basis": {
"source_file": "algebra_notes.pdf",
"source_page": 12,
"source_problem_id": "seed_doc001_p012_q03"
},
"generation": {
"method": "template_parameterized",
"generator_version": "linear_equation_v1"
},
"verification": {
"method": "python_symbolic_check",
"status": "passed"
},
"review": {
"status": "approved"
}
}
This separation matters because extracted examples, generated examples, verified examples, and training-format examples are not the same thing.
2. Approach A: template-based generation
This is usually the safest first approach for math.
Start from a seed problem:
Solve 2x + 3 = 11.
Infer a template:
Solve ax + b = c.
Then generate variants by changing parameters:
Solve 5x - 7 = 18.
Solve 3x + 4 = 22.
Solve 9x - 2 = 43.
The key advantage is that the answer can be produced by code instead of guessed by an LLM.
import random

def make_linear_equation():
    x = random.randint(-20, 20)
    a = random.choice([i for i in range(-10, 11) if i != 0])
    b = random.randint(-30, 30)
    c = a * x + b
    # Format the constant term so a negative b prints as "- 7" rather than "+ -7".
    b_term = f"+ {b}" if b >= 0 else f"- {-b}"
    # The solution wording should also respect the sign of b.
    first_step = (
        f"Subtract {b} from both sides to get {a}x = {c - b}. "
        if b >= 0
        else f"Add {-b} to both sides to get {a}x = {c - b}. "
    )
    return {
        "problem": f"Solve {a}x {b_term} = {c}.",
        "answer": f"x = {x}",
        "solution": first_step + f"Then divide by {a}, so x = {x}.",
        "topic": "algebra",
        "subtopic": "linear_equations",
        "synthetic": True,
        "verified": True,
        "generation_method": "template_parameterized"
    }
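The same pattern extends to other topics. As an illustration only, here is a sketch of a derivative generator in which SymPy computes the verified answer (the coefficient ranges and quadratic form are arbitrary choices):

import random
import sympy as sp

def make_derivative_problem():
    x = sp.symbols("x")
    # Sample a small random quadratic.
    a, b, c = (random.randint(1, 9) for _ in range(3))
    expr = a * x**2 + b * x + c
    derivative = sp.diff(expr, x)  # the verified answer, computed symbolically
    return {
        "problem": f"Find the derivative of {sp.sstr(expr)}.",
        "answer": sp.sstr(derivative),
        "topic": "calculus",
        "subtopic": "derivatives",
        "synthetic": True,
        "verified": True,
        "generation_method": "template_parameterized"
    }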
This works well for many controlled problem types:
- arithmetic
- linear equations
- polynomial expansion
- factoring
- derivatives
- matrix operations
- simple probability/statistics
- many linear algebra exercises
Relevant examples:
- TemplateMath GitHub
- TemplateGSM dataset on Hugging Face
- Training and Evaluating Language Models with Template-based Data Generation
- TemplateMath project page
- Hugging Face blog: TemplateGSM
If I were starting this project, I would begin with this approach.
3. Approach B: Evol-Instruct style generation
This approach uses an LLM to transform seed problems into more diverse, harder, or more natural variants.
Seed:
Solve 2x + 3 = 11.
Possible variants:
A number is doubled and then increased by 3. The result is 11. Find the number.
Solve 2(3x - 1) + 5 = 21.
A student tried to solve 2x + 3 = 11 but made a mistake. Identify and correct the mistake.
This is useful when you want:
- word problems
- natural language variation
- multi-step reasoning
- harder variants
- curriculum-like progression
- explanation-style examples
But this is riskier than template generation. The LLM may accidentally change the mathematical structure, introduce contradictions, or generate a wrong solution.
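A minimal sketch of what this could look like. The prompt wording and the call_llm helper are placeholders rather than any specific API, and the output is only a candidate that still needs verification:

EVOLVE_PROMPT = """You are given a math problem. Rewrite it as a {variant_type}.
Keep the underlying mathematics solvable and unambiguous.

Original problem:
{seed_problem}

Rewritten problem:"""

def evolve_problem(seed_problem: str, variant_type: str, call_llm) -> dict:
    # call_llm is a placeholder for whatever LLM client you use.
    prompt = EVOLVE_PROMPT.format(variant_type=variant_type, seed_problem=seed_problem)
    candidate = call_llm(prompt)
    # Mark the result as unverified: it must still pass answer generation,
    # programmatic checks, and sampled human review before entering the dataset.
    return {
        "problem": candidate,
        "seed_problem": seed_problem,
        "variant_type": variant_type,
        "verified": False
    }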
Relevant examples:
- WizardMath paper: Empowering Mathematical Reasoning via Reinforced Evol-Instruct
- WizardMath project page
- WizardMath on OpenReview
- WizardMath model on Hugging Face
- MetaMath paper: Bootstrap Your Own Mathematical Questions
- MetaMathQA dataset
I would use this only with strong filtering, verification, and sampling-based human review.
4. Approach C: code-based verified generation
This is the strongest approach if the goal is a serious math fine-tuning dataset.
Instead of asking an LLM to directly write many problem-answer pairs, use the PDFs to discover common problem patterns, then build code generators for those patterns.
Example idea:
def generate_eigenvalue_problem():
# 1. sample a matrix with controlled properties
# 2. compute eigenvalues using sympy
# 3. generate a problem statement
# 4. generate a verified answer and solution
# 5. optionally use an LLM to polish the wording
pass
This gives you synthetic problems whose answers are derived from code. The LLM can still help with wording and explanations, but the mathematical answer should be independently checked.
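A minimal concrete sketch of that idea, assuming SymPy is available (the matrix size and entry range are arbitrary choices):

import random
import sympy as sp

def generate_eigenvalue_problem():
    # 1. sample a small integer matrix with controlled entries
    entries = [random.randint(-5, 5) for _ in range(4)]
    M = sp.Matrix(2, 2, entries)
    # 2. compute eigenvalues exactly with SymPy, so the answer is verified by construction
    eigenvalues = list(M.eigenvals().keys())
    # 3. generate a problem statement
    problem = f"Find the eigenvalues of the matrix {sp.latex(M)}."
    # 4. the answer comes from the symbolic computation, not from an LLM
    answer = ", ".join(sp.latex(ev) for ev in eigenvalues)
    # 5. an LLM could polish the wording afterwards, with the answer kept fixed
    return {"problem": problem, "answer": answer, "verified": True}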
Relevant examples:
- Synthesizing Verified Mathematical Problems
- OpenMathInstruct-1 paper
- OpenMathInstruct-1 dataset
- NVIDIA OpenMath collection
- OpenMath-CodeLlama model
OpenMathInstruct-1 is especially relevant because it uses synthetic code-interpreter-style solutions for math instruction tuning. That is a good design signal: for math, it is often better to involve executable code than to rely on plain natural-language reasoning alone.
5. PDF / OCR / formula extraction
Before generating synthetic data, the PDF extraction step needs attention.
For normal text PDFs, basic extraction may be enough. For scanned pages, math-heavy documents, tables, diagrams, and formulas, ordinary OCR may fail.
Useful tools:
- Mathpix — strong option for STEM OCR, math, tables, LaTeX, Markdown, and PDF conversion.
- Mathpix API docs
- Mathpix PDF data extraction
- Mathpix PDF to Markdown
- Mathpix Markdown
- Docling — document conversion with layout, reading order, OCR, tables, formulas, Markdown/JSON output.
- Docling docs
- Docling GitHub
- Docling technical report
- Docling code/formula extraction example
- Marker GitHub
- Unstructured GitHub
- Unstructured partitioning docs
- Nougat paper
- Nougat GitHub
- GROBID docs
For scanned math worksheets or formula-heavy PDFs, I would test Mathpix first. For open-source or local processing, I would test Docling, Marker, Nougat, or Unstructured depending on the PDF type.
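As one example, a minimal sketch of local conversion with Docling, following its basic quickstart (the file name is a placeholder, and the API may differ slightly between versions):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("algebra_notes.pdf")  # placeholder file name
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])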
Do not store only the cleaned text. Preserve traceability:
{
"source_file": "linear_algebra_notes_03.pdf",
"page": 14,
"bbox": [120, 240, 980, 420],
"raw_ocr": "...",
"problem_text": "...",
"formula_latex": "...",
"extraction_confidence": 0.82,
"needs_review": true
}
This makes later debugging much easier.
6. Verification is the central problem
For math, the main question is not “Can an LLM generate answers?” The main question is “How do we reject wrong answers?”
A wrong answer in a math fine-tuning dataset is especially damaging because the model may learn incorrect reasoning patterns.
Use programmatic verification whenever possible.
Useful tools and techniques:
- SymPy
- SymPy docs
- Python numeric checks
- code execution
- program-of-thought / tool-integrated solving
- multiple independent solution methods
- LLM-as-critic as a secondary filter, not the only verifier
- human review for proofs, geometry, diagrams, and ambiguous word problems
Example check:
import sympy as sp

x = sp.symbols("x")
# Problem: 2x + 3 = 11
candidate_answer = 4
# Substitute the candidate into the left-hand side and compare with the right-hand side.
ok = sp.simplify((2 * x + 3).subs(x, candidate_answer) - 11) == 0
print(ok)  # True
Derivative check:
import sympy as sp
x = sp.symbols("x")
expr = x**2 + 3*x
candidate = 2*x + 3
ok = sp.simplify(sp.diff(expr, x) - candidate) == 0
print(ok) # True
Verification is easier for:
- arithmetic
- algebra
- equations
- symbolic simplification
- derivatives
- matrix operations
- many numerical problems
Verification is harder for:
- geometry with diagrams
- proof problems
- vague word problems
- open-ended explanations
- cases with many equivalent answer forms
For a first version, I would start with problem types that can be automatically verified.
7. Human review should be part of the design
I would not ask humans to write the whole dataset manually. Instead, let automation create candidates and ask humans to approve, reject, or fix uncertain cases.
Useful tools:
- Label Studio
- Label Studio PDF OCR template
- Label Studio OCR template
- Label Studio document processing integrations
- Argilla
- Argilla docs
- Argilla on Hugging Face Hub
- Hugging Face LLM Course: Argilla
Example review metadata:
{
"review_status": "needs_review",
"review_reason": "formula_ocr_low_confidence"
}
or:
{
"review_status": "approved",
"reviewer_role": "domain_expert"
}
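A rough sketch of triage logic that could produce metadata like the above. The field names follow the schemas shown earlier in this answer; the threshold value and the auto_accepted status are assumptions:

def triage(example: dict, ocr_confidence_threshold: float = 0.9) -> dict:
    # Route anything that failed (or skipped) programmatic verification to humans.
    if example.get("verification", {}).get("status") != "passed":
        return {"review_status": "needs_review", "review_reason": "verification_not_passed"}
    # Route low-confidence OCR extractions to humans as well.
    if example.get("extraction_confidence", 1.0) < ocr_confidence_threshold:
        return {"review_status": "needs_review", "review_reason": "formula_ocr_low_confidence"}
    # Everything else can wait for sampling-based spot checks.
    return {"review_status": "auto_accepted"}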
8. Existing math datasets worth studying
Before designing your own schema, inspect existing dataset cards and samples.
Useful examples:
- nvidia/OpenMathInstruct-1
- OpenMathInstruct-1 paper
- NVIDIA OpenMath collection
- math-ai/TemplateGSM
- TemplateMath GitHub
- meta-math/MetaMathQA
- MetaMath paper page
- TIGER-Lab/MathInstruct
- MAmmoTH paper
- MAmmoTH GitHub
- AI-MO/NuminaMath-CoT
- AI-MO/NuminaMath-TIR
- AI-MO NuminaMath collection
- Hugging Face open-r1 issue: datasets for math
These are useful not only as data sources, but also as examples of schema design, solution formatting, dataset cards, and metadata.
9. Synthetic-data pipeline tools
If you want a more general synthetic-data workflow, these may help:
- distilabel
- distilabel GitHub
- Hugging Face docs: distilabel datasets integration
- Hugging Face blog: Synthetic data with Llama 3 and distilabel
- NVIDIA NeMo Data Designer
- NVIDIA NeMo Data Designer GitHub
- NeMo Curator synthetic data generation docs
- Gretel docs
- Gretel Trainer docs
For math, I would still prioritize verified generation over generic synthetic text generation.
10. RAG evaluation is a different use case
If the goal is to evaluate a RAG system over PDFs, use a different pipeline.
RAG evaluation asks:
Can the system retrieve from the PDF and answer correctly?
Math fine-tuning asks:
Can the model learn to solve this kind of math problem?
These are different tasks.
For RAG evaluation from documents, see:
- Ragas testset generation docs
- Ragas test data generation concepts
- Ragas GitHub
- Unstructured + Ragas synthetic test data example
If you only want a PDF QA system, RAG may be easier than fine-tuning. If you want a model to internalize a domain-specific math problem style, then you need a stronger dataset-generation pipeline.
11. Hugging Face Datasets basics
Once the data is clean, Hugging Face Datasets is the relatively easy part.
Good starting docs:
- Create a dataset
- Load local and remote files
- Process a dataset
- Upload a dataset
- Uploading datasets to the Hub
- Dataset upload decision guide
- Dataset cards
- Create a dataset card
- Gated datasets
- Repository licenses
Common file choices:
- JSONL: easiest for prototyping and inspection
- Parquet: usually better for larger datasets
- CSV: fine for simple flat data, less convenient for nested metadata
- Arrow / save_to_disk: useful for local processed datasets
Example JSONL:
{"id":"ex_000001","problem":"Solve 2x + 3 = 11.","answer":"x = 4","solution":"Subtract 3 from both sides to get 2x = 8. Divide by 2 to get x = 4.","topic":"linear_equations","difficulty":"easy","synthetic":true,"verified":true}
{"id":"ex_000002","problem":"Find the derivative of x^2 + 3x.","answer":"2x + 3","solution":"Differentiate term by term: d/dx x^2 = 2x and d/dx 3x = 3.","topic":"calculus","difficulty":"easy","synthetic":true,"verified":true}
Load locally:
from datasets import load_dataset
ds = load_dataset(
"json",
data_files={
"train": "train.jsonl",
"validation": "validation.jsonl",
"test": "test.jsonl",
},
)
print(ds)
print(ds["train"][0])
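If you prefer Parquet for larger datasets (as noted above), a split can be exported directly with the standard to_parquet method and loaded back the same way:

from datasets import load_dataset

ds = load_dataset("json", data_files={"train": "train.jsonl"})
# Write the training split to Parquet, then reload it to confirm the round trip.
ds["train"].to_parquet("train.parquet")
ds_parquet = load_dataset("parquet", data_files={"train": "train.parquet"})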
Push to the Hub:
from datasets import load_dataset
ds = load_dataset("json", data_files={"train": "train.jsonl"})
ds.push_to_hub("<your-username>/<your-dataset-name>")
Push privately:
ds.push_to_hub("<your-username>/<your-private-dataset-name>", private=True)
For a first version, I would upload privately, inspect the Dataset Viewer, fix schema problems, and only then consider making it public.
12. Keep a rich source dataset, then export SFT views
I would not make the SFT format the only copy of the data.
Keep a rich source dataset first:
{
"id": "synthetic_linear_equation_000001",
"problem": "Solve 5x - 7 = 18.",
"answer": "x = 5",
"solution": "Add 7 to both sides to get 5x = 25. Divide by 5 to get x = 5.",
"topic": "algebra",
"subtopic": "linear_equations",
"difficulty": "easy",
"synthetic": true,
"source_basis": {
"source_file": "algebra_notes.pdf",
"source_page": 12,
"source_problem_id": "seed_doc001_p012_q03"
},
"generation": {
"method": "template_parameterized",
"generator_version": "linear_equation_v1",
"llm_used_for_wording": false
},
"verification": {
"method": "python_symbolic_check",
"status": "passed"
},
"review": {
"status": "approved"
}
}
Then export separate views:
raw_extracted_problems.jsonl
verified_synthetic_math.jsonl
train_sft_prompt_completion.jsonl
train_sft_chat.jsonl
eval_verified_only.jsonl
human_review_queue.jsonl
This makes the data easier to audit and repair.
13. Dataset format for SFT
If you plan to use TRL SFTTrainer, common formats include:
- plain text prompt/completion
- chat-style messages
Prompt-completion example:
{"prompt":"Solve 2x + 3 = 11.","completion":"The answer is x = 4. Subtract 3 from both sides to get 2x = 8, then divide by 2."}
Chat format example:
{"messages":[{"role":"user","content":"Solve 2x + 3 = 11."},{"role":"assistant","content":"The answer is x = 4. Subtract 3 from both sides to get 2x = 8, then divide by 2."}]}
Again, I would treat these as exported training views, not as the only source of truth.
14. Dataset card and licensing notes
If the dataset is uploaded to Hugging Face, write a real dataset card.
Include:
- source document description
- whether examples are extracted, synthetic, or both
- generation method
- verification method
- human review process
- license
- intended use
- out-of-scope use
- known limitations
- data fields
- train/validation/test split method
- benchmark contamination precautions
- copyright / rights statement
Important caution: if the PDFs are copyrighted textbooks, paid worksheets, scanned books, course materials, or proprietary documents, you may not have the right to publish extracted or derived datasets. Even synthetic variants can be risky if they are too close to the source problems.
Private or gated datasets help with access control, but they do not automatically solve licensing problems.
15. Current Hugging Face Datasets caution
For a new dataset, I would prefer simple data files such as JSONL or Parquet plus a clear README/dataset card.
Avoid relying on old examples that require custom remote dataset loading scripts.
Relevant links:
- Datasets loading docs
- Uploading datasets to the Hub
- GitHub issue: Dataset scripts are no longer supported
- Forum: Dataset scripts are no longer supported
For a new project, JSONL/Parquet + metadata + dataset card is usually the safer path.
16. Suggested proof of concept
I would not begin with all PDFs.
Start small:
20-50 pages
↓
extract 100-300 seed problems
↓
classify by topic and problem type
↓
choose 3-5 problem types that are easy to verify
↓
generate 1,000-5,000 synthetic variants
↓
verify with Python/SymPy
↓
manually review a sample
↓
upload a private HF dataset
Measure:
- OCR/formula error rate
- problem segmentation accuracy
- topic classification accuracy
- answer verification pass rate
- duplicate / near-duplicate rate
- human review time per accepted example
- final usable examples per source page
This will tell you whether full-scale generation is realistic.
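Two of these metrics can be computed directly from the record schema used earlier. A rough sketch (the duplicate check is exact-match after crude normalization, so it undercounts near-duplicates):

import json
import re

def normalize(problem: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return re.sub(r"\s+", " ", problem.lower()).strip()

def poc_metrics(path: str) -> dict:
    records = [json.loads(line) for line in open(path)]
    passed = sum(
        1 for r in records
        if r.get("verification", {}).get("status") == "passed"
    )
    unique = len({normalize(r["problem"]) for r in records})
    return {
        "answer_verification_pass_rate": passed / len(records),
        "duplicate_rate": 1 - unique / len(records),
    }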
Practical recommendation
If I were building this, I would do the following:
- Convert PDFs/scans to Markdown, LaTeX, or structured JSON using Mathpix, Docling, or Marker.
- Extract seed problems with page number, bounding box, and raw OCR preserved.
- Use an LLM to classify topic, subtopic, problem type, and difficulty.
- Start with a few high-frequency, easy-to-verify problem types.
- Build template or code generators for those types.
- Generate answers using code or SymPy when possible.
- Use an LLM for natural wording and explanation polishing, not as the final verifier.
- Run automatic checks.
- Send uncertain examples to Label Studio or Argilla.
- Store the rich dataset as JSONL or Parquet.
- Export a separate SFT-ready view.
- Upload first as a private Hugging Face dataset.
- Add a dataset card explaining source, generation, verification, review, license, and limitations.
The safest summary is:
Extract seed problems from the PDFs, learn the problem patterns, generate new variants with templates or code, verify answers programmatically, then package the result with Hugging Face Datasets.
That is much safer than directly asking an LLM to generate thousands of math problem-answer pairs from PDFs.