Docling image captioning best VLM

swtb · April 25, 2025, 2:37pm

What is the current SOTA model for captioning images in documents?

I need good descriptions of diagrams. Most of the captions I have seen are very basic, such as “the image contains a woman in a blue dress”. I need something more like “The figure shows a flowchart representing a process of… that starts with… and ends with… key steps are…”

Or “The image depicts a scene in which people walk about in a modern cafe; key elements of the cafe’s design are…”

In other words, I need a good paragraph that offers some insight into the image.

Any suggestions on models?

John6666 · April 25, 2025, 3:33pm

I’m not sure which VLM is strongest at understanding the context of image content…
How about trying out a few VLMs that seem to perform reasonably well…
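
Whichever model you end up trying, the prompt probably matters as much as the model for getting a paragraph rather than a one-line caption. Below is a rough sketch of a prompt that asks for the kind of description you gave as examples. `build_caption_prompt` is a hypothetical helper written for this post, not part of Docling or any other library, and the wording is only an assumption about what might work; adjust it to your documents.

```python
def build_caption_prompt(kind: str = "figure", max_sentences: int = 5) -> str:
    """Build an instruction asking a VLM for a paragraph-length description.

    Hypothetical helper: the wording is an untested assumption, not a
    recommended or official prompt for any particular model.
    """
    return (
        f"Describe this {kind} in one paragraph of at most {max_sentences} sentences. "
        "State what type of image it is (for example a flowchart, chart, or photo), "
        "where it starts and ends if it shows a process, and its key steps or elements. "
        "Do not just list the objects in the image."
    )


# Build the prompt once and reuse it for every picture in the document.
prompt = build_caption_prompt("diagram")
print(prompt)
```

The resulting string would be passed as the text prompt alongside each image to whichever VLM you test, which also makes it easier to compare models on the same instruction.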
