Best VLM for image captioning in Docling?

What is the current SOTA model for captioning images in documents?

I need good descriptions of diagrams. Most of the models I have seen produce very basic descriptions, such as “The image contains a woman in a blue dress.” I need something more like “The figure shows a flowchart representing a process of… that starts with… and ends with… key steps are…”

Or “The image depicts a scene in which people walk about in a modern cafe, key elements of the cafe's design are…”

In other words, I need a good paragraph that offers some insight into the image.

Any suggestions on models?

I’m not sure which VLM is strongest at understanding the context of image content…
How about trying out a few VLMs that seem to perform reasonably well…
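
While comparing models, it may also help to send each one the same detailed prompt, since a short generic prompt can lead to a short generic caption. Below is a minimal sketch in plain Python that only builds the prompt text. `build_caption_prompt` and `DIAGRAM_HINTS` are hypothetical names made up for illustration, not part of Docling or any VLM library, and how the prompt gets passed to a model depends on your setup.

```python
# Hypothetical helper: these names are illustrative and do not come from
# Docling or any VLM library. It only assembles an instruction string.
DIAGRAM_HINTS = (
    "what kind of figure it is (flowchart, chart, schematic, photo, ...)",
    "where the process or layout starts and where it ends",
    "the key steps, elements, or relationships it shows",
)


def build_caption_prompt(max_sentences: int = 5) -> str:
    """Return an instruction asking for an insightful paragraph, not a one-line label."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    hints = "; ".join(DIAGRAM_HINTS)
    return (
        f"Describe this image in one paragraph of at most {max_sentences} sentences. "
        f"Cover: {hints}. "
        "Do not just list the objects you can see."
    )


if __name__ == "__main__":
    # Print the prompt so it can be pasted into whichever model is being tried.
    print(build_caption_prompt())
```

Using one fixed prompt for every candidate model should make the captions easier to compare side by side.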