Exploring DeepSeek-OCR: How Well Can It Read Images?

Today, I was surprised by DeepSeek’s most recent paper. Under the title “DeepSeek-OCR: Contexts Optical Compression”, they present the DeepSeek-OCR model, which is already available on Hugging Face.

In this post, we will explore this model and understand how it extracts details from images, using the prompts provided in their repository.

Playing with DeepSeek-OCR

A fast and easy way to try the model’s capabilities is to go directly to Khang Doan’s space on Hugging Face.

With this neat interface, you can upload an image (even from the clipboard), select the resolution, and choose the task type. The latter is the most interesting part, since it defines how the model will behave.

The task types come from the examples provided in the DeepSeek-OCR repository.

# document: <image>\n<|grounding|>Convert the document to markdown.
# without layouts: <image>\nFree OCR.
# figures in document: <image>\nParse the figure.
# rec: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.
# general: <image>\nDescribe this image in detail.

To have more control over the model’s execution, you can run this Google Colab notebook.
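If you prefer running the model on your own machine, the Hugging Face model card shows a `transformers` workflow along these lines. The `infer` call below comes from DeepSeek’s custom model code loaded via `trust_remote_code=True`; treat its exact arguments as an assumption and check the card before relying on them:

```python
# Sketch of running DeepSeek-OCR locally, based on the Hugging Face model card.
# The infer() call is provided by DeepSeek's custom model code
# (trust_remote_code=True); verify its signature against the card.

# The task-type prompts listed above, keyed by task name.
PROMPTS = {
    "document": "<image>\n<|grounding|>Convert the document to markdown.",
    "without_layouts": "<image>\nFree OCR.",
    "figures": "<image>\nParse the figure.",
    "rec": "<image>\nLocate <|ref|>xxxx<|/ref|> in the image.",
    "general": "<image>\nDescribe this image in detail.",
}

def run_ocr(image_file: str, task: str = "document") -> None:
    # Imported here so the prompt table above is usable without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    model_name = "deepseek-ai/DeepSeek-OCR"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval().cuda()
    model.infer(
        tokenizer,
        prompt=PROMPTS[task],
        image_file=image_file,
        output_path="./output",
        save_results=True,
    )
```

You need a CUDA GPU for this; the Colab above takes care of the environment for you.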

Experiments

Describing the image in detail

Prompt: <image>\nDescribe this image in detail.

Result:
This image displays a newspaper article about the stock market. The article is titled “Nasdaq & AMEX” and it is dated “Tuesday, March 12, 2009”. The article is about the stock market performance of the Nasdaq and the American Stock Exchange (AMEX) over the past week. The article states that the Nasdaq is up 52.1% and the AMEX is up 51.7% over the past week. The article also states that the S&P 500 is up 1.2% over the past week. The article is written in a professional and objective tone. The article is written in a font called “Arial”. The image is a screenshot of the article.

Convert to markdown

Using the same image, but changing the prompt to <image>\n<|grounding|>Convert the document to markdown., we get:

<|ref|>title<|/ref|><|det|>[[65, 142, 740, 255]]<|/det|>
Nasdaq & AMEX

<|ref|>text<|/ref|><|det|>[[63, 275, 585, 320]]<|/det|>
Stocks in bold rose or fell 5% or more

<|ref|>image<|/ref|><|det|>[[65, 339, 203, 450]]<|/det|>

<|ref|>text<|/ref|><|det|>[[210, 333, 886, 440]]<|/det|>
Track your investments with our continuously updated stocks. Visit us on the web at money.usatoday.com

<|ref|>table<|/ref|><|det|>[[55, 456, 976, 996]]<|/det|>

<table><tr><td colspan="3">52-week</td><td colspan="3">52-week</td><td colspan="3"></td></tr><tr><td>High</td><td>Low</td><td>Stock</td><td>Last Change</td><td>High</td><td>Low</td><td>Stock</td><td>Last Change</td></tr><tr><td></td><td></td><td></td><td></td><td>45.71</td><td>32.50</td><td>Biomet</td><td>36.71 -0.42</td></tr><tr><td></td><td></td><td></td><td></td><td>2.76</td><td>1.20</td><td>Biomira</td><td>1.46 +0.03</td></tr><tr><td></td><td></td><td></td><td></td><td>9.07</td><td>5.13</td><td>BioScrip</td><td>8.05 +0.34</td></tr><tr><td>9.19</td><td>6.89</td><td>ABX Air n</td><td>7.52-0.10</td><td>68.88</td><td>50.65</td><td>Biosite</td><td>50.05 -4.57</td></tr><tr><td>33.25</td><td>12.40</td><td>ACMoore</td><td>13.58-1.57</td><td>212.25</td><td>131.03</td><td>BiotechT</td><td>204.66 -0.84</td></tr><tr><td>31.38</td><td>13.51</td><td>ADA-ES</td><td>20.96+3.16</td><td>8.50</td><td>1.40</td><td>BirchMtf gn</td><td>6.52 -0.45</td></tr><tr><td>27.14</td><td>12.88</td><td>ADC Tel rs</td><td>23.21+0.13</td><td>18.21</td><td>10.73</td><td>Bickboud</td><td>17.90 +0.70</td></tr><tr><td>30.40</td><td>16.70</td><td>ADECP</td><td>27.32+0.73</td><td>52.73</td><td>13.86</td><td>BluCoat</td><td>41.29 +1.30</td></tr><tr><td>16.45</td><td>10.47</td><td>AFC Ent s</td><td>15.40-0.14</td><td>44.35</td><td>24.15</td><td>BlueNile</td><td>40.30 -1.10</td></tr><tr><td>8.37</td><td>4.50</td><td>ASE Tst</td><td>7.76+0.40</td><td>26.45</td><td>19.91</td><td>BobEvn</td><td>22.99</td></tr><tr><td>19.25</td><td>12.75</td><td>ASM Intl</td><td>17.65-0.03</td><td>15.94</td><td>6.12</td><td>Bodisen n</td><td>15.45 +0.45</td></tr><tr><td>20.92</td><td>13.94</td><td>ASML Hid</td><td>21.24+0.46</td><td>6.21</td><td>1.56</td><td>Bookham</td><td>5.94 +0.06</td></tr><tr><td>27.38</td><td>16.39</td><td>ASV Inc s</td><td>26.76+0.14</td><td>11.80</td><td>4.99</td><td>Borland</td><td>6.68 +0.14</td></tr><tr><td>19.82</td><td>10.47</td><td>ATI 
Tech</td><td>17.89+0.68</td><td>31.90</td><td>21.10</td><td>BostPrv</td><td>31.18 -0.07</td></tr><tr><td>33.62</td><td>20.53</td><td>ATMI Inc</td><td>29.95+1.29</td><td>18.62</td><td>10.01</td><td>BftmlnT</td><td>11.53 +0.20</td></tr><tr><td>39.20</td><td>16.76</td><td>ATP O&G</td><td>38.40-0.59</td><td>14.68</td><td>7.10</td><td>BrigExp</td><td>12.10 -0.23</td></tr><tr><td>4.24</td><td>1.99</td><td>AVI Bio</td><td>3.62-0.02</td><td>46.72</td><td>26.65</td><td>BrightHrz s</td><td>38.90 -0.80</td></tr></table>
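The grounding tokens in this output follow a simple pattern — `<|ref|>label<|/ref|>` immediately followed by `<|det|>[[x1, y1, x2, y2]]<|/det|>` — so they can be pulled out with a regex. A minimal sketch (the helper names are mine):

```python
import ast
import re

# Matches <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|> pairs
# in DeepSeek-OCR's grounded markdown output.
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>", re.DOTALL
)

def parse_grounding(text: str) -> list[tuple[str, list[list[int]]]]:
    """Return (label, boxes) pairs; each box is [x1, y1, x2, y2]."""
    return [
        (label, ast.literal_eval(boxes))
        for label, boxes in GROUNDING_RE.findall(text)
    ]

sample = (
    "<|ref|>title<|/ref|><|det|>[[65, 142, 740, 255]]<|/det|>\n"
    "Nasdaq & AMEX\n"
    "<|ref|>text<|/ref|><|det|>[[63, 275, 585, 320]]<|/det|>\n"
)
regions = parse_grounding(sample)
# regions[0] -> ("title", [[65, 142, 740, 255]])
```

If the coordinates are normalized (similar DeepSeek grounding formats use a 0–1000 range — verify this against the repo), scale each value by `width / 1000` or `height / 1000` to recover pixel boxes.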

Free OCR

Prompt: <image>\nFree OCR.

Result:
<|begin▁of▁sentence|>
#3.1.1.1.1.1.1.1.1.1.1.2.1.1.1.1.1.1.1.1.1.3.1.1.1.1.1.1.1.1.1
# 3.1.1.1.1.1.1.1.1
## 3.1.1.1.1.1.1.1.1.2

This is not what I expected. I tried the same image, changing the size from Gundam to Small, and got:

<|begin▁of▁sentence|>"""
    Banco Santander, S.A.
    Instituto de Banca Múltiple
    Grupo Financiero Santander

2408:7002801
"""

This is what I expected. If your image is low quality, you can adjust the resolution parameters until the model frames it better.
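The size names (Gundam, Small, etc.) map to resolution presets in the DeepSeek-OCR repository. As I understand the model card, each preset is a combination of `base_size`, `image_size`, and a dynamic-cropping flag — treat the exact numbers below as assumptions and verify them against the repo:

```python
# Resolution presets for DeepSeek-OCR, as I read them from the model card.
# Each mode sets base_size, image_size, and whether dynamic cropping is used.
# The exact values are assumptions; verify against the repository.
MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def infer_kwargs(mode: str) -> dict:
    """Keyword arguments to pass along to the model's infer() for a preset."""
    return dict(MODES[mode])
```

Switching from Gundam to Small, as I did above, just means passing different values here; for a small, low-quality crop like the bank slip, the non-cropping Small preset apparently reads better than Gundam’s dynamic tiling.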

Figures in document

Prompt: <image>\nParse the figure.

Result:
Employee Count by Department
Percentage
Marketing15.2%
Sales18.2%
Finance12.1%
Human Resources9.1%
Operations24.2%
IT21.2%

It did well here, but swapped the IT and Operations values.
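Mistakes like the swapped IT/Operations values are easy to catch mechanically: for a pie chart, the parsed percentages should sum to roughly 100. A small check — the assumption that the output comes as `label` glued to `value%` lines, as above, is mine:

```python
import re

# Parses lines like "Operations24.2%" (label glued to its percentage, as in
# the model output above) into a {label: value} dict.
LINE_RE = re.compile(r"^(.*?)(\d+(?:\.\d+)?)%$")

def parse_pie(lines: list[str]) -> dict[str, float]:
    out = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            out[m.group(1).strip()] = float(m.group(2))
    return out

parsed = parse_pie([
    "Marketing15.2%", "Sales18.2%", "Finance12.1%",
    "Human Resources9.1%", "Operations24.2%", "IT21.2%",
])
total = sum(parsed.values())  # ~100.0 for a well-parsed pie chart
```

A sum far from 100 flags a parsing or recognition problem, although it would not catch this particular swap, since swapping two slices leaves the total unchanged; for that you would need to cross-check against the source figure.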

Locate reference in image

Prompt: <image>\nLocate <|ref|>the hat<|/ref|> in the image.

Result: for this task the output is rendered as a bounding box drawn over the matched region of the image, so there is no text output to reproduce here.
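The `rec` prompt is a template: you replace the `xxxx` between the `<|ref|>` tags with whatever you want to locate. A trivial helper (the template comes from the repo; the function name is mine):

```python
# Builds the localization ("rec") prompt from the template in the
# DeepSeek-OCR repository, substituting the object to locate.
def rec_prompt(target: str) -> str:
    return f"<image>\nLocate <|ref|>{target}<|/ref|> in the image."

prompt = rec_prompt("the hat")
# prompt == "<image>\nLocate <|ref|>the hat<|/ref|> in the image."
```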

Conclusion

The DeepSeek-OCR model is quite versatile and can handle various tasks related to extracting information from images. It’s interesting how it combines OCR capabilities with contextual understanding to provide more than just raw text extraction. The real idea behind DeepSeek-OCR is not just OCR: it is a step towards compressing text information through the visual modality, since a document image can represent rich information with far fewer tokens than the text itself.
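To make the compression idea concrete, here is the back-of-the-envelope arithmetic. All the numbers below are illustrative assumptions of mine, not figures from the paper:

```python
# Back-of-the-envelope: what a page costs as text tokens vs. vision tokens.
# All numbers here are illustrative assumptions, not measurements from the paper.
words_per_page = 600                      # a dense document page
text_tokens = int(words_per_page * 1.3)   # rough tokens-per-word ratio
vision_tokens = 100                       # the page encoded as a small grid of vision tokens

compression_ratio = text_tokens / vision_tokens  # tokens saved by going visual
```

Under these assumptions, feeding the page as an image costs several times fewer tokens than feeding it as text, which is exactly the trade the paper is exploring.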

This opens up the possibility of much bigger context windows, and maybe more. Let’s see how this field evolves!

Edy Silva
