8

Is it possible to get the font of the characters recognized by Tesseract-OCR, i.e. whether they are Arial or Times New Roman, either from the command line or using the API?

I'm scanning documents that might have different parts with different fonts, and it would be useful to have this information.

sashoalm

2 Answers

7

Tesseract's API has a WordFontAttributes function that you can call on a ResultIterator (it is declared in the base class, LTRResultIterator).

nguyenq
  • 4
    As of a recent Tesseract version, `WordFontAttributes` seems to return `None` no matter what. https://github.com/tesseract-ocr/tesseract/issues/1074 – Inaimathi Sep 20 '18 at 18:08
  • 2
    Yes, I can see why: it is using a neural network now. Any further updates would be appreciated. – NONONONONO Mar 17 '19 at 23:59
6

Based on nguyenq's answer, I wrote a simple Python script that prints the font name for each detected character. The script uses the Python library tesserocr.

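from tesserocr import PyTessBaseAPI, RIL, iterate_level

def get_font(image_path):
    """Print the font name reported for each recognized symbol."""
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()
        level = RIL.SYMBOL

        for r in iterate_level(ri, level):
            symbol = r.GetUTF8Text(level)
            # Font attributes belong to the word containing the symbol;
            # this can be None when the engine reports no font data.
            word_attributes = r.WordFontAttributes()
            font_name = word_attributes['font_name'] if word_attributes else None

            if symbol:
                print(u'symbol {}, font: {}'.format(symbol, font_name))

get_font('logo.jpg')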
Pikamander2
szuuuken
  • 5
    Which config and versions of Tesseract, tessdata, and other dependencies did you use? I am getting word_attributes as None. – Lalit Jha Jul 31 '19 at 14:45
  • I also get word_attributes as None. Has anyone figured this out? – Reed Jones Sep 06 '22 at 17:04
  • On the link above it says "The LSTM engine does not support font attributes other than point size, and as I said 4 years ago, it won't support these attributes any time soon (It is not planned). However, the legacy engine is still available in versions 4.x and 5.x and it supports these attributes. You need a model that includes data for the legacy engine and you need to use --oem 0 (It might also work with --oem 3, not sure)." – Matt Hudson Apr 12 '23 at 20:32
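
Following the quoted advice, here is a minimal sketch of the same idea with the legacy engine selected through tesserocr's oem argument (OEM.TESSERACT_ONLY, the counterpart of --oem 0). It is untested against any particular Tesseract build and assumes the installed traineddata includes legacy-engine data. The helper names fonts_by_word and group_by_font are hypothetical, and treating a missing font as 'unknown' is just one possible fallback. It iterates at word level, since the font attributes are reported per word.

```python
from collections import defaultdict


def group_by_font(word_fonts):
    """Group (word, attributes) pairs by font name.

    `attributes` is the dict returned by WordFontAttributes(), or None
    when the engine reports no font information.
    """
    groups = defaultdict(list)
    for word, attrs in word_fonts:
        font = attrs.get('font_name') if attrs else None
        groups[font or 'unknown'].append(word)
    return dict(groups)


def fonts_by_word(image_path, lang='eng'):
    """Yield (word, attributes) pairs using the legacy engine (--oem 0)."""
    # Imported lazily so group_by_font works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, OEM, iterate_level

    # Assumption: the traineddata for `lang` contains legacy-engine data.
    with PyTessBaseAPI(lang=lang, oem=OEM.TESSERACT_ONLY) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        for r in iterate_level(api.GetIterator(), RIL.WORD):
            word = r.GetUTF8Text(RIL.WORD)
            if word:
                yield word, r.WordFontAttributes()
```

Calling group_by_font(fonts_by_word('logo.jpg')) would then give a dict mapping each reported font name to the words set in it, which is closer to what the question asks for (which parts of a document use which font) than a per-character printout.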