Is there a way in Tesseract to capture text meta-data along with text?

Question

I am trying to find out whether text metadata such as font size, font family, and bold/italic can be captured using Tesseract. Below is the code I tried, but it did not work: it printed "None". I am using Tesseract version 4.1.1 and Tesseract-OCR engine version 5.0.0.

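import io

import tesserocr
from PIL import Image

# Image_file_location is the path to the input image (defined elsewhere).
with open(Image_file_location, "rb") as image_file:
    b = bytearray(image_file.read())

with tesserocr.PyTessBaseAPI() as api:
    image = Image.open(io.BytesIO(b))
    api.SetImage(image)
    api.Recognize()
    iterator = api.GetIterator()
    # Prints None instead of the font attributes of the first word.
    print(iterator.WordFontAttributes())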

So far I have been able to capture the text correctly with Tesseract, but not the metadata. I have attached a sample image file and an example of the expected output.

Expected Output:

[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size"] GCEO Review

[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size"] Dear Shareholders,

[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size"] TURNING THE....

[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size"] We have executed well and gained mobile share in our core.........

In other words, wherever the metadata changes, I want to capture the new attributes and prepend them to the text that follows.
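
To make the desired grouping concrete, here is a minimal sketch of the post-processing step only. It assumes that per-word attributes are already available as (word, attributes) pairs, which is exactly the part I cannot get out of Tesseract yet. The function names, the dict keys (font_name, bold, italic, pointsize) and the sample values are placeholders for illustration, and Font_family is left out for brevity.

```python
from itertools import groupby


def format_attrs(attrs):
    """Render one attribute dict in the bracketed style of the expected output."""
    parts = [f'Font:"{attrs["font_name"]}"']
    if attrs.get("bold"):
        parts.append("Bold")
    if attrs.get("italic"):
        parts.append("Italic")
    parts.append(f'font_size:"{attrs["pointsize"]}"')
    return "[" + ", ".join(parts) + "]"


def annotate_runs(words):
    """Group consecutive words with identical attributes and prefix each run once."""
    lines = []
    # groupby starts a new group whenever the rendered label changes.
    for label, run in groupby(words, key=lambda pair: format_attrs(pair[1])):
        text = " ".join(word for word, _ in run)
        lines.append(f"{label} {text}")
    return lines


# Hypothetical input: made-up attribute values, not real OCR output.
heading = {"font_name": "Arial", "bold": True, "italic": False, "pointsize": 18}
body = {"font_name": "Arial", "bold": False, "italic": False, "pointsize": 11}
sample = [
    ("GCEO", heading),
    ("Review", heading),
    ("Dear", body),
    ("Shareholders,", body),
]

for line in annotate_runs(sample):
    print(line)
```

With something like this in place, the open question is only how to obtain reliable attributes for each word.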

No answer/suggestions from any fellow member... Does this mean, it's not possible?? — Crusader, Sep 12 '20 at 18:33
I'm about to ask a similar question (although with emphasis on bolding), but I fear the answer is "no", at least not in Tesseract. Stand by... — Mike Maxwell, May 17 '21 at 21:38
ok, my post is https://stackoverflow.com/questions/67577793/. — Mike Maxwell, May 17 '21 at 22:50


0 Answers