
I would like to extract only the bold text from an image using Tesseract and Java.

Example:

Thanksgiving day (where only "Thanksgiving" is bold)

I need "Thanksgiving" as text from the image.

It helps enormously here if you can show whatever research you have, and any code you have written. – halfer Nov 26 '21 at 11:17

1 Answer


Tesseract does not provide this information directly, but there are a few things you could look into:

A) In Tesseract 3, the recognition result includes metadata about the recognized font. It is probably not very reliable, but it might work for common fonts and distinguish bold from non-bold text.
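As a rough sketch of option A: if the font metadata is written into the hOCR output (I am assuming here that the `hocr_font_info` config variable is enabled, which adds an `x_font` entry to each word's `title` attribute), you could keep only the words whose font name contains "bold". The class name `HocrBoldFilter`, the sample markup, and the font names are made up for illustration, and the regex is not a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HocrBoldFilter {
    // One hOCR word span: group 1 is the title attribute, group 2 the inner text.
    private static final Pattern WORD = Pattern.compile(
        "<span class=['\"]ocrx_word['\"][^>]*title=['\"]([^'\"]*)['\"][^>]*>(.*?)</span>",
        Pattern.DOTALL);
    // Font name inside the title attribute, e.g. "x_font Arial_Bold".
    private static final Pattern FONT = Pattern.compile("x_font\\s+([^;]+)");

    /** Returns the text of every word whose reported font name contains "bold". */
    static List<String> boldWords(String hocr) {
        List<String> result = new ArrayList<>();
        Matcher word = WORD.matcher(hocr);
        while (word.find()) {
            Matcher font = FONT.matcher(word.group(1));
            if (font.find() && font.group(1).toLowerCase(Locale.ROOT).contains("bold")) {
                // Drop any nested markup inside the word span.
                result.add(word.group(2).replaceAll("<[^>]+>", "").trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical hOCR fragment; real output depends on the Tesseract version and fonts.
        String hocr =
            "<span class='ocrx_word' id='word_1_1' "
            + "title='bbox 10 10 120 30; x_wconf 93; x_font Arial_Bold; x_fsize 12'>Thanksgiving</span> "
            + "<span class='ocrx_word' id='word_1_2' "
            + "title='bbox 130 10 170 30; x_wconf 95; x_font Arial; x_fsize 12'>day</span>";
        System.out.println(boldWords(hocr)); // prints [Thanksgiving]
    }
}
```

Whether the reported font name is trustworthy is exactly the reliability question above, so treat the result as a hint rather than ground truth.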

B) In Tesseract 4 you can export hOCR output and configure it to include a box around each character (I am not sure whether Tesseract 3 supports this). I am not sure how reliable these boxes are, but if they are good enough, you could feed them to a second algorithm (e.g. a small convolutional neural network) that classifies whether a single character is bold, and then remove the non-bold text from the Tesseract output.
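To illustrate the shape of option B without training a network, here is a minimal sketch that swaps the CNN for a much cruder stand-in: the fraction of dark pixels inside each box. It assumes you already have the boxes as `java.awt.Rectangle` objects (for example parsed from the hOCR output). The class name `BoldBoxFilter` and the threshold are hypothetical; the threshold would have to be calibrated for your font and size, and density also varies between characters, so this works better on word boxes than on single characters.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoldBoxFilter {
    /** Fraction of dark pixels (gray value below 128) inside the box, from 0.0 to 1.0. */
    static double inkDensity(BufferedImage img, Rectangle box) {
        // Clip the box to the image so out-of-range boxes do not throw.
        Rectangle r = box.intersection(new Rectangle(0, 0, img.getWidth(), img.getHeight()));
        if (r.isEmpty()) {
            return 0.0;
        }
        int dark = 0;
        for (int y = r.y; y < r.y + r.height; y++) {
            for (int x = r.x; x < r.x + r.width; x++) {
                int rgb = img.getRGB(x, y);
                int gray = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < 128) {
                    dark++;
                }
            }
        }
        return (double) dark / (r.width * r.height);
    }

    /** Keeps the boxes whose ink density reaches the (hypothetical) threshold. */
    static List<Rectangle> likelyBold(BufferedImage img, List<Rectangle> boxes, double threshold) {
        List<Rectangle> result = new ArrayList<>();
        for (Rectangle box : boxes) {
            if (inkDensity(img, box) >= threshold) {
                result.add(box);
            }
        }
        return result;
    }

    /** Helper for the demo: paints a solid rectangle into the image. */
    static void fill(BufferedImage img, int x0, int y0, int w, int h, int rgb) {
        for (int y = y0; y < y0 + h; y++) {
            for (int x = x0; x < x0 + w; x++) {
                img.setRGB(x, y, rgb);
            }
        }
    }

    public static void main(String[] args) {
        // Synthetic image: white background, one thin stroke and one thick stroke.
        BufferedImage img = new BufferedImage(40, 20, BufferedImage.TYPE_INT_RGB);
        fill(img, 0, 0, 40, 20, 0xFFFFFF);
        fill(img, 5, 0, 1, 20, 0x000000);   // thin stroke, 1 px wide
        fill(img, 22, 0, 4, 20, 0x000000);  // thick stroke, 4 px wide
        List<Rectangle> boxes = Arrays.asList(
            new Rectangle(0, 0, 10, 20), new Rectangle(20, 0, 10, 20));
        System.out.println(likelyBold(img, boxes, 0.25));
    }
}
```

In the demo, the thin stroke fills 10% of its box and the thick one 40%, so only the second box passes the 0.25 threshold. A trained classifier would replace `inkDensity` while the filtering step stays the same.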

C) If you have precise line boxes before running Tesseract, you could also look into training an algorithm (e.g. a fully convolutional neural network) that segments the bold part of each line, then crop the image and run Tesseract only on the bold parts. This would probably be the most technically involved solution, but I think it could work as well.

jns