I'm extracting text from Word files with the WordExtractor class (Apache POI), but some .doc files throw an error. While debugging, I saw that the line causing the problem is the last one here:

HWPFDocument docx = new HWPFDocument(new FileInputStream(file));
WordExtractor we = new WordExtractor(docx);
String T = we.getText().replaceAll("\\n", " ").replaceAll("\\r", " ");

It works fine for most .docx and .doc files.

The error message is:

Exception in thread "main" java.lang.RuntimeException: 
java.lang.IllegalArgumentException: The end (4958) must not be before the start (4990)

How can I fix it?

jmarkmurphy
  • Please add the full stack trace. – Arnaud Jan 05 '17 at 15:12
  • Generally, unless you are using regular expressions, use [`replace`](http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#replace-java.lang.CharSequence-java.lang.CharSequence-) rather than [`replaceAll`](http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#replaceAll-java.lang.String-java.lang.String-). – khelwood Jan 05 '17 at 15:12
  • It's clearly not being thrown by the `replaceAll`s, so it must be the call to `we.getText()`. Either there is a problem with how you've initialised the WordExtractor, the Word documents are corrupt, or there is a bug in the library. Please post the line where the `we` object gets created. – Michael Jan 05 '17 at 15:21
  • Which version of Apache POI are you using? Are you sure that you have the latest? – Hrabosch Jan 05 '17 at 15:21
  • Hi, thanks for the answers. I'm using Apache POI version 3.15. I added the line where the `we` object gets created. – Bustami - Ismael Gómez Jan 05 '17 at 15:27
  • I'm pretty sure I see what the problem is. `XWPFWordExtractor` is only for `.docx` files. I don't think it'll work for regular `.doc` files. Try using the base WordExtractor, e.g. `WordExtractor we = new WordExtractor(new FileInputStream(file));` This would explain why it works for some documents and not others, at least. – Michael Jan 05 '17 at 15:39
  • Thanks, I tried that, but it still doesn't work. This is all strange because it works well with a lot of `.doc` and `.docx` files. – Bustami - Ismael Gómez Jan 05 '17 at 15:49
  • Create an issue at https://bz.apache.org/bugzilla/buglist.cgi?product=POI, attach a file that fails and a minimal snippet of code that shows the problem, and the devs will take a look at it. – jmarkmurphy Jan 05 '17 at 15:53
  • Here is the bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=60556. Nothing yet. – Bustami - Ismael Gómez Jan 12 '17 at 18:44
  • This bug was fixed some time ago (https://bz.apache.org/bugzilla/show_bug.cgi?id=60556). The problem was caused by some hidden bookmarks in .doc files! – Bustami - Ismael Gómez Nov 15 '17 at 20:40

1 Answer

From the docs, XWPFWordExtractor is a:

Helper class to extract text from an OOXML Word file

So this is your problem :) And the solution, also from their docs:

For .doc files from Word 97 - Word 2003, in scratchpad there is org.apache.poi.hwpf.extractor.WordExtractor, which will return text for your document.

Those using POI 3.7 can also extract simple textual content from older Word 6 and Word 95 files, using the scratchpad class org.apache.poi.hwpf.extractor.Word6Extractor.

For .docx files, the relevant class is org.apache.poi.xwpf.extractor.XWPFWordExtractor.
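
Since a file's extension can be wrong, one way to decide which of the two extractors to use is to look at the first bytes of the file. The following is only a minimal sketch using plain Java (no POI calls): binary .doc files start with the OLE2 compound file signature, and .docx files are ZIP packages. The class `WordFormatSniffer` and its methods are hypothetical names made up for this illustration; the returned strings are the extractor class names from the docs quoted above (with the .docx one spelled XWPFWordExtractor, as in the first quote).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Apache POI.
final class WordFormatSniffer {

    // OLE2 compound file signature: used by binary .doc files,
    // but also by .xls and .ppt, so a match only means "OLE2 container".
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature: .docx files are ZIP packages,
    // but so is any other ZIP archive.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Returns the name of the extractor class to try, or null if the
    // leading bytes match neither signature.
    static String extractorFor(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "org.apache.poi.hwpf.extractor.WordExtractor";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "org.apache.poi.xwpf.extractor.XWPFWordExtractor";
        }
        return null;
    }

    // Convenience overload that reads the first 8 bytes of a file (Java 11+).
    static String extractorFor(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return extractorFor(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    // True if data begins with every byte of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this only helps when the wrong extractor is being used for the file type. It would not help with a file that is a genuine .doc but still fails inside `getText()`, as with the bug discussed in the comments.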

Hrabosch