Error with getText().replaceAll() in Java

Question

I'm extracting text with the WordExtractor class (Apache POI), but I get an error for some .doc files. While debugging, I saw that the failing line is the last one here:

HWPFDocument docx = new HWPFDocument(new FileInputStream(file));
WordExtractor we = new WordExtractor(docx);
String T = we.getText().replaceAll("\\n", " ").replaceAll("\\r", " ");

For most .docx and .doc files it works fine.

The error message is:

Exception in thread "main" java.lang.RuntimeException: 
java.lang.IllegalArgumentException: The end (4958) must not be before the start (4990)

How can I fix it?

Generally, unless you are using regular expressions, use [`replace`](http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#replace-java.lang.CharSequence-java.lang.CharSequence-) rather than [`replaceAll`](http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#replaceAll-java.lang.String-java.lang.String-). — khelwood, Jan 05 '17 at 15:12
It's clearly not being thrown by the `replaceAll`s, so it must be the call to `we.getText()`. There is either a problem with how you've initialised the WordExtractor, the Word documents are corrupt, or there is a bug in the library. Please post the line where the `we` object gets created. — Michael, Jan 05 '17 at 15:21
Which version of Apache POI are you using? Are you sure that you have the latest? — Hrabosch, Jan 05 '17 at 15:21
Hi, thanks for the answers. I'm using Apache POI version 3.15. I added the line where the `we` object gets created. — Bustami - Ismael Gómez, Jan 05 '17 at 15:27
I'm pretty sure I see what the problem is. `XWPFWordExtractor` is only for `.docx` files. I don't think it'll work for regular `.doc` files. Try using the base WordExtractor e.g. `WordExtractor we = new WordExtractor(new FileInputStream(file));` This would explain why it works for some documents and not others, at least. — Michael, Jan 05 '17 at 15:39
Thanks, I tried that, but it still doesn't work. All this is strange because it works well with a lot of `.doc` and `.docx` files. — Bustami - Ismael Gómez, Jan 05 '17 at 15:49
Create an issue at https://bz.apache.org/bugzilla/buglist.cgi?product=POI and attach a file that fails, along with a minimal snippet of code that shows the problem, and the devs will take a look at it. — jmarkmurphy, Jan 05 '17 at 15:53
Here is the bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=60556. Nothing yet. — Bustami - Ismael Gómez, Jan 12 '17 at 18:44
This bug was solved some time ago (https://bz.apache.org/bugzilla/show_bug.cgi?id=60556). The problem was with some hidden bookmarks in .doc files! — Bustami - Ismael Gómez, Nov 15 '17 at 20:40
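
As a rough illustration of the `replace` versus `replaceAll` point raised in the comments above, here is a minimal sketch that uses only the JDK. The class and method names (`LineBreaks`, `flatten`, `flattenWithRegex`) are made up for this example and are not part of Apache POI. Both variants should give the same result for line breaks, which fits the observation that the exception comes from `getText()` rather than from the replacement calls.

```java
// Hypothetical helper for illustration only; not part of Apache POI.
final class LineBreaks {
    private LineBreaks() {}

    // Literal replacement: no regular expression is involved,
    // every '\n' and every '\r' becomes a single space.
    static String flatten(String text) {
        return text.replace('\n', ' ').replace('\r', ' ');
    }

    // Regex replacement with the same effect, shown for comparison.
    static String flattenWithRegex(String text) {
        return text.replaceAll("[\\r\\n]", " ");
    }

    public static void main(String[] args) {
        String sample = "first line\r\nsecond line\n";
        // Both approaches agree on plain line breaks.
        System.out.println(flatten(sample).equals(flattenWithRegex(sample)));

        // Where they differ: replaceAll treats its first argument as a regex.
        System.out.println("a.b".replace(".", "-"));    // a-b
        System.out.println("a.b".replaceAll(".", "-")); // ---
    }
}
```

The last two lines show why `replace` is the safer default when no pattern matching is intended: the `.` in `replaceAll` matches every character.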

score 1 · Answer 1 · answered Jan 05 '17 at 15:43

From the docs for XWPFWordExtractor:

Helper class to extract text from an OOXML Word file

So this is your problem :) And the solution, from their docs:

For .doc files from Word 97 - Word 2003, in scratchpad there is org.apache.poi.hwpf.extractor.WordExtractor, which will return text for your document.

Those using POI 3.7 can also extract simple textual content from older Word 6 and Word 95 files, using the scratchpad class org.apache.poi.hwpf.extractor.Word6Extractor.

For .docx files, the relevant class is org.apache.poi.xwpf.extractor.XWPFWordExtractor
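
Since the choice of extractor depends on the file format, one way to decide is to look at the first bytes of the file instead of trusting the extension. The sketch below is an assumption-laden illustration using only the JDK (Java 11 or later for `readNBytes`): the names `WordContainerSniffer`, `Container` and `classify` are hypothetical and not part of Apache POI. It relies on two well-known signatures: OLE2 compound files (the container used by `.doc`) start with `D0 CF 11 E0 A1 B1 1A E1`, and ZIP archives (the container used by `.docx`) start with `PK\x03\x04`. Both containers are also used by other formats, so this only narrows the choice, and it would not by itself avoid an exception thrown inside `getText()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Apache POI.
final class WordContainerSniffer {
    // OLE2 suggests the HWPF classes, ZIP suggests the XWPF classes.
    enum Container { OLE2, ZIP, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    private WordContainerSniffer() {}

    // Classifies a file by reading only its first eight bytes.
    static Container classify(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return classify(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    // Classifies an already-read header; shorter input yields UNKNOWN.
    static Container classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller could then pass `Container.OLE2` files to `WordExtractor` and `Container.ZIP` files to `XWPFWordExtractor`, and report anything else as unsupported.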

Hrabosch

Hi, yes, I have an `if` statement for `.doc` files with HWPF and `.docx` files with XWPF. But sadly that is not the solution. – Bustami - Ismael Gómez Jan 05 '17 at 15:47
@Bustami You should include the if in your snippet as that is causing confusion. – jmarkmurphy Jan 05 '17 at 15:49
@Bustami Sorry, but I don't understand what your problem is. The problem was that you used the wrong extractor; now you have updated your question with a different extractor. What exactly do you need? – Hrabosch Jan 05 '17 at 15:50
Sorry, the change of extractor was only a correction of an initial mistake here. My problem is that the code fails with some `.doc` files. Thanks. – Bustami - Ismael Gómez Jan 05 '17 at 15:53
And why don't you use the normal WordExtractor with a POIFSFileSystem as the extractor's constructor parameter? – Hrabosch Jan 05 '17 at 16:00
Strange... Then it must be a bug on their side. – Hrabosch Jan 05 '17 at 16:14
