How can I replace text in PDF with iText without encoding issue? (Android)

Question

I need some help. I want to replace one piece of text with another in a PDF file (I'm using the iText library), but when the replacement contains accented letters, I run into encoding problems.

public static void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream) object;
        byte[] data = PdfReader.getStreamBytes(stream);


        String eredeti = "öüóá";
        final String s = new String(eredeti.getBytes(), BaseFont.CP1250);

        stream.setData(new String(data).replace("Hello World", s).getBytes());
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}

But when I open the PDF file, I see this: [image: Wrong PDF]

I've already tried every encoding I could think of to get the right letters (öüóá), but none of them worked for me.

Does anybody have an idea what I should do?

The page dictionary you get with `getPageN()` has a `/Resources` entry. This entry contains references to a font. If it's a simple font, it defines a maximum of 256 characters. It may very well be that the characters you need aren't there. If it's a composite font, it will most likely only contain a subset of characters that are already used in the document. The characters you need may not be there. All in all, this is a bad question. The code you share shouldn't be used. The problem you're trying to solve is documented as "don't try to do this." — Bruno Lowagie, Jun 12 '16 at 17:14
Ákos, not only are there the possible problems hinted at by @Bruno (characters not present in font) and obviously encoding problems, it also is very dangerous to treat the content stream as if it were a character string with a single encoding: Unless you know what you are doing and have sanitized your inputs accordingly, you have a good chance of making the stream content invalid. PDFBox used to have a bundled example doing something similar, and for all the reasons mentioned above they removed it from their distribution and now warn against doing something like that, too. — mkl, Jun 13 '16 at 11:27
Thank you guys for the quick support... If my code is wrong in this way, is there any method to do what I want, or should I give it up ? — arathus98, Jun 13 '16 at 14:49
*is there any method to do what I want, or should I give it up* - it depends on how much time you have for that and whether you want to change only a few internally simply built PDFs or generic PDFs from the wild. Editing PDF contents is difficult due to the different encodings and subsets involved and due to possibly necessary reflowing of the text. If on the other hand you know your PDFs are simple in these regards (i.e. only using standard encodings, fonts (if embedded) being fairly complete, no reflowing necessary), there are ways to do this with a sensible amount of time and resources. — mkl, Jun 14 '16 at 14:09
In any case, I have to replace text in a PDF file. If there is any trick to do this, please share your idea with me. It doesn't matter if it takes long, I have enough time for that ;) — arathus98, Jun 14 '16 at 14:16
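
The risk mkl describes in the comments above (treating the content stream as a character string with a single encoding) can be reproduced without iText. The following is only a minimal sketch using the JDK: the class name `StreamRoundTripDemo`, the helper `roundTrip`, and the sample bytes are made up for illustration. It assumes a platform whose default charset is UTF-8 (as on Android), which is what a bare `new String(data)` would use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StreamRoundTripDemo {

    /** Decodes bytes with one charset and re-encodes them with another. */
    static byte[] roundTrip(byte[] data, Charset decodeWith, Charset encodeWith) {
        return new String(data, decodeWith).getBytes(encodeWith);
    }

    public static void main(String[] args) {
        // Hypothetical content-stream fragment: a text operator followed by two
        // raw bytes (0xE9, 0xFB), as a single-byte-encoded string might contain.
        byte[] data = {'(', 'H', 'i', ')', ' ', 'T', 'j', ' ', (byte) 0xE9, (byte) 0xFB};

        // Decoding as UTF-8 turns the invalid byte sequences into replacement
        // characters, so the original bytes cannot be recovered.
        byte[] viaUtf8 = roundTrip(data, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 maps every byte to exactly one char and back.
        byte[] viaLatin1 = roundTrip(data, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(data, viaUtf8));   // false
        System.out.println(Arrays.equals(data, viaLatin1)); // true
    }
}
```

A lossless byte round trip only avoids accidental corruption of unrelated bytes. It does not address the font problems Bruno mentions: the characters you insert still have to exist in the font referenced by the page's `/Resources`.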

score 1 · Accepted Answer · answered Jun 20 '16 at 17:18

I already found the solution ;)

The problem was that I encoded the string before I put it into the PDF file. You should encode the string at exactly the point where you write it into the PDF, like here:

    stream.setData(new String(data).replace("Hello World", s).getBytes("ISO-8859-2"));
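
To see why the charset passed to `getBytes` matters, it can help to look at the bytes each encoding produces for the replacement text. This is a small JDK-only sketch; the class `AccentEncodingDemo` and the helper `hex` are hypothetical names used just for this illustration. Whether a given PDF actually renders these bytes as the intended letters still depends on the encoding of the font used on the page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AccentEncodingDemo {

    /** Formats bytes as space-separated upper-case hex, e.g. "F6 FB". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "öűóá" written with Unicode escapes so the source file encoding does not matter.
        String text = "\u00F6\u0171\u00F3\u00E1";

        // One byte per letter in ISO-8859-2.
        System.out.println(hex(text.getBytes(Charset.forName("ISO-8859-2")))); // F6 FB F3 E1

        // Two bytes per letter in UTF-8.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8))); // C3 B6 C5 B1 C3 B3 C3 A1

        // ISO-8859-1 has no "ű", so Java substitutes '?' (0x3F).
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // F6 3F F3 E1
    }
}
```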

You can see the final form of my code here:

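import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public static void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    // /Contents can also be an array of streams; only the single-stream case is handled here.
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream) object;
        byte[] data = PdfReader.getStreamBytes(stream);

        // Keep the replacement as an ordinary Java string; do not pre-encode it.
        final String s = "öűóá";

        // Decode and encode with the same single-byte charset (instead of the
        // platform default) so that unrelated bytes survive the round trip.
        // The encoding happens only here, when the bytes go back into the stream.
        String content = new String(data, "ISO-8859-2");
        stream.setData(content.replace("Hello World", s).getBytes("ISO-8859-2"));
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}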

arathus98


There are a few pdfs out there this works for. There are more out there this doesn't work for. – mkl Jun 21 '16 at 05:21
I tested it on several PDF files, and it worked every time for me :/ – arathus98 Jun 22 '16 at 11:44
Take e.g. the [free copy of the PDF specification](http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf), page 1, the title page, and try to replace "First" (from "First Edition") by "Second"; or "management" (from "Document management") by "display"; or "2008" (from "2008-7-1") by "1234". And this document is very tame internally... – mkl Jun 22 '16 at 16:37
@mkl, this works fine when `PdfObject object = dict.getDirectObject(PdfName.CONTENTS);` returns a `PRStream` object – Danyal Sandeelo Mar 25 '19 at 14:43
@DanyalSandeelo that is a necessary precondition but not a sufficient one. – mkl Mar 25 '19 at 15:11
@mkl, yes, true. I got it working. When the contents are returned as an array, there is another way to iterate over them – Danyal Sandeelo Mar 26 '19 at 12:15
@DanyalSandeelo You probably got it working for simple PDFs, using only **WinAnsiEncoding** and drawing full lines using a single string. There are such PDFs but there also are lots and lots of PDFs generated differently. So as I said in my first comment here: *There are a few pdfs out there this works for. There are more out there this doesn't work for.* – mkl Mar 26 '19 at 12:37
@mkl true, couldn't agree more – Danyal Sandeelo Mar 26 '19 at 12:41
@mkl I am able to make it work with different PDFs, but I cannot replace the data with Arabic words. Should I use UTF-8 explicitly and pass the Arabic word as an encoded string? https://stackoverflow.com/questions/55439445/updating-text-of-a-pdf-cannot-replace-string-with-arabic-word-itext-java – Danyal Sandeelo Mar 31 '19 at 10:47
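
One reason the approach fails on many PDFs, as mkl notes above, is that a line of text is often not drawn as a single string. A rough JDK-only sketch of the effect follows; the class `SplitTextDemo`, the helper `containsLiteral`, and both content-stream fragments are invented for illustration. The second fragment uses the PDF `TJ` operator, which takes an array of string pieces and position adjustments, so the phrase never appears contiguously in the stream.

```java
import java.nio.charset.StandardCharsets;

final class SplitTextDemo {

    /** Returns true if the needle occurs literally in the content stream bytes. */
    static boolean containsLiteral(byte[] contentStream, String needle) {
        // ISO-8859-1 gives a lossless one-byte-per-char view of the stream.
        return new String(contentStream, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) {
        // Whole line drawn with one string: a plain search finds it.
        byte[] simple = "BT /F1 12 Tf (Hello World) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Same visible text, but split into pieces with a kerning adjustment.
        byte[] kerned = "BT /F1 12 Tf [(Hello W) 35 (orld)] TJ ET"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(containsLiteral(simple, "Hello World")); // true
        System.out.println(containsLiteral(kerned, "Hello World")); // false
    }
}
```

In the second case `String.replace` silently does nothing, so the output PDF is unchanged even though the text is visible on the page.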
