
Here you can download a PDF with a single AcroForm field; its size is exactly 427 KB.

If I remove this one field, the file is only 3 KB. Why does this happen? I tried analysing it with PDF Debugger and nothing looked unusual to me.


pa1.Shetty
ebeg
  • I can't look at the PDF now, but what you should do is to look at what's below AP/N . Maybe that one has a huge font in the resources? – Tilman Hausherr Apr 03 '19 at 08:44
  • Here is what I have ([see gif](https://s2.gifyu.com/images/onefieldbigsize.gif)). It seems there is only /Helv, no? Is it normal that /Helv has 140 elements in /Differences? – ebeg Apr 03 '19 at 08:51
  • What happens if you just save the file without making any changes? – Tilman Hausherr Apr 03 '19 at 10:16
  • @TilmanHausherr The PDF has the same size after saving it without any changes :| – ebeg Apr 03 '19 at 13:43
  • There's an embedded "Arial" font in the acroform default resources at `Root/AcroForm/DR/Font/Arial/FontDescriptor/FontFile2`. – Tilman Hausherr Apr 03 '19 at 14:10
  • indeed, apparently bee not only removes the field but also at least this font from the default resources... – mkl Apr 03 '19 at 14:11
  • What is the solution here? Isn't Arial part of PDF by default? – ebeg Apr 03 '19 at 14:26
  • No, Helvetica is (named "Helv" in the default resources). Either you or whoever created the pdf added it for no reason. The font is not used / referenced. – Tilman Hausherr Apr 03 '19 at 16:20
  • Thank you. Is there a quick PDFBox way to check whether fonts are referenced or not? – ebeg Apr 03 '19 at 16:22
  • For the acroform default resources you could check whether the /DA entry (default appearance) of each field contains the font name. – Tilman Hausherr Apr 03 '19 at 19:31
  • If it does not, what is the correct way to remove them? – ebeg Apr 03 '19 at 20:06
  • Call `getCOSObject()` on the acroform default resources, remove the element from the font dictionary, and then call `setDefaultResources`. Do you need an answer that does all this? I'm wondering whether one should bother so much about just one single inefficient file. – Tilman Hausherr Apr 04 '19 at 03:39
  • @Tilman You might want to combine your comments and make them an answer... – mkl Apr 04 '19 at 09:25

1 Answer

There's an embedded "Arial" font in the acroform default resources, see Root/AcroForm/DR/Font/Arial/FontDescriptor/FontFile2.

Either you or whoever created the PDF added it for no reason. The font is not used or referenced. For the acroform default resources, you can check whether the /DA entry (default appearance) of each field contains the font name.

When you removed the field somehow you also removed the font from the acroForm default resources. (You didn't write how you removed it)

Here's some code to do it (null checks mostly missing):

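    // Imports needed by this snippet:
    // import java.util.ArrayList;
    // import java.util.List;
    // import org.apache.pdfbox.cos.COSDictionary;
    // import org.apache.pdfbox.cos.COSName;
    // import org.apache.pdfbox.pdmodel.PDResources;
    // import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
    // import org.apache.pdfbox.pdmodel.interactive.form.PDField;
    // import org.apache.pdfbox.pdmodel.interactive.form.PDVariableText;

    // doc is the already loaded PDDocument
    PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
    PDResources defaultResources = acroForm.getDefaultResources();
    // the /Font sub-dictionary of the acroform default resources (/DR)
    COSDictionary fontDict = (COSDictionary) defaultResources.getCOSObject().getDictionaryObject(COSName.FONT);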
    List<String> defaultAppearances = new ArrayList<>();
    List<COSName> fontDeletionList = new ArrayList<>();
    for (PDField field : acroForm.getFieldTree())
    {
        if (field instanceof PDVariableText)
        {
            PDVariableText vtField = (PDVariableText) field;
            defaultAppearances.add(vtField.getDefaultAppearance());
        }
    }
    for (COSName fontName : defaultResources.getFontNames())
    {
        if (COSName.HELV.equals(fontName) || COSName.ZA_DB.equals(fontName))
        {
            // Adobe default, always keep
            continue;
        }
        boolean found = false;
        for (String da : defaultAppearances)
        {
            if (da != null && da.contains("/" + fontName.getName()))
            {
                found = true;
                break;
            }
        }
        System.out.println(fontName + ": " + found);
        if (!found)
        {
            fontDeletionList.add(fontName);
        }
    }
    System.out.println("deletion list: " + fontDeletionList);
    for (COSName fontName : fontDeletionList)
    {
        fontDict.removeItem(fontName);
    }

The resulting file is now 5 KB.

I haven't checked the annotations. Some of them also have a /DA string, but it is unclear whether the acroform default resources fonts are to be used when reconstructing a missing appearance stream.
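One caveat about the `da.contains("/" + fontName.getName())` test above: it is a substring match, so a resource named `Arial` would also count as used if some field referenced a different font such as `/ArialBold`. Here is a minimal sketch of a stricter check, assuming the /DA string has the usual `/Name size Tf ...` shape and contains no `#xx` name escapes. The class `DaFontNames` and its methods are hypothetical helpers, not part of PDFBox; they also collect the names into a `Set`, so each lookup is a single `contains` call.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DaFontNames
{
    /** Returns the font resource name selected by the Tf operator, or null if there is none. */
    static String fontName(String da)
    {
        if (da == null)
        {
            return null;
        }
        String[] tokens = da.trim().split("\\s+");
        // Tf takes two operands: the font name and the font size
        for (int i = 2; i < tokens.length; i++)
        {
            if ("Tf".equals(tokens[i]) && tokens[i - 2].startsWith("/"))
            {
                return tokens[i - 2].substring(1);
            }
        }
        return null;
    }

    /** Collects the distinct font names referenced by a list of /DA strings. */
    static Set<String> usedFontNames(List<String> defaultAppearances)
    {
        Set<String> names = new LinkedHashSet<>();
        for (String da : defaultAppearances)
        {
            String name = fontName(da);
            if (name != null)
            {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args)
    {
        Set<String> used = usedFontNames(Arrays.asList("/Helv 8.64 Tf 0 g", "/ArialBold 10 Tf 0 g", null));
        System.out.println(used.contains("Helv"));  // true
        System.out.println(used.contains("Arial")); // false, although "/ArialBold" contains "/Arial"
    }
}
```

With something like this, the inner loop over `defaultAppearances` could become `used.contains(fontName.getName())`.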

Update: Here's some additional code to replace Arial with Helv:

    // Run this in place of the first loop above, so that the unused-font check sees the updated /DA strings
    for (PDField field : acroForm.getFieldTree())
    {
        if (field instanceof PDVariableText)
        {
            PDVariableText vtField = (PDVariableText) field;
            String defaultAppearance = vtField.getDefaultAppearance();
            // "/Arial " is 7 characters long, which is what substring(7) skips
            if (defaultAppearance != null && defaultAppearance.startsWith("/Arial "))
            {
                vtField.setDefaultAppearance("/Helv " + defaultAppearance.substring(7));
                vtField.getWidgets().get(0).setAppearance(null); // this removes the font usage
                vtField.setValue(vtField.getValueAsString()); // rebuilds the appearance with the new font
            }
            defaultAppearances.add(vtField.getDefaultAppearance());
        }
    }

Note that this may not be a good idea, because the standard 14 fonts cover only a limited set of characters. Try

    vtField.setValue("Ayşe");

and you'll get an exception.
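To find out in advance which field values are at risk, a rough pre-check is possible without PDFBox. This is only a sketch under an assumption: it treats the JDK charset `windows-1252` as a stand-in for the encoding used with Helv, which is close to but not guaranteed identical to what PDFBox applies. `WinAnsiCheck` and `probablyEncodable` are hypothetical names.

```java
import java.nio.charset.Charset;

public class WinAnsiCheck
{
    /**
     * Approximation only: reports whether every character of the value exists in
     * windows-1252, used here as a stand-in for the encoding of the standard fonts.
     */
    static boolean probablyEncodable(String value)
    {
        return Charset.forName("windows-1252").newEncoder().canEncode(value);
    }

    public static void main(String[] args)
    {
        System.out.println(probablyEncodable("Ayse")); // true
        System.out.println(probablyEncodable("Ayşe")); // false: U+015F is not in windows-1252
    }
}
```

Fields whose current value fails such a check are probably better left on the embedded font.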

More general code for replacing fonts can be found in this answer.

Tilman Hausherr
  • `defaultAppearances` would be more efficient as a `Set`, but I'm too lazy now :-) – Tilman Hausherr Apr 04 '19 at 10:20
  • Does this radically delete all fonts? Is there no test whether they are really used in a field or not? And I can't understand why there are 500 KB of font data and why the font is not subsetted. Thank you, I will try the code. – ebeg Apr 04 '19 at 11:06
  • The font is referenced in the default appearances, like this: `/Helv 8.64 Tf 0 g`. So if it isn't there, it isn't used. (I don't know about rich text, but we don't support that anyway.) It makes sense not to subset a font in an acroform, because the field (if it used the font) may be edited, so all glyphs are needed. – Tilman Hausherr Apr 04 '19 at 11:15
  • `Does this radically delete all fonts?` No! Only those that are not used. – Tilman Hausherr Apr 04 '19 at 11:18
  • I tried to apply your code to [this 585 KB PDF](https://drive.google.com/uc?id=1kIDizTRSn5ZN0mTLEo0MjsJzRkRQGo0x&export=download) and nothing changed, the size stays the same. I changed your code to always delete all fonts from the acroform resources even if they are used (except Helv and ZaDb). Maybe /Arial and /Cour are now in the page resources, but passing the file through a compressor tool made no difference either. – ebeg Apr 04 '19 at 13:04
  • Here is the code I used on the 585 KB PDF: [see code](https://pastebin.com/9eDfjNwh). The acroform still seems to contain these fonts. – ebeg Apr 04 '19 at 13:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/191255/discussion-between-bee-and-tilman-hausherr). – ebeg Apr 04 '19 at 13:10