0

For a school project i'm working on an image extractor for PDF's for this i'm using the PDFBox library. The problem i'm facing now is to get the metadata, so far I only managed to get the metadata from the PDF itself but not from the images inside the PDF.

Is it possible to get the metadata from all the images inside a PDF with the PDFBox? if so could anybody refer me to an example? Any examples i've found so far are all for the metadata of the PDF itself and not for the images.

I've also heard that when a PDF is created, it removes any metadata from the objects within, is this true?

Hopefully someone on stackoverflow can help me out.

Basca
  • 1
  • 1
  • 2
  • Duplicate of: http://stackoverflow.com/questions/5653826/how-can-i-extract-images-and-their-metadata-from-pdfs – Erik Jun 01 '11 at 07:03

1 Answers1

2

I don't agree to the others and have a POC for your question: You can extract the XMP Metadata of images using pdfbox in the following way:

public void getXMPInformation() {
    // Open PDF document
    PDDocument document = null;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Get all pages and loop through them
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while( iter.hasNext() ) {
        PDPage page = (PDPage)iter.next();
        PDResources resources = page.getResources();            
        Map images = null;
        // Get all Images on page
        try {
            images = resources.getImages();
        } catch (IOException e) {
            e.printStackTrace();
        }
        if( images != null ) {
            // Check all images for metadata
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() ) {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found a image: Analyzing for Metadata");
                if (metadata == null) {
                    System.out.println("No Metadata found for this image.");
                } else {
                    InputStream xmlInputStream = null;
                    try {
                        xmlInputStream = metadata.createInputStream();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    try {
                        System.out.println("--------------------------------------------------------------------------------");
                        String mystring = convertStreamToString(xmlInputStream);
                        System.out.println(mystring);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the images
                String name = getUniqueFileName( key, image.getSuffix() );
                    System.out.println( "Writing image:" + name );
                    try {
                        image.write2file( name );
                    } catch (IOException e) {
                        // TODO Auto-generated catch block
                        //e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    }
}

And the "Helper methods":

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the BufferedReader.readLine()
     * method. We iterate until the BufferedReader return null which means
     * there's no more data to read. Each line will appended to a StringBuilder
     * and returned as String.
     */
    if (is != null) {
        StringBuilder sb = new StringBuilder();
        String line;

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } finally {
            is.close();
        }
        return sb.toString();
    } else {       
        return "";
    }
}

private String getUniqueFileName( String prefix, String suffix ) {
    /*
    * imagecounter is a global variable that counts from 0 to the number of
    * extracted images
    */
    String uniqueName = null;
    File f = null;
    while( f == null || f.exists() ) {
        uniqueName = prefix + "-" + imageCounter;
        f = new File( uniqueName + "." + suffix );
    }
    imageCounter++;
    return uniqueName;
}

Note: This is a quick and dirty proof of concept and not a well-styled code.

The Images must have XMP-Metadata when placed in InDesign before building the PDF document. The XMP-Metdadata can be set by using Photoshop for example. Please be aware, that p.e. not all IPTC/Exif/... Information is converted into the XMP-Metadata. Only a small number of fields are converted.

I'm using this method on JPG and PNG images, placed in PDFs build with InDesign. It works well and I can get all image-informations after the production-steps from the ready PDFs (picture coating).

Erik
  • 3,777
  • 21
  • 27