How to extract and save files from PDF in a Django Model

Question

I'm working on a project that needs to extract PDFs from a zip file attached to a Model. Each extracted PDF is then related to the Project, as in the models.py below:

class Project(models.Model):
   name = models.CharField(max_length=100)
   files = models.FileField('PDF Dataset',
                            help_text='Upload a zip here',
                            null=True)

class Pdf(models.Model):
   name = models.CharField(max_length=100)
   file = models.FileField(null=True)
   project = models.ForeignKey(Project, on_delete=models.CASCADE)

I then have a task I can trigger via Celery to extract the PDFs and save each one as its own record. My sample tasks.py is below:

import re
from zipfile import ZipFile

from celery import shared_task
from django.core.files.base import ContentFile

from .models import Pdf, Project  # assumes tasks.py sits next to models.py


@shared_task(bind=True)  # bind=True passes the task instance in as `self`
def extract_pdfs_from_zip(self, project_id: int):
    project = Project.objects.get(pk=project_id)
    ...
    # Start unzipping from here.
    # NOTE: This script presumes that there's no MACOSX shenanigans in the zip file.
    pdf_file_pattern = re.compile(r'.*\.pdf$')
    # The directory prefix is optional so top-level PDFs match too.
    pdf_name_pattern = re.compile(r'(?:.*/)?(.*\.pdf)$')
    with ZipFile(project.files) as zipfile:
        for name in zipfile.namelist():
            # S2: Check if file is .pdf
            if pdf_file_pattern.match(name):
                pdf_name = pdf_name_pattern.match(name).group(1)
                print('Accessing {}...'.format(pdf_name))
                # S3: Save file as a new Pdf entry
                new_pdf = Pdf.objects.create(name=pdf_name, project=project)
                # FieldFile.save() takes the name first, then the content.
                new_pdf.file.save(pdf_name, ContentFile(zipfile.read(name)),
                                  save=True) # Problem here
                print('New document saved: {}'.format(new_pdf))
            else:
                print('Not a PDF: {}'.format(name))
    return 'Run complete, all PDFs uploaded.'

For some reason, though, the part where it's saving the document no longer outputs a valid PDF. I know the contents of the original zip, so I'm sure they're PDFs. Any ideas how to save the file while retaining its PDF-ness?

Expected result is the PDF being readable. Right now it shows up as corrupted when I open the file. Appreciate your help on this.

score 0 · Accepted Answer · answered Apr 09 '19 at 06:25


Oopsie, looks like my zip file was corrupted when I removed the __MACOSX files from it. I did the removing outside of the tasks.py file. See "Mac zip compress without __MACOSX folder?" for further details.
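An alternative to editing the zip by hand is to leave the archive untouched and skip the metadata entries while iterating. The snippet below is only a minimal sketch of that idea using the standard library; the helper name iter_pdf_members and the sample member names are hypothetical, and it has not been tried against the zip from the question.

```python
import io
from pathlib import PurePosixPath
from zipfile import ZipFile


def iter_pdf_members(archive: ZipFile):
    """Yield (member_name, base_name) for PDFs, skipping macOS metadata."""
    for info in archive.infolist():
        if info.is_dir():
            continue
        path = PurePosixPath(info.filename)
        # Skip the __MACOSX tree and AppleDouble "._" companion files.
        if "__MACOSX" in path.parts or path.name.startswith("._"):
            continue
        if path.suffix.lower() == ".pdf":
            yield info.filename, path.name


# Build a small in-memory zip to demonstrate the filter.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("docs/report.pdf", b"%PDF-1.4 demo")
    archive.writestr("__MACOSX/docs/._report.pdf", b"resource fork")
    archive.writestr("docs/notes.txt", b"not a pdf")

with ZipFile(buffer) as archive:
    found = list(iter_pdf_members(archive))

print(found)
```

In the task, the loop over zipfile.namelist() could iterate over such a helper instead, so the archive never needs to be modified before upload.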


jayg_code


Removing the __MACOSX metadata files shouldn't break the original files though. – AKX Apr 09 '19 at 06:28
@AKX That's what I thought too but it's doing otherwise in my parallel test with another zip file. Unfortunately I'm unable to provide the sample zip file as a) they contain sensitive data, and b) the file size is a tad big. I'll see if I can repeat it with another non-sensitive dataset PDFs. – jayg_code Apr 09 '19 at 23:41
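One way to narrow this down without sharing the sensitive archive is to check the zip itself and then check each candidate member. The sketch below is an assumption-laden illustration, not a diagnosis: check_pdfs is a hypothetical helper, the sample bytes are made up, and the "%PDF-" test is only a heuristic based on the usual PDF header. ZipFile.testzip() reports the first member whose CRC or header is bad, or None if all members read cleanly.

```python
import io
from zipfile import ZipFile


def check_pdfs(archive: ZipFile):
    """Return (first_bad_member_or_None, {member_name: looks_like_pdf})."""
    # testzip() reads every member and verifies its CRC.
    bad_member = archive.testzip()
    results = {}
    for name in archive.namelist():
        if name.lower().endswith(".pdf"):
            # Heuristic: PDF files normally start with the bytes "%PDF-".
            results[name] = archive.read(name).startswith(b"%PDF-")
    return bad_member, results


# Demo archive: one real-looking PDF and one macOS "._" companion file.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("a.pdf", b"%PDF-1.4\n% demo content")
    archive.writestr("__MACOSX/._a.pdf", b"\x00\x05\x16\x07 metadata")

with ZipFile(buffer) as archive:
    bad_member, results = check_pdfs(archive)

print(bad_member, results)
```

Note that the "._a.pdf" entry ends in .pdf but is not a PDF, so a name-only check would save it as a Pdf record that cannot be opened. Whether that is what happened with the original dataset is not known from this thread.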
