I'm working on a project that needs to extract PDFs from a zip file attached to a model. Each extracted PDF is then related to the Project, as in the models.py below:
from django.db import models


class Project(models.Model):
    name = models.CharField(max_length=100)
    files = models.FileField('PDF Dataset',
                             help_text='Upload a zip here',
                             null=True)


class Pdf(models.Model):
    name = models.CharField(max_length=100)
    file = models.FileField(null=True)
    project = models.ForeignKey(Project, on_delete=models.CASCADE)
I then have a task that I can trigger via Celery to extract the PDFs and save each one as its own record. My sample tasks.py is below:
from django.core.files.base import ContentFile
from celery import shared_task
from zipfile import ZipFile
import re

from .models import Pdf, Project  # assumes tasks.py sits next to models.py


@shared_task(bind=True)  # bind=True passes the task instance as `self`
def extract_pdfs_from_zip(self, project_id: int):
    project = Project.objects.get(pk=project_id)
    ...
    # Start unzipping from here.
    # NOTE: This script assumes there are no __MACOSX shenanigans in the zip file.
    pdf_file_pattern = re.compile(r'.*\.pdf')
    pdf_name_pattern = re.compile(r'.*\/(.*\.pdf)')
    with ZipFile(project.files) as zipfile:
        for name in zipfile.namelist():
            # S2: Check if file is .pdf
            if pdf_file_pattern.match(name):
                pdf_name = pdf_name_pattern.match(name).group(1)
                print('Accessing {}...'.format(pdf_name))
                # S3: Save file as a new Pdf entry
                new_pdf = Pdf.objects.create(name=pdf_name, project=project)
                new_pdf.file.save(ContentFile(zipfile.read(name)),
                                  pdf_name, save=True)  # Problem here
                print('New document saved: {}'.format(new_pdf))
            else:
                print('Not a PDF: {}'.format(name))
    return 'Run complete, all PDFs uploaded.'
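To rule out the zip handling itself, the extraction step can be exercised on its own, without Django or Celery. The following is only a minimal sketch under stated assumptions: `iter_pdfs` and `build_sample_zip` are hypothetical helper names, the sample archive contents are made up, and skipping `__MACOSX/` entries is an extra precaution that the task does not take.

```python
import io
import posixpath
from zipfile import ZipFile


def iter_pdfs(zip_source):
    """Yield (base_name, raw_bytes) for every .pdf entry in a zip archive.

    zip_source may be a filesystem path or a binary file-like object.
    """
    with ZipFile(zip_source) as archive:
        for info in archive.infolist():
            # Skip directories and macOS resource-fork entries.
            if info.is_dir() or info.filename.startswith('__MACOSX/'):
                continue
            if not info.filename.lower().endswith('.pdf'):
                continue
            # Zip entry names use forward slashes, so posixpath works for
            # entries at the archive root as well as nested ones.
            yield posixpath.basename(info.filename), archive.read(info)


def build_sample_zip():
    """Build a small in-memory zip for demonstration (made-up contents)."""
    buffer = io.BytesIO()
    with ZipFile(buffer, 'w') as archive:
        archive.writestr('docs/a.pdf', b'%PDF-1.4 fake body\n%%EOF\n')
        archive.writestr('b.pdf', b'%PDF-1.7 another\n%%EOF\n')
        archive.writestr('docs/notes.txt', b'not a pdf')
        archive.writestr('__MACOSX/docs/._a.pdf', b'resource fork')
    buffer.seek(0)
    return buffer


if __name__ == '__main__':
    for pdf_name, data in iter_pdfs(build_sample_zip()):
        print(pdf_name, len(data), data[:8])
```

If the bytes come out intact here, the zip reading is probably not the culprit. Inside the task, each `(pdf_name, data)` pair would then be stored through the file field. Django's `FieldFile.save` is declared as `save(name, content, save=True)`, with the name first and the content second, so the call would look like `new_pdf.file.save(pdf_name, ContentFile(data), save=True)`.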
For some reason, though, the part where it saves the document no longer produces a valid PDF. I know the contents of the original zip, so I'm sure the files are PDFs. Any ideas on how to save each file while retaining its PDF-ness?
The expected result is a readable PDF. Right now the saved file shows up as corrupted when I open it. I appreciate any help on this.
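To make "corrupted" more concrete, the bytes read from the zip can be compared with the bytes that end up in storage. The snippet below is a rough sketch, not a real PDF validator: `looks_like_pdf` and `sha256_hex` are hypothetical helper names, and the check only looks for the `%PDF-` header marker near the start and the `%%EOF` marker near the end.

```python
import hashlib


def looks_like_pdf(data: bytes) -> bool:
    """Heuristic check for the PDF header and trailer markers."""
    return b'%PDF-' in data[:1024] and b'%%EOF' in data[-1024:]


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare the original and the saved bytes."""
    return hashlib.sha256(data).hexdigest()


if __name__ == '__main__':
    original = b'%PDF-1.4\nfake body\n%%EOF\n'  # stand-in for zipfile.read(name)
    saved = original                             # stand-in for the stored file's bytes
    print(looks_like_pdf(original), looks_like_pdf(saved))
    print(sha256_hex(original) == sha256_hex(saved))
```

If the two digests differ, or the saved file lacks the `%PDF-` marker, the content was changed or replaced somewhere between `zipfile.read(name)` and the storage backend, rather than being damaged inside the zip.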