Converting large amounts of text and dynamic data into PDF

Question

I have a three page Word document that needs to be converted into PDF. This Word document was given to me as a template to show me what the PDF output should look like. I tried converting this document into PDF, created a PDF form and used iTextSharp to open the form, populate it with data and return it back to the client. This is all great but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden.

My second attempt was to create an MVC 2 View without a master page, pass the model to the view, take the HTML representation of the View, pass it over to iTextSharp and render the PDF. The problem here was that iTextSharp failed on some tags (one of them was the <hr> tag). I managed to get rid of the problematic tag, but then tables were not rendered properly. Namely, the border attribute was ignored so I ended up with borderless tables. That attempt failed.

I need a suggestion or advice on the most efficient way to create a PDF document in MVC 2 which would be maintainable in the long run. I really don't want my actions to be 200+ lines long. Working directly with the Word document is not the best solution as I have never worked with VSTO so I don't quite know what it would look like to open Word and manipulate text inside of it and add dynamic data and then convert that dynamically into PDF.

Any suggestion is highly welcome.

Best regards!

Perhaps not an answer, but something to explore may be pdf.js: https://github.com/andreasgal/pdf.js/ — Justin Beckwith, Aug 09 '11 at 16:43
Hi Justin, thanks for the response. However, this prototype leverages the abilities of HTML 5. The application that I am writing will be available for the public so there are different (read older :) ) browsers that need to be supported. — Huske, Aug 09 '11 at 16:46
I hear you. No matter which language I've dealt with, generating PDFs is the worst. I wish you good luck :-) — Justin Beckwith, Aug 09 '11 at 16:48
The solution in the end was to use iTextSharp and create all three pages, paragraph by paragraph. It was a pain but I got the job done and I hope there won't be any changes to the document. — Huske, Aug 12 '11 at 11:26

score 2 · Accepted Answer · edited May 23 '17 at 12:11

One thing that I've done in the past is to save the Word file as a DOCX and unzip it, since DOCX is just a renamed zip file. Within the archive, open up /word/document.xml and you'll see your document. There are a lot of weird XML tags in there but overall you should get a pretty good idea of where your content is. Then just add placeholder text like {FIRST_NAME}, save the file and re-zip.

Then from code you can just perform the same steps, unzipping with something like SharpZipLib or DotNetZip, swapping placeholder copy, re-zipping and then using very simple Word automation to Save-As a PDF.
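The unzip, swap and re-zip steps can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the .NET code the answer has in mind (there you would use SharpZipLib or DotNetZip). The function name `fill_docx_template` and the `values` dictionary are made up for the example. It also assumes each placeholder sits in the XML as one contiguous string; Word sometimes splits text across several runs, in which case a plain string replace will not find it.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx_template(template_bytes, values):
    """Return a copy of a DOCX with {KEY} tokens in word/document.xml replaced.

    `values` maps placeholder names (without braces) to replacement text.
    Hypothetical helper for illustration only.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the replacement keeps the XML valid.
                    xml = xml.replace("{" + key + "}", escape(value))
                data = xml.encode("utf-8")
            # Every other part of the archive is copied through unchanged.
            dst.writestr(item, data)
    return out.getvalue()
```

The result is the bytes of a new DOCX, which would then be handed to whatever does the Save-As-PDF step.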

The other route is to fully utilize iTextSharp and actually write Paragraphs and PdfPTable and everything else. It takes a lot longer to set up but would give you the most control.

answered Aug 09 '11 at 18:15 by Chris Haas · edited by Community

+1, Word automation is straightforward and easy to implement. But there are a few problems: it is very slow and resource-intensive. – Karthik Mahalingam Aug 09 '11 at 18:39
@Chris, thanks for your suggestion. I'll put some thought into this approach. However, I am a bit afraid that I might end up with iTextSharp and rebuilding the three-page document from the ground up. Just the thing that I am trying to avoid. – Huske Aug 10 '11 at 08:19
@Huske, building from scratch is not as bad as it sounds once you get used to iTextSharp. It sometimes helps to browse the source code, too. Definitely ask any questions here if you have them! – Chris Haas Aug 10 '11 at 13:03
@Chris, I am already using iTextSharp for two views. One is based on a PDF form, so I am using iTextSharp's AcroForm to populate the form and return it to the user, and another one is building a one-page table report. So I am quite comfortable. It is just that I am afraid of the maintenance at a later stage regarding this third PDF. Thanks for your help! – Huske Aug 10 '11 at 13:13

score 0 · Answer 2 · answered Aug 09 '11 at 17:08

Q: You say "... but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden". How do you end up having too much data? If the Word template can "hold" the data in 3 pages, it should fit in 3 PDF pages.

I used to use iTextSharp to create my PDFs, but I almost always ended up building the PDF document from scratch myself (not really a <200 line solution).

Have you considered another library? I recently switched to MigraDoc's PDFSharp. It is way simpler to use than iText, with lots of examples and documentation.

Just my two cents

Thanks for your response. I saw PDFSharp long before iTextSharp, but the latter proved to be more feature-rich. I do admit that it looks easier to program with MigraDoc's solution than with iTextSharp. — Huske, Aug 10 '11 at 08:21

score 0 · Answer 3 · answered Aug 09 '11 at 18:33

The Word document object model is quite easy to understand. The body contains a series of paragraphs and tables. Using the Open XML SDK, you can iterate through each paragraph/table in the Word document and retrieve its content and styles. Then you can generate the PDF document on the fly using the retrieved information. This will work under MVC too.
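To give a feel for that structure, here is a rough sketch that walks the top-level paragraphs and tables of a document body. It is written in Python with the standard library rather than the Open XML SDK, so treat it as an illustration of the idea and not the SDK's API. The helper name `list_body_blocks` is made up, and it only collects plain text, not styles.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_body_blocks(docx_bytes):
    """Return (kind, text) pairs for the top-level blocks of a DOCX body.

    kind is "paragraph" or "table". Hypothetical helper for illustration.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    blocks = []
    for child in body:
        if child.tag == W + "p":
            kind = "paragraph"
        elif child.tag == W + "tbl":
            kind = "table"
        else:
            continue  # skip other elements such as w:sectPr
        # Join the text of every w:t descendant; table cells are
        # concatenated without a separator in this simple sketch.
        text = "".join(t.text or "" for t in child.iter(W + "t"))
        blocks.append((kind, text))
    return blocks
```

Each pair could then be mapped to the matching PDF element (for example a paragraph or a table in iTextSharp), which is where most of the real work would be.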

But if your Word document contains complex elements, this approach will take more time to implement. Also, it only works with Word 2007 and 2010 (.docx) files.

Also, the HTML to PDF options currently available in the iTextSharp library work with only a known set of tags, as far as I know.

Another suggestion is to make use of commercially available .NET components. There are a lot of good solutions available, for example Syncfusion.

I tried HTML to PDF but it kept failing at some points and it does not recognize border attribute of a table tag. — Huske, Aug 10 '11 at 08:20
