docx4j - performance & layout problems

Posted: Thu Oct 18, 2012 11:04 pm
by kod_moe
Hello,

My company has an order management platform with a module that dynamically generates PDFs. For several years we've been using Apache FOP, but recently we received a business request to change the layout and content of the PDFs.

We decided to use the MS Word .docx templates that the business sent us, add placeholders to the documents for the dynamic content, and output a PDF.

We found docx4j and decided to use it.
At first it was really simple and intuitive. But now we're running into some problems we can't figure out how to resolve, and we're hoping for the community's help!

We have a Servlet that opens the PDF in the browser.

Code:
public void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {

        ByteArrayOutputStream output = null;
        try {
                // The PDF bytes were stored in the session by createPdf()
                output = (ByteArrayOutputStream) req.getSession().getAttribute("doc_pdf");
                resp.setContentType("application/pdf");
                resp.setContentLength(output.size());
                resp.getOutputStream().write(output.toByteArray());

        } catch (Exception e) {
                // Log instead of silently swallowing the error
                // (for example a NullPointerException when no PDF is in the session)
                logger.warn(e);
        } finally {
                if (output != null) {
                        try {
                                output.flush();
                                output.close();
                        } catch (IOException e1) {
                                logger.warn(e1);
                        }
                }
                resp.getOutputStream().flush();
                // The attribute was stored in the session, so remove it from the session, not the request
                req.getSession().removeAttribute("doc_pdf");
        }
}


We have a method, called by the Action (we're using Struts), that maps the variables in our Word document to some values.

Code:
public void createPdf(HttpServletRequest request) throws Exception {
                HashMap<String, String> mapping = new HashMap<String, String>();
                mapping.put("VAR_NAME", "Hugo AZEVEDO");

                ByteArrayOutputStream os = (ByteArrayOutputStream) DocxTemplate.FORM_NEW_LUXFIBRE.getInstance().generatePdf(mapping);
                request.getSession().setAttribute("doc_pdf", os);
        }
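
As an illustration of what the mapping above is meant to achieve, here is a minimal, self-contained sketch of `${KEY}` placeholder substitution on an XML string. It is not docx4j's implementation: the class `PlaceholderSketch`, the method `fill`, the XML-escaping of values and the choice to leave unknown keys untouched are all assumptions made for this example. It also assumes each placeholder is contiguous in the XML; Word may split typed text across several runs, in which case a plain string replacement like this would not find it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of docx4j)
class PlaceholderSketch {

    // Matches ${KEY} and captures KEY
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${KEY} that has a mapping; unknown keys are left as-is. */
    static String fill(String xml, Map<String, String> mapping) {
        Matcher m = PLACEHOLDER.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = mapping.get(m.group(1));
            String replacement = (value != null) ? escapeXml(value) : m.group(0);
            // quoteReplacement keeps '$' and '\' in the value literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Escapes the characters that would otherwise break the XML markup. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("VAR_NAME", "Hugo AZEVEDO");

        String xml = "<w:t>Name: ${VAR_NAME}</w:t><w:t>${UNKNOWN}</w:t>";
        // prints <w:t>Name: Hugo AZEVEDO</w:t><w:t>${UNKNOWN}</w:t>
        System.out.println(fill(xml, mapping));
    }
}
```

If a placeholder is not replaced in the generated PDF, inspecting the marshalled XML string for a split `${...}` is a reasonable first check.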


My PDF generation:

Code:
public OutputStream generatePdf(HashMap<String, String> mapping) throws Exception {
        WordprocessingMLPackage wmlPackage = getTemplate();
        MainDocumentPart documentPart = wmlPackage.getMainDocumentPart();

        // Marshal the main document part to XML and substitute the placeholders
        String xml = XmlUtils.marshaltoString(documentPart.getJaxbElement(), true);
        Object obj = XmlUtils.unmarshallFromTemplate(xml, mapping);
        documentPart.setJaxbElement((Document) obj);

        // Font mapping used by the PDF conversion
        Mapper fontMapper = new IdentityPlusMapper();
        wmlPackage.setFontMapper(fontMapper);

        // Convert to PDF via XSL-FO and keep the result in memory
        org.docx4j.convert.out.pdf.PdfConversion c = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wmlPackage);
        OutputStream os = new java.io.ByteArrayOutputStream();
        c.output(os, new PdfSettings());
        return os;
}


Basically, we retrieve the Word template, replace the variables with the values, convert it to PDF and save it to a ByteArrayOutputStream. The ByteArrayOutputStream is then stored in the session and retrieved by the Servlet to open the PDF in the browser.

I based this on examples I saw on the forum and in the Getting Started PDF.

My Word document is 1 page long; it has many tables (and tables within tables), 1 image, 1 header and 1 footer. We could consider it a complex document.

If I click on my GeneratePDF button, it takes around 8 seconds to generate the PDF. That's a lot of time, considering it's instantaneous with Apache FOP.

After some analysis, I've noticed that what takes so long is the c.output(os, new PdfSettings()); line of code.
I've also noticed some layout inconsistencies, and the tables within tables do not appear in the PDF.
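
For reference, the kind of measurement described above can be reproduced with a small timing wrapper around each step. This is only a sketch: `StepTimer` and `timed` are hypothetical names, and the lambda in `main` is a stand-in for the real template loading or conversion call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only
class StepTimer {

    /** Runs the step, prints how long it took, and returns its result. */
    static <T> T timed(String label, Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println(label + " took " + ms + " ms");
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real step such as loading the template or calling c.output(...)
        Integer result = timed("conversion (stand-in)", () -> {
            Thread.sleep(50);
            return 42;
        });
        System.out.println("result = " + result);
    }
}
```

Timing template loading, placeholder replacement and conversion separately, and timing a second request in the same JVM, would show whether the 8 seconds is a one-off start-up cost or is paid on every call.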

If I don't use the template provided by the business and create my own, with the same content but a little simpler (no styles, no tables, etc.), it is faster (around 2 seconds, which is acceptable), but the layout problems persist.

So there seems to be a problem during the PDF conversion. I'd like to know if you've come across similar problems and found workarounds for them.
Are there any best practices I should follow for the problem I am trying to solve?

Thank you in advance for your help!

Best regards,
Hugo Azevedo

Re: docx4j - performance & layout problems

Posted: Fri Oct 19, 2012 4:27 am
by jason
If you want help, you'll have to provide your problematic docx.

You can attach it as a file, or if you consider it sensitive, email it to me.

cheers .. Jason