
PDF Conversion - large document

Posted: Fri Dec 14, 2012 6:34 pm
by Babak
Hello
How can I convert a docx to PDF?

I tried using
Code:
new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage)

but the output is awful, especially the tables (the converter cannot calculate the column widths).
The conversion also requires too much memory.

Thanks.

Re: PDF Conversion

Posted: Fri Dec 14, 2012 7:30 pm
by jason
There are significant improvements to PDF output which have been contributed and applied on GitHub.

However, I don't think they address your specific issues.

You could try using fixed width columns.

Memory requirements for PDF conversion are dictated in part by FOP. (How long is your document, and how much memory is being used?)

As an alternative, you could try JODConverter.

Re: PDF Conversion

Posted: Fri Dec 14, 2012 7:38 pm
by Babak
For a document with 500+ pages and 25-30 tables: 800MB and more than 10 minutes.

Re: PDF Conversion

Posted: Fri Dec 14, 2012 8:38 pm
by jason
It would be interesting to know how that time is split between generating the XSL FO (docx4j's job) and generating the PDF from the XSL FO (FOP's job).

If the docx4j part is slow, it could probably be made faster with code similar to HtmlExporterNonXSLT, i.e. by not going through XSLT.

You can see how long the FOP part takes, by having docx4j save the XSL FO, and then by running just that through FOP.

Re: PDF Conversion

Posted: Fri Dec 14, 2012 8:48 pm
by Babak
jason wrote: generating the XSL FO (docx4j's job)

Do you mean
Code:
((Conversion) c).setSaveFO(new File(timeStamp+".fo"));

?

Re: PDF Conversion

Posted: Fri Dec 14, 2012 9:10 pm
by jason
Yes, that'll give you a file you can feed into FOP independently, to see how long FOP's part of the process takes.
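
To see how the time splits between the two jobs, each phase can be wrapped in a simple wall-clock timer. The following is only a sketch in plain Java: `PhaseTimer`, `Step` and the phase labels are hypothetical names, not part of docx4j or FOP, and the sleeping lambdas are placeholders for the real docx4j step and the standalone FOP run over the saved .fo file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of docx4j or FOP): records wall-clock time per named phase. */
final class PhaseTimer {

    /** A unit of work that may throw, as library conversion calls typically do. */
    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    private final Map<String, Long> millisByPhase = new LinkedHashMap<>();

    /** Runs the phase once and stores its elapsed time in milliseconds, even if it fails. */
    void time(String phase, Step work) {
        long start = System.nanoTime();
        try {
            work.run();
        } catch (Exception e) {
            throw new IllegalStateException("Phase failed: " + phase, e);
        } finally {
            millisByPhase.put(phase, (System.nanoTime() - start) / 1_000_000);
        }
    }

    /** Elapsed milliseconds per phase, in the order the phases were run. */
    Map<String, Long> results() {
        return millisByPhase;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Placeholder: replace with the docx4j step that generates the XSL FO.
        timer.time("docx to XSL FO", () -> Thread.sleep(20));
        // Placeholder: replace with a standalone FOP run over the saved .fo file.
        timer.time("XSL FO to PDF", () -> Thread.sleep(10));
        timer.results().forEach((phase, ms) -> System.out.println(phase + ": " + ms + " ms"));
    }
}
```

A single run like this includes class loading and JIT warm-up, so the numbers are only a rough guide.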

Re: PDF Conversion

Posted: Fri Dec 14, 2012 9:28 pm
by Babak
That part is very quick.
The problems start in
Code:
conversion.output(outStream, new PdfSettings())

Re: PDF Conversion

Posted: Fri Dec 14, 2012 9:45 pm
by jason
setSaveFO only configures the save; the XSL FO is actually generated and written during conversion.output

Re: PDF Conversion

Posted: Fri Dec 14, 2012 9:59 pm
by Babak
OK, how can I check where this happens?

Part of the log:
Code:
  <fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format" font-family="Calibri" font-size="11.0pt" line-height="115%" space-after="4mm">КТТП</fo:block></w:tc>
java.lang.OutOfMemoryError: Java heap space
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)
        at java.util.jar.Manifest.read(Manifest.java:178)
        at java.util.jar.Manifest.<init>(Manifest.java:52)
        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:165)
        at java.util.jar.JarFile.getManifest(JarFile.java:146)
        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:693)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:221)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at org.apache.log4j.spi.LoggingEvent.<init>(LoggingEvent.java:159)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.error(Category.java:322)
        at org.docx4j.XmlUtils$LoggingErrorListener.error(XmlUtils.java:1021)
        at org.apache.xpath.XPath.execute(XPath.java:343)
        at org.apache.xalan.templates.ElemCopyOf.execute(ElemCopyOf.java:132)
        at org.apache.xalan.templates.ElemApplyTemplates.transformSelectedNodes(ElemApplyTemplates.java:393)
        at org.apache.xalan.templates.ElemApplyTemplates.execute(ElemApplyTemplates.java:176)
        at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
        at org.apache.xalan.templates.ElemCopy.execute(ElemCopy.java:114)
        at org.apache.xalan.templates.ElemApplyTemplates.transformSelectedNodes(ElemApplyTemplates.java:393)
        at org.apache.xalan.templates.ElemApplyTemplates.execute(ElemApplyTemplates.java:176)
        at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
        at org.apache.xalan.templates.ElemCopy.execute(ElemCopy.java:114)
        at org.apache.xalan.templates.ElemApplyTemplates.transformSelectedNodes(ElemApplyTemplates.java:393)
        at org.apache.xalan.templates.ElemApplyTemplates.execute(ElemApplyTemplates.java:176)

Re: PDF Conversion

Posted: Mon Dec 17, 2012 12:25 pm
by jason
For discussion of FOP's memory usage, see http://xmlgraphics.apache.org/fop/1.1/r ... tml#memory

One of these tips is to use multiple page sequences. docx4j starts a new page sequence each time it encounters a sectPr of a type other than 'continuous' in the docx. So if you have manual page breaks, each one is a good place to start a new section (of type next page).
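
To check how many page sequences a given docx actually produces, one option is to count the fo:page-sequence elements in the saved .fo file. This is a minimal sketch using only the JDK's StAX parser; `PageSequenceCounter` is a hypothetical name, and it assumes the saved file is well-formed XML in the standard XSL FO namespace.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical diagnostic: counts fo:page-sequence elements in an XSL FO document. */
final class PageSequenceCounter {
    private static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Streams through the document, so even a large .fo file is not loaded as a whole. */
    static int count(Reader fo) {
        int sequences = 0;
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(fo);
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && FO_NS.equals(xml.getNamespaceURI())
                        && "page-sequence".equals(xml.getLocalName())) {
                    sequences++;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return sequences;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageSequenceCounter <file.fo>");
            return;
        }
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println("page-sequences: " + count(in));
        }
    }
}
```

If a 500-page document reports a count of 1, the whole document is a single page sequence, which is the case the FOP memory tips warn about.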

http://apache-fop.1065347.n5.nabble.com ... 30934.html is also quite interesting.

docx4j's conversion code contains:

Code:
if (saveFO != null || log.isDebugEnabled()) {

        ByteArrayOutputStream intermediate = new ByteArrayOutputStream();
        Result intermediateResult = new StreamResult(intermediate);

        XmlUtils.transform(domDoc, xslt, settings.getSettings(), intermediateResult);

        String fo = intermediate.toString("UTF-8");
        log.debug(fo);

        if (saveFO != null) {
                FileUtils.writeStringToFile(saveFO, fo, "UTF-8");
                log.info("Saved " + saveFO.getPath());
        }

        Source src = new StreamSource(new StringReader(fo));

        Transformer transformer = XmlUtils.getTransformerFactory().newTransformer();
        transformer.transform(src, result);
} else {

        XmlUtils.transform(domDoc, xslt, settings.getSettings(), result);
}

So, for debugging, you can have saveFO or debug enabled, but for production, you don't want that.
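
The difference between the two branches can be reproduced with the JDK's own XSLT support. This is a sketch, not docx4j code: `IntermediateBufferDemo` and the tiny stylesheet are stand-ins. The buffered path holds the complete intermediate document in memory (as bytes and again as a String) and then re-parses it through an identity transform, while the direct path writes straight to the final result.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Stand-in for the two code paths above, using only JDK classes. */
final class IntermediateBufferDemo {
    // Tiny stand-in stylesheet; the real docx-to-FO XSLT is far larger.
    static final String XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:template match='/doc'><out><xsl:value-of select='p'/></out></xsl:template>"
            + "</xsl:stylesheet>";

    /** Production-style path: the stylesheet output goes straight to the final result. */
    static String direct(String xml) {
        StringWriter out = new StringWriter();
        transform(true, xml, new StreamResult(out));
        return out.toString();
    }

    /** Debug-style path: buffer the whole intermediate document, then copy it to the result. */
    static String buffered(String xml) {
        ByteArrayOutputStream intermediate = new ByteArrayOutputStream();
        transform(true, xml, new StreamResult(intermediate));
        // At this point the full intermediate document exists twice: as bytes and as a String.
        String fo = new String(intermediate.toByteArray(), StandardCharsets.UTF_8);
        StringWriter out = new StringWriter();
        transform(false, fo, new StreamResult(out)); // identity transform re-parses the String
        return out.toString();
    }

    private static void transform(boolean useStylesheet, String input, Result out) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer t = useStylesheet
                    ? factory.newTransformer(new StreamSource(new StringReader(XSLT)))
                    : factory.newTransformer(); // no stylesheet: identity copy
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new StreamSource(new StringReader(input)), out);
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><p>hello</p></doc>";
        System.out.println("same output: " + direct(xml).equals(buffered(xml)));
    }
}
```

Both paths produce the same document; the buffered one simply costs an extra serialise and parse pass plus the memory for the copies.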

Re: PDF Conversion - large document

Posted: Mon Dec 17, 2012 3:31 pm
by jason
I ran some experiments on 500 pages from the OpenXML spec.

Conclusions:
- it doesn't matter much whether the intermediate XSL FO is created explicitly
- a 2GB heap (-Xmx2g) would be enough for those 500 pages (though I used 8GB)
- the docx4j part of the process is slower than the FOP part; replacing the XSLT stuff is an optimisation worth trying
- it is probably worth having more page-sequences (from sectPr); this affects our XSLT performance (i.e. not just FOP)
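
Before rerunning a large conversion, it can help to confirm which heap limit the JVM actually received, since a missing or misplaced -Xmx option leaves the default in place. A minimal sketch using only `java.lang.Runtime`; `HeapReport` is a hypothetical name.

```java
/** Hypothetical helper: reports the heap limit and current usage of the running JVM. */
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Upper bound the JVM will try to use for the heap (governed by -Xmx), in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied (allocated minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

Calling `usedHeapMiB()` before and after the conversion gives a rough picture only, because garbage that has not been collected yet is counted as used.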

Re: PDF Conversion - large document

Posted: Mon Dec 17, 2012 7:55 pm
by jason
jason wrote: the docx4j part of the process is slower than the FOP part. replacing the XSLT stuff is an optimisation worth trying


A quick proof of concept suggests this optimisation is well worth implementing fully: on a particular 500 page document, the time for this part of the process went from 132 sec to 17 sec (single run, so probably not a JIT effect), and it uses less memory (see Test 8 in the attached docx).