Plutext

Posted: **Fri Dec 14, 2012 6:34 pm**

Hello,
How can I convert docx to PDF?

I tried using

Code: new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage)

but the output is awful, especially for tables (the converter cannot calculate the column widths).
The conversion also requires too much memory.

Thanks.

Posted: **Fri Dec 14, 2012 7:30 pm**

There are significant improvements to PDF output which have been contributed and applied on GitHub.

However, I don't think they address your specific issues.

You could try using fixed width columns.

Memory requirements for PDF conversion are dictated in part by FOP. (How long is your document, and how much memory is being used?)

As an alternative, you could try JODConverter.

Posted: **Fri Dec 14, 2012 7:38 pm**

For a document with 500+ pages and 25-30 tables: 800MB and more than 10 minutes.

Posted: **Fri Dec 14, 2012 8:38 pm**

It would be interesting to know how that time is split between generating the XSL FO (docx4j's job) and generating the PDF from the XSL FO (FOP's job).

If the docx4j part is slow, it could probably be made faster by code similar to HtmlExporterNonXSLT.

You can see how long the FOP part takes, by having docx4j save the XSL FO, and then by running just that through FOP.
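To make that comparison concrete, here is a minimal JDK-only sketch of timing the two runs separately. `StageTimer` and `pretendWork` are hypothetical names, and the sleeps are placeholders: in a real test the first stage would be the full docx-to-PDF conversion and the second a FOP-only run over the saved .fo file.

```java
class StageTimer {
    /** Runs the given stage once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Placeholder workload so the sketch runs on its own. */
    static void pretendWork(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: replace with the full conversion
        // and with a FOP-only run over the saved .fo file.
        long fullMillis = timeMillis(() -> pretendWork(300));
        long fopOnlyMillis = timeMillis(() -> pretendWork(100));

        System.out.println("full conversion:  " + fullMillis + " ms");
        System.out.println("FOP only:         " + fopOnlyMillis + " ms");
        // Whatever is left over is roughly the docx4j (XSL FO generation) share.
        System.out.println("docx4j (approx.): " + (fullMillis - fopOnlyMillis) + " ms");
    }
}
```

A single run like this includes class loading and JIT warm-up, so treat the numbers as rough.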

Posted: **Fri Dec 14, 2012 8:48 pm**

"generating the XSL FO (docx4j's job)"

You mean

Code: ((Conversion) c).setSaveFO(new File(timeStamp + ".fo"));

?

Posted: **Fri Dec 14, 2012 9:10 pm**

Yes, that'll give you a file you can feed into FOP independently, to see how long FOP's part of the process takes.

Posted: **Fri Dec 14, 2012 9:28 pm**

This part is very quick.
The problems start in

Code: conversion.output(outStream, new PdfSettings())

Posted: **Fri Dec 14, 2012 9:45 pm**

setSaveFO only configures that to happen; the FO is actually written during conversion.output.

Posted: **Fri Dec 14, 2012 9:59 pm**

OK, how can I check where this happens?

Part of the log:

    <fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format" font-family="Calibri" font-size="11.0pt" line-height="115%" space-after="4mm">КТТП</fo:block></w:tc>
    java.lang.OutOfMemoryError: Java heap space
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)
        at java.util.jar.Manifest.read(Manifest.java:178)
        at java.util.jar.Manifest.<init>(Manifest.java:52)
        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:165)
        at java.util.jar.JarFile.getManifest(JarFile.java:146)
        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:693)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:221)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at org.apache.log4j.spi.LoggingEvent.<init>(LoggingEvent.java:159)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.error(Category.java:322)
        at org.docx4j.XmlUtils$LoggingErrorListener.error(XmlUtils.java:1021)
        at org.apache.xpath.XPath.execute(XPath.java:343)
        at org.apache.xalan.templates.ElemCopyOf.execute(ElemCopyOf.java:132)
        at org.apache.xalan.templates.ElemApplyTemplates.transformSelectedNodes(ElemApplyTemplates.java:393)
        at org.apache.xalan.templates.ElemApplyTemplates.execute(ElemApplyTemplates.java:176)
        at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
        at org.apache.xalan.templates.ElemCopy.execute(ElemCopy.java:114)
        at org.apache.xalan.templates.ElemApplyTemplates.transformSelectedNodes(ElemApplyTemplates.java:393)
        at org.apache.xalan.templates.ElemApplyTemplates.execute(ElemApplyTemplates.java:176)
        at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
        at org.apache.xalan.templates.ElemCopy.execute(ElemCopy.java:114)
        at org.apache.xalan.templates.ElemApplyTemplates.transformSelectedNodes(ElemApplyTemplates.java:393)
        at org.apache.xalan.templates.ElemApplyTemplates.execute(ElemApplyTemplates.java:176)

Posted: **Mon Dec 17, 2012 12:25 pm**

For discussion of FOP's memory usage, see http://xmlgraphics.apache.org/fop/1.1/r ... tml#memory

One of these tips is to use multiple page sequences. docx4j will start a new page sequence each time a sectPr of type other than 'continuous' is encountered in the docx. So if you have manual page breaks, each of those is a good place to start a new section (next page).
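To get a rough idea of how many page sequences a given docx would produce under that rule, you could count its non-continuous sectPr elements. The following is a JDK-only sketch, not docx4j code: `SectionBreakCounter` is a hypothetical name, it assumes the main part lives at the usual word/document.xml, it treats a missing w:type as "nextPage", and it does not special-case sectPr copies stored inside tracked changes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SectionBreakCounter {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts w:sectPr elements whose w:type is absent or not "continuous". */
    static int count(XMLStreamReader xml) throws XMLStreamException {
        int count = 0;
        boolean inSectPr = false;
        boolean continuous = false;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && W_NS.equals(xml.getNamespaceURI())) {
                String name = xml.getLocalName();
                if ("sectPr".equals(name)) {
                    inSectPr = true;
                    continuous = false; // no w:type child means "nextPage"
                } else if (inSectPr && "type".equals(name)) {
                    continuous = "continuous".equals(xml.getAttributeValue(W_NS, "val"));
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && W_NS.equals(xml.getNamespaceURI())
                    && "sectPr".equals(xml.getLocalName())) {
                if (!continuous) {
                    count++;
                }
                inSectPr = false;
            }
        }
        return count;
    }

    /** Convenience overload for document XML already held as a String. */
    static int countInXml(String documentXml) {
        try {
            return count(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(documentXml)));
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) throws IOException, XMLStreamException {
        if (args.length != 1) {
            System.err.println("usage: java SectionBreakCounter.java file.docx");
            return;
        }
        try (ZipFile docx = new ZipFile(args[0])) {
            // Assumption: the main document part is at the usual location.
            ZipEntry entry = docx.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("word/document.xml not found");
                return;
            }
            try (InputStream in = docx.getInputStream(entry)) {
                XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
                System.out.println("non-continuous sectPr elements: " + count(xml));
            }
        }
    }
}
```

A count of 1 for a 500 page document would suggest everything ends up in a single page sequence, which is the case the FOP memory tips warn about.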

http://apache-fop.1065347.n5.nabble.com ... 30934.html is also quite interesting.

docx4j's conversion code contains:

    if (saveFO != null || log.isDebugEnabled()) {
        ByteArrayOutputStream intermediate = new ByteArrayOutputStream();
        Result intermediateResult = new StreamResult(intermediate);
        XmlUtils.transform(domDoc, xslt, settings.getSettings(), intermediateResult);

        String fo = intermediate.toString("UTF-8");
        log.debug(fo);

        if (saveFO != null) {
            FileUtils.writeStringToFile(saveFO, fo, "UTF-8");
            log.info("Saved " + saveFO.getPath());
        }

        Source src = new StreamSource(new StringReader(fo));
        Transformer transformer = XmlUtils.getTransformerFactory().newTransformer();
        transformer.transform(src, result);
    } else {
        XmlUtils.transform(domDoc, xslt, settings.getSettings(), result);
    }

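So, for debugging, you can have saveFO or debug logging enabled, but for production you don't want that: that branch holds the entire XSL FO in memory (as a byte array and again as a String) and runs a second transform over it.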
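The difference between the two branches can be reproduced with plain JAXP. This is only a sketch of their shape: `BufferedVsDirect` is a hypothetical name, an identity transform stands in for docx4j's stylesheet, and a StringWriter stands in for the final result that FOP would consume.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BufferedVsDirect {
    /** Identity transformer; an assumption standing in for the real XSLT. */
    static Transformer identity() throws TransformerException {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        return t;
    }

    /** Production-style path: one transform, written straight to the final result. */
    static String direct(String xml) {
        try {
            StringWriter out = new StringWriter();
            identity().transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Debug-style path: buffer everything, copy it to a String, then transform again. */
    static String buffered(String xml) {
        try {
            // First full copy: the serialized output as bytes.
            ByteArrayOutputStream intermediate = new ByteArrayOutputStream();
            identity().transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(intermediate));

            // Second full copy: the same content decoded to a String
            // (this is where it could be logged or saved to a file).
            String fo = new String(intermediate.toByteArray(), StandardCharsets.UTF_8);

            // Extra pass: re-parse the String and feed it to the final result.
            StringWriter out = new StringWriter();
            identity().transform(new StreamSource(new StringReader(fo)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><block>text</block></root>";
        System.out.println("same output: " + direct(xml).equals(buffered(xml)));
    }
}
```

Both paths produce the same result; the buffered one just pays for extra copies and an extra parse along the way.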

Posted: **Mon Dec 17, 2012 3:31 pm**

I ran some experiments on 500 pages from the OpenXML spec.

Conclusions:
- it doesn't much matter whether you explicitly create the intermediate XSL FO
- a 2GB heap (-Xmx2g) would be enough for those 500 pages (though I used 8GB)
- the docx4j part of the process is slower than the FOP part. replacing the XSLT stuff is an optimisation worth trying
- it is probably worth having more page-sequences (from sectPr); this affects our XSLT performance (i.e. not just FOP)
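When experimenting with heap sizes like these, it can help to confirm what the JVM actually received. A small sketch (`HeapReport` is a hypothetical name); the reported maximum is usually close to, but not always exactly, the -Xmx value:

```java
class HeapReport {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                     // upper bound the JVM will try to use
        long used = rt.totalMemory() - rt.freeMemory(); // heap currently occupied
        System.out.println("max heap:    " + toMiB(max) + " MiB");
        System.out.println("heap in use: " + toMiB(used) + " MiB");
    }
}
```

Run it with the same options as the conversion, for example `java -Xmx2g HeapReport`.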

Posted: **Mon Dec 17, 2012 7:55 pm**

jason wrote: the docx4j part of the process is slower than the FOP part. replacing the XSLT stuff is an optimisation worth trying

A quick proof of concept of this makes the optimisation look well worth fully implementing: time on a particular 500 page document for this part of the process went from 132 sec to 17 sec (single run, so probably not JIT), and it uses less memory (see Test 8 in the attached docx).

PDF Conversion - large document
