
PDF is different from the source docx file

Posted: Tue Jan 27, 2015 12:21 pm
by ctauken
I have been working on converting a docx file to PDF. The conversion runs successfully; however, the PDF output does not match the source document:

- the font looks different
- the PDF is one page shorter than the docx
- the spacing seems to be different in the pdf
- horizontal rules are not converted
- the header and footers are messed up
- the table of contents is different

There was an issue converting the file, so I had to set the PP_PDF_APACHEFOP_DISABLE_PAGEBREAK_LIST_ITEM property. I am not sure whether this is what is affecting the layout, but it was the only way I could get the conversion to work.

I have included the files in question.

Please let me know what can be done to fix the issues.

Vendor Remote Access.docx
Input file
(58 KiB)

Vendor Remote Access.pdf
Output file
(175.35 KiB)


Thanks

Re: PDF is different from the source docx file

Posted: Fri Jan 30, 2015 7:56 pm
by jason
Your docx gave:

Code:
INFO org.apache.fop.apps.FOUserAgent .processEvent line 83 - An fo:table  (See position 445:446) is wider than the available room in inline-progression-dimension. Adjusting end-indent based on overconstrained geometry rules (XSL 1.1, ch. 5.3.4)
Exception in thread "main" org.docx4j.openpackaging.exceptions.Docx4JException: Exception exporting package; FOP https://issues.apache.org/bugzilla/show_bug.cgi?id=54094 .. try PP_APACHEFOP_DISABLE_PAGEBREAK_LIST_ITEM
   at org.docx4j.convert.out.common.AbstractExporter.export(AbstractExporter.java:90)
   at org.docx4j.Docx4J.toFO(Docx4J.java:466)
   at org.docx4j.samples.ConvertOutPDF.main(ConvertOutPDF.java:183)
Caused by: java.lang.IllegalArgumentException: Only non-null Positions with an index can be checked
   at org.apache.fop.layoutmgr.AbstractLayoutManager.verifyNonNullPosition(AbstractLayoutManager.java:309)
   at org.apache.fop.layoutmgr.AbstractLayoutManager.isFirst(AbstractLayoutManager.java:321)
   at org.apache.fop.layoutmgr.list.ListItemContentLayoutManager.addAreas(ListItemContentLayoutManager.java:136)
   at org.apache.fop.layoutmgr.list.ListItemLayoutManager.addAreas(ListItemLayoutManager.java:542)
   at org.apache.fop.layoutmgr.list.ListBlockLayoutManager.addAreas(ListBlockLayoutManager.java:184)
   at org.apache.fop.layoutmgr.AreaAdditionUtil.addAreas(AreaAdditionUtil.java:113)
   at org.apache.fop.layoutmgr.FlowLayoutManager.addAreas(FlowLayoutManager.java:364)
   at org.apache.fop.layoutmgr.PageBreaker.addAreas(PageBreaker.java:285)
   at org.apache.fop.layoutmgr.AbstractBreaker.addAreas(AbstractBreaker.java:607)
   at org.apache.fop.layoutmgr.AbstractBreaker.addAreas(AbstractBreaker.java:481)
   at org.apache.fop.layoutmgr.PageBreaker.doPhase3(PageBreaker.java:313)
   at org.apache.fop.layoutmgr.AbstractBreaker.doLayout(AbstractBreaker.java:436)
   at org.apache.fop.layoutmgr.PageBreaker.doLayout(PageBreaker.java:90)
   at org.apache.fop.layoutmgr.PageSequenceLayoutManager.activateLayout(PageSequenceLayoutManager.java:113)
   at org.apache.fop.area.AreaTreeHandler.endPageSequence(AreaTreeHandler.java:267)
   at org.apache.fop.fo.pagination.PageSequence.endOfNode(PageSequence.java:128)
   at org.apache.fop.fo.FOTreeBuilder$MainFOHandler.endElement(FOTreeBuilder.java:347)
   at org.apache.fop.fo.FOTreeBuilder.endElement(FOTreeBuilder.java:181)
   at org.apache.xalan.transformer.TransformerIdentityImpl.endElement(TransformerIdentityImpl.java:1102)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
   at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:485)
   at org.docx4j.convert.out.fo.renderers.FORendererApacheFOP.render(FORendererApacheFOP.java:211)
   at org.docx4j.convert.out.fo.renderers.FORendererApacheFOP.render(FORendererApacheFOP.java:158)
   at org.docx4j.convert.out.fo.AbstractFOExporter.postprocess(AbstractFOExporter.java:140)
   at org.docx4j.convert.out.fo.AbstractFOExporter.postprocess(AbstractFOExporter.java:1)
   at org.docx4j.convert.out.common.AbstractExporter.export(AbstractExporter.java:81)
   ... 2 more


until, as you say, you set PP_APACHEFOP_DISABLE_PAGEBREAK_LIST_ITEM as suggested.

But that doesn't explain everything you note.

On "the header and footers are messed up":

The header looks like it is being justified for some reason; that is no doubt fixable.

Your footers use tabs for center and right alignment, and tabs are problematic in XSL FO. Tables with invisible borders work better.
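
To illustrate the table suggestion, here is a rough, hand-written sketch of the kind of XSL FO a borderless three-cell footer table corresponds to. It is not docx4j's actual output; the class name, method names and footer text are all made up for the example. The point is only that each cell carries its own text-align, so no tab stops are needed.

```java
// Hand-written sketch only: NOT what docx4j emits, just the general shape of
// a borderless three-column fo:table used in place of tab stops.
class FooterTableSketch {

    // Escape the characters that are unsafe in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // One cell with no border attributes (so no visible borders),
    // whose block is aligned as requested.
    static String cell(String align, String text) {
        return "      <fo:table-cell>"
             + "<fo:block text-align=\"" + align + "\">" + escape(text) + "</fo:block>"
             + "</fo:table-cell>\n";
    }

    // Three equal-width columns holding the left, center and right footer text.
    static String footerTable(String left, String center, String right) {
        StringBuilder fo = new StringBuilder();
        fo.append("<fo:table xmlns:fo=\"http://www.w3.org/1999/XSL/Format\"")
          .append(" table-layout=\"fixed\" width=\"100%\">\n");
        for (int i = 0; i < 3; i++) {
            fo.append("  <fo:table-column column-width=\"proportional-column-width(1)\"/>\n");
        }
        fo.append("  <fo:table-body>\n    <fo:table-row>\n");
        fo.append(cell("start", left));
        fo.append(cell("center", center));
        fo.append(cell("end", right));
        fo.append("    </fo:table-row>\n  </fo:table-body>\n</fo:table>\n");
        return fo.toString();
    }

    public static void main(String[] args) {
        // Hypothetical footer text, for illustration only.
        System.out.println(footerTable("Vendor Remote Access", "Confidential", "Page 1"));
    }
}
```

In the Word document itself, the equivalent change would be a one-row, three-column table in the footer with its borders set to none, replacing the tab-separated text.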

The TOC is also problematic because of tabs.

By the way, there is an alternative approach to PDF output in the works. I've attached the result FYI. As you can see, the output is much closer to your input (i.e. tabs aren't a problem with this other approach). (Ignore the little red debug markers and the temporary lines corresponding to manual page breaks.) However, it's going to be commercial, not open source. If you'd like to enquire further about that, please contact me off list.

Re: PDF is different from the source docx file

Posted: Sat Jan 31, 2015 11:25 am
by ctauken
Thanks Jason.

What can I do to avoid using the PP_APACHEFOP_DISABLE_PAGEBREAK_LIST_ITEM flag? Is there something specific in the MS Word file I can change in order to avoid this issue?