
Converting docx to PDF not Preserving Whitespace

Posted: Fri Dec 07, 2018 5:19 am
by alex_docx_man
Hi,

I am trying to convert a docx to a PDF using the Docx4J.toFO() method. Everything works as expected except for the treatment of white space: any run of consecutive spaces is collapsed to a single space, so the resulting PDF loses its formatting. It is essential that the PDF retain the white space. I have spent a LOT of time researching this and am fairly confident that, with the current settings, the white space should be preserved, but it isn't, so now I'm here. Below is a detailed explanation of every step I take to produce the PDF, in the hope that it shows more experienced people where my problem lies.

First, the docx is created using a mail merge (output = org.docx4j.model.fields.merge.MailMerger.getConsolidatedResultCrude(wordMLPackage, data);). I have a docx template that gets merged with some text, since the end goal is to produce many documents with different text but the same header. The template is attached as "Docx_Template.docx". The merge works by parsing an incoming text file with a .index extension (Incoming_Text.index). The file contains a key-value pair whose key matches the mail merge field in the template: mfcpty. The text is mapped to the field successfully and the resulting docx looks exactly as it is supposed to. The final docx is attached as "Incoming_Text.docx".

Next, the docx is sent to the code that handles the conversion to PDF. Here is a snippet of the code that does the conversion:

Code:
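// Load the merged docx.
wxmlPackage = WordprocessingMLPackage.load(convFile);

// Map the document's fonts to the physical fonts found on this machine.
IdentityPlusMapper fontMapper = new IdentityPlusMapper();
wxmlPackage.setFontMapper(fontMapper);
PhysicalFonts.discoverPhysicalFonts();

// Configure the FO export and dump the intermediate XSL-FO for inspection.
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(new java.io.File("foSettings.xml"));
foSettings.setWmlPackage(wxmlPackage);

// try-with-resources closes the stream once the PDF has been written.
try (OutputStream os = new java.io.FileOutputStream(pdfFileName)) {
    Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
}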


The resulting PDF is called Output.pdf and is attached. As you can see, the white space is not preserved. To debug this I dumped the intermediate FOSettings file. On inspection it clearly shows, in multiple places, the attributes white-space-collapse="false" and white-space="pre". I did some research and learned that some people had problems with the attribute white-space-treatment="preserve" not doing its job. That attribute was all over my FOSettings file (FOSettings.xml), so I went as far as altering the docx4j and docx4j-export-fo jars to change the way the PDF is created. I managed to successfully change the resulting FO settings file to the white-space attributes that you see in the attached file. I am stumped as to why white space is not being preserved even though it so clearly says everywhere that it will be. Any help is greatly appreciated.

Thank you

Edit: For some more clarification, even if I add a large number of spaces between the words, those spaces are also collapsed and disappear.
Also, the attached FOSettings file is the one produced AFTER I altered the docx4j jars. The original is slightly different, but the resulting Output.pdf is identical.

Re: Converting docx to PDF not Preserving Whitespace

Posted: Fri Dec 07, 2018 9:46 am
by jason
Quote:
I managed to successfully change the resulting FO settings file to the white-space attributes that you see in the attached file. I am stumped as to why white space is not being preserved even though it so clearly says everywhere that it will be. Any help is greatly appreciated.


First, to check assumptions: Does the original intermediate FO output contain all your white space, or is some missing by then?
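One rough way to check this without reading the dump by eye is sketched below. It uses only the JDK's DOM parser (no docx4j), and the class and method names are made up for illustration. It reports the longest run of consecutive spaces found in any text node of the FO. A result greater than 1 suggests the spaces made it into the FO and are being collapsed later.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or FOP.
public class FoWhitespaceCheck {

    /** Returns the longest run of consecutive spaces in any text node of the given FO. */
    public static int longestSpaceRun(String foXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(foXml)));
            return longestRun(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse FO input", e);
        }
    }

    private static int longestRun(Node node) {
        int best = 0;
        // Whitespace-only text nodes are skipped so that pretty-print
        // indentation between elements is not counted.
        if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            int run = 0;
            for (char c : node.getNodeValue().toCharArray()) {
                run = (c == ' ') ? run + 1 : 0;
                best = Math.max(best, run);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            best = Math.max(best, longestRun(children.item(i)));
        }
        return best;
    }

    public static void main(String[] args) {
        String fo = "<fo:block xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
                + "<fo:inline>More text here     with spaces</fo:inline></fo:block>";
        System.out.println("Longest run of spaces: " + longestSpaceRun(fo));
    }
}
```

To run it against a real dump, read the file into a String first (for example with java.nio.file.Files) and pass that in.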

If it does contain the expected whitespace, it's a matter of fiddling with the @white* attributes to achieve your desired results. You can change those attribute values in your fo file, then feed it directly into FOP (i.e. without docx4j) to get a PDF.

Once you know what values need to be changed, that could be adjusted at the docx4j end.
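If hand-editing the fo file gets tedious, the attribute changes can be scripted. Here is a minimal sketch, assuming the dump uses the standard XSL-FO namespace. The class name and the particular attribute values are only illustrative, since which values actually fix the output is exactly what needs experimenting with. It uses just the JDK's DOM parser and identity transformer, so the text content is written back unchanged.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or FOP.
public class FoWhitespaceAttributes {

    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Sets whitespace-related XSL-FO properties on every block and returns the new FO. */
    public static String preserveWhitespace(String foXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(foXml)));

            NodeList blocks = doc.getElementsByTagNameNS(FO_NS, "block");
            for (int i = 0; i < blocks.getLength(); i++) {
                Element block = (Element) blocks.item(i);
                // Assumed values to experiment with; adjust as needed.
                block.setAttribute("white-space-collapse", "false");
                block.setAttribute("white-space-treatment", "preserve");
                block.setAttribute("linefeed-treatment", "preserve");
            }

            // Identity transform: serializes the DOM without re-indenting it.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite FO input", e);
        }
    }

    public static void main(String[] args) {
        String fo = "<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
                + "<fo:block><fo:inline>a   b</fo:inline></fo:block></fo:root>";
        System.out.println(preserveWhitespace(fo));
    }
}
```

The result can be saved as a .fo file and run through FOP directly to see the effect of each value.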

Since you have XSL FO that you think should be correct, you could also ask on the FOP mailing list.

I notice we have stuff like:

Code:
<inline font-size="9.0pt">
    <block line-height="0pt" linefeed-treatment="preserve"
           white-space-collapse="false">
    </block>
    <inline font-family="Courier New" white-space="pre"
            white-space-collapse="false">
    </inline>
</inline>


(pretty-printed), i.e. a block inside an inline. We should look into that...

Re: Converting docx to PDF not Preserving Whitespace

Posted: Sat Dec 08, 2018 2:23 am
by alex_docx_man
Yes, the original intermediate FO settings file contains all the white space. I should also mention that using Docx4J.toPDF() results in the same white space truncation.

What do you mean by "@white*"?

I will try to set up some code that takes the FOSettings file and produces a PDF, so that I can experiment with the settings file. Are there any changes to the XSL FO settings file that you recommend? I've pretty much littered it with white-space-collapse="false" and white-space="pre".

Thank you

Re: Converting docx to PDF not Preserving Whitespace

Posted: Sat Dec 08, 2018 3:59 pm
by jason
@white* was just shorthand for the various whitespace attributes.

I'd suggest you start again with the original docx4j .fo output, then strip it down until you have just the single fo:block you want to get working correctly.

You can then convert the .fo to PDF without writing any code, per https://xmlgraphics.apache.org/fop/2.3/running.html

Code:
  fop foo.fo foo.pdf
  fop -fo foo.fo -pdf foo.pdf (does the same as the previous line)


Then play with the attributes until your output looks correct :-)

Re: Converting docx to PDF not Preserving Whitespace

Posted: Thu Dec 13, 2018 7:03 am
by alex_docx_man
So this is a bit of an update. I've managed to get the .fo settings to produce the desired output. Now it's a matter of changing the docx4j code to get that same output. I'm fairly close but I'm stuck on one last thing. There was a <block> element that was giving me trouble, and removing it seemed to fix the spacing perfectly. The format of the original .fo file is like this:
Code:
....
</block><inline> Text is going here </inline> <block attribute="value">
</block><inline> More text here         with spaces</inline> <block attribute="value">
</block> ..... and so on...


What fixed the white space problem was updating a parent block to have some more attributes (which I've managed to replicate in docx4j) and removing the <block></block> element from the example above (which I'm having trouble with). This is what it needs to look like:

Code:
<inline> Text is going here </inline>
<inline> More text here         with spaces</inline>
...


However, I'm finding it difficult to replicate this in docx4j. I've found the code that emits the <block> </block> after the inline: it's the else statement in BrWriter.java, in the package org.docx4j.convert.out.fo inside docx4j-export-fo-6.0.1.jar. Commenting out everything in this else statement (line 56) successfully removes the <block> </block> element from the .fo settings, BUT it also removes the new line that was associated with it. So the .fo settings file looks like this:

Code:
<inline> Text is going here </inline><inline> More text here         with spaces</inline><inline> Continues on like this </inline>


This affects the resulting PDF, as all the text ends up on one line. It seems that calling .setTextContent("\n") on the <block> element makes the block element look like:
Code:
<block>
</block>


Would you know how to go about getting the behavior above (an <inline> </inline> pair on each new line)? I tried calling setTextContent("\n") on the <inline> element but it made no difference.
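To make the target structure concrete, here is a rough sketch of the transformation I'm after, done as post-processing on the dumped .fo with plain JDK DOM calls rather than inside BrWriter. The class name is made up, and this is not docx4j API. It swaps each whitespace-only <block> for a newline text node, so every <inline> ends up on its own line. I'm assuming the enclosing block carries linefeed-treatment="preserve" so that the newline is actually rendered as a line break.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or FOP.
public class EmptyBlockToNewline {

    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Replaces every nested, whitespace-only block with a "\n" text node. */
    public static String replaceEmptyBlocks(String foXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(foXml)));

            // The NodeList is live, so collect the matches before editing the tree.
            NodeList blocks = doc.getElementsByTagNameNS(FO_NS, "block");
            List<Element> empty = new ArrayList<>();
            for (int i = 0; i < blocks.getLength(); i++) {
                Element block = (Element) blocks.item(i);
                boolean nested = block.getParentNode() instanceof Element;
                boolean noChildElements = block.getElementsByTagName("*").getLength() == 0;
                if (nested && noChildElements && block.getTextContent().trim().isEmpty()) {
                    empty.add(block);
                }
            }
            for (Element block : empty) {
                block.getParentNode().replaceChild(doc.createTextNode("\n"), block);
            }

            // Identity transform: serializes the DOM without re-indenting it.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite FO input", e);
        }
    }

    public static void main(String[] args) {
        String fo = "<block xmlns=\"" + FO_NS + "\"><inline>Text</inline>"
                + "<block>\n</block><inline>More   text</inline></block>";
        System.out.println(replaceEmptyBlocks(fo));
    }
}
```

Doing the same thing at the source would mean having the writer emit a text node instead of a <block> element, which is the part I haven't worked out.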

Ideal .fo settings file is attached and is called "foSettings_IDEAL.xml"
The closest .fo settings file I could get to the ideal is called "foSettings.xml"
They are saved as .xml because it doesn't let me upload .fo

Thank you.

Re: Converting docx to PDF not Preserving Whitespace

Posted: Tue Dec 18, 2018 8:13 pm
by jason
Can you make a short docx file (i.e. 1 or 2 paragraphs) which produces the problematic fo blocks you've identified?