
Exception converting docx, with long pointed list, into pdf

Posted: Fri Mar 09, 2018 4:01 am
by bs_dellacqua
Hi all,
I'm using docx4j 3.3.6.
I have the attached docx file, and when I try to convert it to PDF with the following code I get an exception:

public static void main(String[] args) throws Exception {

    String regex = null;
    PhysicalFonts.setRegex(regex);

    String inputfilepath = System.getProperty("user.dir") + "/pdf/InvalidXmlCharacter.docx";
    String outputfilepath = System.getProperty("user.dir") + "/pdf/InvalidXmlCharacter.pdf";

    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File(inputfilepath));

    // Refresh the values of DOCPROPERTY fields
    FieldUpdater updater = new FieldUpdater(wordMLPackage);
    updater.update(true);

    // All methods write to an output stream
    OutputStream os = new java.io.FileOutputStream(outputfilepath);

    System.out.println("Attempting to use XSL FO");

    Mapper fontMapper = new IdentityPlusMapper();
    wordMLPackage.setFontMapper(fontMapper);

    PhysicalFont font = PhysicalFonts.get("Arial Unicode MS");

    FOSettings foSettings = Docx4J.createFOSettings();
    foSettings.setFoDumpFile(new File(inputfilepath + ".fo"));
    foSettings.setWmlPackage(wordMLPackage);

    Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);

    System.out.println("Saved: " + outputfilepath);

}


Exception exporting package: org.docx4j.openpackaging.exceptions.Docx4JException: Exception writing Document to OutputStream: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 13372; Character reference "&#0" is an invalid XML character.
at org.docx4j.utils.XmlSerializerUtil.serialize(XmlSerializerUtil.java:50)
at org.docx4j.utils.XmlSerializerUtil.serialize(XmlSerializerUtil.java:14)
at org.docx4j.convert.out.fo.renderers.FORendererApacheFOP.render(FORendererApacheFOP.java:209)
at org.docx4j.convert.out.fo.renderers.FORendererApacheFOP.render(FORendererApacheFOP.java:159)
at org.docx4j.convert.out.fo.AbstractFOExporter.postprocess(AbstractFOExporter.java:168)
at org.docx4j.convert.out.fo.AbstractFOExporter.postprocess(AbstractFOExporter.java:47)
at org.docx4j.convert.out.common.AbstractExporter.export(AbstractExporter.java:82)
at org.docx4j.Docx4J.toFO(Docx4J.java:597)
at org.docx4j.Docx4J.toPDF(Docx4J.java:612)
...
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 13372; Character reference "&#0" is an invalid XML character.
at org.docx4j.org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:506)
at org.docx4j.utils.XmlSerializerUtil.serialize(XmlSerializerUtil.java:47)
... 94 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 13372; Character reference "&#0" is an invalid XML character.
at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1219)
at __redirected.__XMLReaderFactory.parse(__XMLReaderFactory.java:176)
at org.docx4j.org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:489)
... 95 more

The exception seems to be caused by a long lettered list in the docx (A, B, ..., Z, AA). If I remove the AA item, the conversion succeeds. The problem doesn't occur if I use a numbered list (1, 1.1, 1.1.1, 2, 3, ...).

Any idea how to solve the problem?

Thanks

Re: Exception converting docx, with long pointed list, into

Posted: Wed Mar 14, 2018 6:31 pm
by jason
The FO ends with:

     <fo:list-block font-size="11.0pt" provisional-distance-between-starts="0.25in" start-indent="0.25in" text-indent="0in">
        <fo:list-item>
          <fo:list-item-label>
            <fo:block font-family="Calibri">&#0;.</fo:block>
          </fo:list-item-label>
          <fo:list-item-body start-indent="body-start()">
            <fo:block line-height="115%" space-after="4mm">
              <inline xmlns="http://www.w3.org/1999/XSL/Format" font-family="Calibri">27</inline>
            </fo:block>
          </fo:list-item-body>
        </fo:list-item>
      </fo:list-block>
      <fo:block font-size="11.0pt" line-height="115%" space-after="4mm" white-space-treatment="preserve"> </fo:block>



    </fo:flow>
  </fo:page-sequence>
</fo:root>
 


Where does the &#0; come from?!
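For anyone who wants to locate a reference like the one in the list-item label above before FOP chokes on it, here is a minimal sketch that scans FO text (for example, the contents of the file passed to setFoDumpFile) for numeric character references that XML 1.0 does not allow. The class and method names (InvalidCharRefFinder, findInvalidRefs) are made up for illustration and are not part of docx4j.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j.
public class InvalidCharRefFinder {

    // Matches decimal (&#0;) and hexadecimal (&#x0;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // The Char production from the XML 1.0 specification.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a description of each character reference that is not a legal XML 1.0 character.
    static List<String> findInvalidRefs(String xml) {
        List<String> bad = new ArrayList<>();
        Matcher m = CHAR_REF.matcher(xml);
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, so treat as invalid
            }
            if (!isXml10Char(cp)) {
                bad.add(m.group() + " at offset " + m.start());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String fo = "<fo:block font-family=\"Calibri\">&#0;.</fo:block>";
        for (String hit : findInvalidRefs(fo)) {
            System.out.println(hit);
        }
    }
}
```

Reading the dumped .fo file into a String and passing it to findInvalidRefs should point at the same spot the SAXParseException reports by column number.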

It looks like the numbering is X, Y, Z, and then the next number (item 27) should be AA.
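To illustrate the rollover past Z, here is a rough sketch of upper-letter numbering. It assumes the scheme Word uses for this format, where the same letter is repeated after each pass through the alphabet (A..Z, AA, BB, ..., ZZ, AAA), rather than spreadsheet-style AA, AB. This is not the code from the commit referenced in this thread; the class and method names are invented for the example.

```java
// Hypothetical illustration, not the actual docx4j fix.
public class UpperLetterSketch {

    // Formats a 1-based list position as an upper-case letter label.
    // Assumption: after Z the same letter repeats (27 -> AA, 28 -> BB, 53 -> AAA).
    static String upperLetter(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        char letter = (char) ('A' + (n - 1) % 26); // which letter of the alphabet
        int repeats = (n - 1) / 26 + 1;            // how many times it is repeated
        StringBuilder sb = new StringBuilder(repeats);
        for (int i = 0; i < repeats; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 26, 27, 28, 53}) {
            System.out.println(n + " -> " + upperLetter(n));
        }
    }
}
```

The point is that the label has to stay within A..Z for any position; a position past 26 must never be turned into a single character outside that range.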

Fixed at https://github.com/plutext/docx4j/commi ... 804a59f760 and in https://www.docx4java.org/docx4j/docx4j ... 180314.jar