Problems while converting HTML entities

Posted: Tue Sep 11, 2012 1:20 am
by Empirica
Hi everyone,

There seem to be some issues with HTML entities (e.g. "&nbsp;" or "&ndash;") when converting XHTML to docx objects. The behaviour is easy to reproduce by editing the ConvertInXHTMLFragment sample (simply add one of these HTML entities to the HTML code).

The error message looks like this:

Code:
org.docx4j.org.xhtmlrenderer.exception WARNING:: Unhandled exception. Can't load the XML resource (using TRaX transformer). org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; The entity "ndash" was referenced, but not declared.
Exception in thread "main" org.docx4j.openpackaging.exceptions.Docx4JException: issues at Line 1, Col 20
   at org.docx4j.convert.in.xhtml.XHTMLImporter.convert(XHTMLImporter.java:396)
   at org.docx4j.samples.ConvertInXHTMLFragment.main(ConvertInXHTMLFragment.java:45)
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; The entity "ndash" was referenced, but not declared.
   at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:502)
   at org.docx4j.org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:190)
   at org.docx4j.org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:75)
   at org.docx4j.convert.in.xhtml.XHTMLImporter.convert(XHTMLImporter.java:386)
   ... 1 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; The entity "ndash" was referenced, but not declared.
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
   at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:485)
   ... 4 more



Any help is welcome, thanks!

Re: Problems while converting HTML entities

Posted: Tue Sep 11, 2012 7:51 pm
by jason
Entities like &nbsp;, which aren't built in to XML (only &amp;, &lt;, &gt;, &quot; and &apos; are), need to be handled appropriately.

XHTMLImporter contains various convert methods you can invoke, but most of these (including the one used by ConvertInXHTMLFragment) use https://github.com/plutext/flyingsaucer ... ource.java

Looking at that, the entities defined in https://github.com/plutext/flyingsaucer ... es/schema/ should be available.

Unfortunately, however, they didn't make it into org\docx4j\xhtmlrenderer\1.0.0\xhtmlrenderer-1.0.0.jar.

So a variety of possible workarounds:

- easiest: replace your named entities with the corresponding numeric character references (e.g. &#8211; for &ndash;) before you start; see http://www.ibm.com/developerworks/xml/l ... -entities/

- build the xhtmlrenderer jar with all the resources included

- try using a DTD with an internal subset?

- use the signature public static List<Object> convert(Node node, String baseUrl, WordprocessingMLPackage wordMLPackage) throws Docx4JException which gives you control of the parsing process
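
The last two workarounds can be combined. Below is a minimal sketch, using only the JDK's own parser, that prepends an internal DTD subset declaring the needed entities and parses the fragment into a DOM tree. The class name EntityAwareParse, the helper parseFragment and the choice of declared entities are illustrative assumptions, not part of docx4j.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityAwareParse {

    // Internal DTD subset declaring only the entities we expect to meet.
    // The list is an assumption for this sketch; extend it as needed.
    static final String DOCTYPE =
          "<!DOCTYPE div [\n"
        + "  <!ENTITY nbsp \"&#160;\">\n"
        + "  <!ENTITY ndash \"&#8211;\">\n"
        + "]>\n";

    // Parses a fragment that has a single root element and no XML declaration
    // (an XML declaration would have to come before the DOCTYPE).
    static Document parseFragment(String xhtmlFragment) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setExpandEntityReferences(true); // the default, stated for clarity
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(DOCTYPE + xhtmlFragment)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseFragment("<div><p>2010&ndash;2012&nbsp;report</p></div>");
        // The entity references are now ordinary characters in the DOM text.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The resulting Document (or its root element) could then presumably be handed to the convert(Node node, String baseUrl, WordprocessingMLPackage wordMLPackage) signature quoted above; that call is left out here because it needs docx4j on the classpath.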

Re: Problems while converting HTML entities

Posted: Tue Sep 11, 2012 11:29 pm
by Empirica
Thanks jason,

I think I will go for the 'parsing before transforming' solution and unescape all HTML entities (http://commons.apache.org/lang/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#unescapeHtml%28java.lang.String%29).

Best regards!
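
One caveat with a blanket unescape: as far as I know, StringEscapeUtils.unescapeHtml also turns &lt; and &amp; into raw characters, which could leave the markup ill-formed. A more conservative pre-processing step is sketched below, assuming a small hand-picked entity table: it rewrites named entities as numeric character references and leaves everything else untouched. The class NamedEntityReplacer, the method toNumericReferences and the table contents are hypothetical, not taken from docx4j or Commons Lang.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedEntityReplacer {

    // Illustrative subset only; a complete table would cover every XHTML named entity.
    // The five XML-predefined entities (amp, lt, gt, quot, apos) are left out on purpose.
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("nbsp", 160);
        CODE_POINTS.put("copy", 169);
        CODE_POINTS.put("ndash", 8211);
        CODE_POINTS.put("mdash", 8212);
        CODE_POINTS.put("hellip", 8230);
        CODE_POINTS.put("euro", 8364);
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named entities with numeric references, e.g. &ndash; becomes &#8211;.
    // Unknown names and XML-predefined entities are copied through unchanged.
    static String toNumericReferences(String xhtml) {
        Matcher m = ENTITY.matcher(xhtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement = (codePoint != null) ? "&#" + codePoint + ";" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericReferences("<p>2010&ndash;2012&nbsp;&amp; later</p>"));
    }
}
```

The output of toNumericReferences can be passed to the same XHTMLImporter convert call that ConvertInXHTMLFragment already uses, since numeric references need no DTD.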