Encoding problem

Posted: Sat Feb 27, 2010 1:11 am
by fkfausa
I have the following code:
Code: Select all
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File("c:\\byggestr.docx"));
        MainDocumentPart docPart = wordMLPackage.getMainDocumentPart();
        Document doc = XmlUtils.marshaltoW3CDomDocument(docPart.getJaxbElement());
        Document data = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse("c:\\byggestr.xml");
        modifyTableRows(doc, data);
        docPart.setJaxbElement((org.docx4j.wml.Document)XmlUtils.unmarshal(doc));
        CustomXmlDataStoragePart customXmlDataStoragePart = (CustomXmlDataStoragePart) wordMLPackage.getCustomXmlDataStorageParts().get("{A61EE2AE-F111-4B83-8C9C-F3BECEA1AD12}");
        CustomXmlDataStorage customXmlDataStorage = customXmlDataStoragePart.getData();
        printNode(data, "");
        customXmlDataStorage.setDocument(data);
        wordMLPackage.save(new java.io.File("c:\\fkf.docx"));


The printNode subroutine correctly prints the XML of the data with the correct encoding (ISO-8859-1).

The XML file "c:\byggestr.xml" begins with "<?xml version="1.0" encoding="ISO-8859-1"?>", contains Norwegian special characters, and is saved in ANSI format. However, after "customXmlDataStorage.setDocument(data)" is called and the package is saved, opening "c:\fkf.docx" in Word does not display the Norwegian special characters correctly. Any ideas on how this can be rectified?

Frode

Re: Encoding problem

Posted: Sat Feb 27, 2010 9:26 am
by jason
I'm not clear on exactly what your code is doing.

But it's your method modifyTableRows which is injecting the Norwegian characters into the main document part, right? (I notice you aren't actually calling customXmlDataStorage.applyBindings.) If so, I think you'd need to convert from ISO-8859-1 to UTF-8 in there.

See for example http://stackoverflow.com/questions/6521 ... -8-in-java

Per the spec, each part in the docx file must be in UTF-8 or UTF-16: http://openiso.org/Ecma/376/Part2/8.1.4

Per http://java.sun.com/webservices/docs/1. ... aller.html docx4j is marshalling using UTF-8 encoding
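
As a rough illustration of the ISO-8859-1 to UTF-8 conversion mentioned above, here is a minimal sketch in plain Java. The class and method names are made up for this example, and it assumes Java 7 or later for StandardCharsets; it is not code from docx4j or from the poster's project.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Latin1ToUtf8 {

    // Decode the bytes as ISO-8859-1, then re-encode the same characters as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Norwegian letters ae, o-slash and a-ring as single ISO-8859-1 bytes.
        byte[] latin1 = {(byte) 0xE6, (byte) 0xF8, (byte) 0xE5};
        byte[] utf8 = latin1ToUtf8(latin1);
        // Each of these letters needs two bytes in UTF-8.
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

The important point is that the bytes are decoded with the charset they were really written in; decoding ISO-8859-1 bytes as UTF-8 (or the reverse) is what produces the garbled characters.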

Hope this helps .. Jason

Re: Encoding problem

Posted: Tue Mar 02, 2010 2:01 am
by fkfausa
Thanks Jason!

Solved by making sure UTF-8 was used all over the place.
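
The exact changes are not shown in this thread. Purely as a sketch of one way to get UTF-8 end to end, the following reads XML whose declaration says ISO-8859-1 and writes it back out as UTF-8 using only the standard JAXP classes. The class name, method name and sample element are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper, for illustration only.
class ReencodeXml {

    // Parse XML bytes (the parser honours the encoding in the XML declaration)
    // and serialize the resulting DOM again with UTF-8 as the output encoding.
    static byte[] toUtf8Xml(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Sample input: an ISO-8859-1 document with three Norwegian letters.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><navn>\u00e6\u00f8\u00e5</navn>";
        byte[] utf8 = toUtf8Xml(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Passing the parser a byte stream rather than a Reader or a pre-decoded String lets it pick the charset from the XML declaration, which avoids decoding the file twice.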

The code expands table(s) to allow the correct number of rows to be created, based on the data file, with bindings and all.

The resulting docx is fine, but when I convert this to pdf, the table(s) disappear. Doesn't pdf conversion via XSLFO handle tables at all?
In addition, the CustomXMLPart only affects the produced docx file, not the converted pdf file. All my SDTs are gone. Is this the expected behaviour?
A similar thing happens when converting to html. Only the "original" xml data is shown, not the new data read in from the data file. In this case the correct number of table rows are shown, but not the correct data set. Is this the expected behaviour?

Frode

Re: Encoding problem

Posted: Tue Mar 02, 2010 7:42 am
by jason
fkfausa wrote:The resulting docx is fine, but when I convert this to pdf, the table(s) disappear. Doesn't pdf conversion via XSLFO handle tables at all?


Tables should be fine. If you send me the docx I'll take a look.

If there are problems, it's likely to be some unexpected element such as a bookmark - I recently fixed this in the HTML case, but haven't gotten around to fixing it in the XSL FO case, I don't think.

fkfausa wrote:A similar thing happens when converting to html. Only the "original" xml data is shown, not the new data read in from the data file. In this case the correct number of table rows are shown, but not the correct data set. Is this the expected behaviour?


Are you calling customXmlDataStorage.applyBindings?

cheers .. Jason

Re: Encoding problem

Posted: Tue Mar 02, 2010 11:57 pm
by fkfausa
Hi Jason!

Indeed my document contained bookmarks, but removing them did not solve either problem. Using applyBindings did not help either. If I call applyBindings before the save call, the docx becomes corrupted. Without applyBindings, the html is produced without errors but the pdf fails. With applyBindings, both the html and the pdf fail. I have attached a zip file containing my code (a maven project), the template file byggestrnew.docx, and the data file byggestr.xml. Any pointers on how to handle this would be appreciated.
My goal is to use Word 2007 and SDTs to build templates that use XML data defined within the template. Then, in production, I want to replace that XML with external data, expand any tables that need to be expanded based on the data, and produce pdf, html or text files.

Frode

Re: Encoding problem

Posted: Wed Mar 03, 2010 9:09 am
by jason
There was a small bug in the pdf via xsl fo stylesheet, which I have now fixed:

http://dev.plutext.org/trac/docx4j/changeset/1099

Also, I've made a temporary fix for an NPE in the FO output: http://dev.plutext.org/trac/docx4j/changeset/1100

I'll have a look later to understand why that was happening.

With these changes, your PDF generates. However, each added row contains the same data :-(

Re: Encoding problem

Posted: Wed Mar 03, 2010 7:10 pm
by jason
So, with http://dev.plutext.org/trac/docx4j/changeset/1101 your code works :-)

I'm not clear on how you know which w:r to apply the binding to. (What if there are 2 in the SdtContent?) I had previously thought it had to satisfy "w:rPr/w:rStyle/@w:val='Entry' or w:rPr/w:rStyle/@w:val='PlaceholderText'", but evidently this is incorrect.
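
To make the quoted condition concrete, here is a small sketch that evaluates it with the standard javax.xml.xpath API against a hand-written w:sdt fragment. It only shows which runs that predicate would select; it is not how docx4j picks the run, and as noted above the condition turned out not to be the right rule. The class name, method name and sample fragment are invented for this example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class SdtRunSelector {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Count the w:r elements inside w:sdtContent that satisfy the style predicate.
    static int countStyledRuns(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Map the "w" prefix used in the expression to the WordprocessingML namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return W_NS.equals(uri) ? "w" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return W_NS.equals(uri)
                        ? Collections.singletonList("w").iterator()
                        : Collections.<String>emptyIterator();
            }
        });

        String expr = "//w:sdtContent//w:r["
                + "w:rPr/w:rStyle/@w:val='Entry' or w:rPr/w:rStyle/@w:val='PlaceholderText']";
        NodeList runs = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return runs.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Two runs in the SdtContent: one styled "Entry", one with no run style.
        String xml = "<w:sdt xmlns:w=\"" + W_NS + "\"><w:sdtContent>"
                + "<w:r><w:rPr><w:rStyle w:val=\"Entry\"/></w:rPr><w:t>one</w:t></w:r>"
                + "<w:r><w:t>two</w:t></w:r>"
                + "</w:sdtContent></w:sdt>";
        System.out.println(countStyledRuns(xml)); // prints 1
    }
}
```

With this predicate, a run that has no w:rStyle at all is never selected, which is one way such a condition can miss the run that should receive the bound value.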

If anyone knows the correct behaviour, please enlighten me! Thanks.

Re: Encoding problem

Posted: Thu Mar 04, 2010 8:57 pm
by fkfausa
Thanks Jason for all your help and effort regarding my use of docx4j.

Regards, Frode

Re: Encoding problem

Posted: Fri Mar 05, 2010 10:14 pm
by jason
My pleasure. I'm sorry the patches were necessary to make things work for you, but the custom xml mapping stuff is one of the less commonly used / more advanced parts. So thanks for the test case and helping to make docx4j better :-)

Re: Encoding problem

Posted: Tue Mar 09, 2010 12:29 am
by fkfausa
Hi Jason!

How do I get hold of the changesets you have created during the course of this thread? When I click "nightly build" on your web site (http://dev.plutext.org/trac/docx4j) I get an error, and I can't seem to find anywhere else to download anything built in March. Also, when I download your source code from the trunk and run "mvn -Pjdk16 install", I get errors such as:

[INFO] Compilation failure

<some-path>\src\main\java\org\docx4j\openpackaging\contenttype\ContentTypeManager.java:[70,40] cannot find symbol
symbol : class PresentationMLPackage
location: package org.docx4j.openpackaging.packages

<some-path>\src\main\java\org\docx4j\openpackaging\contenttype\ContentTypeManager.java:[80,52] package org.docx4j.openpackaging.parts.PresentationML does not exist

<some-path>\src\main\java\org\docx4j\jaxb\NamespacePrefixMapperSunInternal.java:[30,90] package com.sun.xml.internal.bind.marshaller does not exist

<some-path>\src\main\java\org\docx4j\jaxb\NamespacePrefixMapperRelationshipsPartSunInternal.java:[23,107] package com.sun.xml.internal.bind.marshaller does not exist

<some-path>\src\main\java\org\docx4j\convert\in\FlatOpcXmlImporter.java:[53,40] cannot find symbol
symbol : class PresentationMLPackage
location: package org.docx4j.openpackaging.packages

<some-path>\src\main\java\org\docx4j\convert\in\FlatOpcXmlImporter.java:[57,52] package org.docx4j.openpackaging.parts.PresentationML does not exist

<some-path>\src\main\java\org\docx4j\openpackaging\contenttype\ContentTypeManager.java:[350,10] cannot find symbol
symbol : variable JaxbPmlPart
location: class org.docx4j.openpackaging.contenttype.ContentTypeManager

<some-path>\src\main\java\org\docx4j\openpackaging\contenttype\ContentTypeManager.java:[700,11] cannot find symbol
symbol : class PresentationMLPackage
location: class org.docx4j.openpackaging.contenttype.ContentTypeManager

<some-path>\src\main\java\org\docx4j\jaxb\NamespacePrefixMapperUtils.java:[29,16] incompatible types
found : org.docx4j.jaxb.NamespacePrefixMapperSunInternal
required: java.lang.Object

<some-path>\src\main\java\org\docx4j\jaxb\NamespacePrefixMapperUtils.java:[50,16] incompatible types
found : org.docx4j.jaxb.NamespacePrefixMapperRelationshipsPartSunInternal
required: java.lang.Object

<some-path>\src\main\java\org\merlin\io\DOMSerializerEngine.java:[111,69] com.sun.org.apache.xerces.internal.util.EncodingMap is Sun proprietary API and may be removed in a future release

<some-path>\src\main\java\org\docx4j\convert\in\FlatOpcXmlImporter.java:[146,24] cannot find symbol
symbol : class PresentationMLPackage
location: class org.docx4j.convert.in.FlatOpcXmlImporter

<some-path>\src\main\java\org\docx4j\convert\in\FlatOpcXmlImporter.java:[437,31] cannot find symbol
symbol : class JaxbPmlPart
location: class org.docx4j.convert.in.FlatOpcXmlImporter

<some-path>\src\main\java\org\docx4j\convert\in\FlatOpcXmlImporter.java:[441,86] package org.pptx4j.jaxb does not exist

Re: Encoding problem

Posted: Tue Mar 09, 2010 1:47 pm
by jason
Use SVN tip (which I think you have already downloaded), and build it with ant, or set it up in your IDE (Eclipse has sonatype's m2eclipse, which makes this easy).

I've only really set up maven to do the dependency management. It's probably not hard to make it do the build, but I haven't gotten around to that. If you'd like to make the necessary changes, I'll incorporate them.

Basically, you need the following source dirs:

Code: Select all
            <src path="src/main/java"/>
            <src path="src/diffx"/>
            <src path="src/pptx4j/java"/>
            <src path="src/svg"/>


And you'll need both the JAXB included in Java 6, *and* the JAXB 2.1.x reference implementation.

Re: Encoding problem

Posted: Sun Mar 21, 2010 1:39 pm
by jason
I've made http://dev.plutext.org/docx4j/docx4j-ni ... 100321.jar
which you can use in conjunction with the other jars in http://dev.plutext.org/docx4j/docx4j-2.3.0/