Character encoding of word docs

Posted: Tue Oct 26, 2010 4:11 am
by dcole
Jason,

Does character encoding need to be taken into account when loading a WordprocessingMLPackage?

I have a bunch of strings saved in my Java program that were extracted from a Word document. I am trying to do some date parsing on some of these strings, and characters such as the en dash are giving me trouble. I can't figure out how to do the conversion correctly: a string that looks like "08/2005 - present" in the Word document looks like "08/2005 (garbage) present" in my Java program.

Any tips on how to handle this?

Re: Character encoding of word docs

Posted: Tue Oct 26, 2010 8:05 am
by jason
http://www.documentinteropinitiative.or ... 1-2.4.aspx says
The document character set shall conform to the Unicode Standard and ISO/IEC 10646-1, with either the UTF-8 or UTF-16 encoding form, as required by the XML 1.0 standard.


http://www.documentinteropinitiative.or ... 8.1.4.aspx says:
All XML content of the parts defined in this Open Packaging specification shall conform to the following validation rules:

1. XML content shall be encoded using either UTF-8 or UTF-16. If any part includes an encoding declaration, as defined in §4.3.3 of the XML 1.0 specification, that declaration shall not name any encoding other than UTF-8 or UTF-16. Package implementers shall enforce this requirement upon creation and retrieval of the XML content.
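
To see which encoding a part actually declares, something like the following might help. It is only a sketch using the JDK's StAX reader, not anything docx4j does itself; the class and method names (PartEncodingCheck, hasAllowedEncoding) are made up for illustration, and it assumes you already have the part's raw bytes (for example document.xml read out of the zip).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PartEncodingCheck {

    /**
     * Returns true if the XML declaration names no encoding at all,
     * or names UTF-8 or UTF-16 (hypothetical helper, not a docx4j API).
     */
    static boolean hasAllowedEncoding(byte[] xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                // Encoding named in the XML declaration; null when none is declared.
                String declared = reader.getCharacterEncodingScheme();
                return declared == null
                        || declared.equalsIgnoreCase("UTF-8")
                        || declared.equalsIgnoreCase("UTF-16");
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Content is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] part = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(hasAllowedEncoding(part));
    }
}
```

This only inspects the declaration; it does not prove the bytes really are in the declared encoding.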


When docx4j marshals, it uses UTF-8, which is the JAXB default: http://download.oracle.com/docs/cd/E178 ... aller.html

So if you are creating w:r/w:t content as UTF-8, things should work.
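
If the "garbage" shows up after the text has left docx4j, one possible cause (an assumption, since I can't see your code) is that UTF-8 bytes are being decoded with a different charset somewhere, for instance when the strings are written to and read back from a file using the platform default. It could also simply be the console failing to display the character. The sketch below uses only the JDK; EnDashDemo, decodeUtf8 and splitRange are illustrative names, not docx4j methods. It shows the wrong-charset symptom and a way to split the range on a hyphen, en dash (U+2013) or em dash (U+2014).

```java
import java.nio.charset.StandardCharsets;

public class EnDashDemo {

    // U+2013 EN DASH, written as an escape so the source file's own
    // encoding cannot corrupt it.
    static final char EN_DASH = '\u2013';

    /** Decodes raw bytes as UTF-8 instead of relying on the platform default charset. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Splits a range such as "08/2005 - present" on a hyphen, en dash, or em dash. */
    static String[] splitRange(String text) {
        return text.trim().split("\\s*[-\\u2013\\u2014]\\s*", 2);
    }

    public static void main(String[] args) {
        String original = "08/2005 " + EN_DASH + " present";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding UTF-8 bytes as ISO-8859-1 turns the en dash into three
        // unrelated characters, which is one way to end up with "garbage".
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        String correct = decodeUtf8(utf8);

        System.out.println("garbled matches original: " + garbled.equals(original));
        System.out.println("correct matches original: " + correct.equals(original));

        String[] parts = splitRange(correct);
        System.out.println("start=" + parts[0] + ", end=" + parts[1]);
    }
}
```

Once the string is decoded correctly, the start part ("08/2005") can go to whatever date parser you are already using.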

cheers .. Jason