docx4j generates invalid paraId and textId

Posted: Wed May 13, 2015 8:28 pm
by markusglue
We have the following scenario:

1) Several docx files are processed by the MailMerger to fill some MergeFields.

2 a) In our first approach, we then combined the content objects of the main document parts to perform a manual concatenation merge.
2 b) In our second approach, we tried the enterprise edition to perform the merge.

Both a) and b) generate invalid hexBinary values for the paraId and textId of a paragraph.
Approach a) additionally duplicated bookmarkStart and bookmarkEnd IDs, which should be unique.
Validation against Word 2010 with the OpenXML Productivity Tool reported the problems above.
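
For the duplicated bookmark IDs in approach a), one possible workaround is to renumber the bookmarks of each source document before concatenating. The following is only a sketch under stated assumptions: it works on the raw XML with the JDK DOM API rather than on docx4j objects, it assumes IDs are unique within each source document, and the class and method names (`BookmarkIds`, `renumber`, `parse`) are made up for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class BookmarkIds {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Gives every bookmarkStart a fresh w:id, counting up from nextId, and
     * rewrites each bookmarkEnd to the new id of its matching start.
     * Returns the next unused id, to be passed in for the following document.
     */
    static int renumber(Document part, int nextId) {
        Map<String, String> oldToNew = new HashMap<>();
        NodeList starts = part.getElementsByTagNameNS(W_NS, "bookmarkStart");
        for (int i = 0; i < starts.getLength(); i++) {
            Element start = (Element) starts.item(i);
            String newId = Integer.toString(nextId++);
            oldToNew.put(start.getAttributeNS(W_NS, "id"), newId);
            start.setAttributeNS(W_NS, "w:id", newId);
        }
        NodeList ends = part.getElementsByTagNameNS(W_NS, "bookmarkEnd");
        for (int i = 0; i < ends.getLength(); i++) {
            Element end = (Element) ends.item(i);
            String newId = oldToNew.get(end.getAttributeNS(W_NS, "id"));
            if (newId != null) { // leave ends without a matching start untouched
                end.setAttributeNS(W_NS, "w:id", newId);
            }
        }
        return nextId;
    }

    /** Parses an XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling `renumber` on each source document in turn, and feeding the returned value into the next call, keeps the IDs unique across the merged result.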

As a result, we are currently not sure how to guarantee a valid, non-corrupt docx file. In both situations we encountered "corrupt" docx files.
Introducing the enterprise trial edition improved the situation, but the watermarks themselves and a numbering still produce the hexBinary problem.

Can the invalid hexBinary IDs corrupt a docx and prevent the document from opening in Word?
Can the behaviour differ depending on whether the docx is opened from the browser or from the file system?

EDIT. removed attachment.

Re: docx4j generates invalid paraId and textId

Posted: Fri May 15, 2015 12:46 pm
by jason
markusglue wrote: Validation against Word 2010 with the OpenXML Productivity Tool reported the problems above.

As a result, we are currently not sure how to guarantee a valid, non-corrupt docx file. In both situations we encountered "corrupt" docx files.
Introducing the enterprise trial edition improved the situation, but the watermarks themselves and a numbering still produce the hexBinary problem.


The relevant spec is [MS-DOCX].

Summarising that:

Values MUST be greater than 0 and less than 0x80000000. Any element having this
attribute MUST also have the textId attribute. See section 2.2.4 for how this attribute integrates
with ISO/IEC-29500-1.
The following W3C XML Schema ([XMLSCHEMA1] section 2.1) fragment specifies the contents of this
attribute.
<xsd:attribute name="paraId" type="w:ST_LongHexNumber"/>


For a useful description of ST_LongHexNumber, see https://msdn.microsoft.com/en-us/librar ... 12%29.aspx

docx4j generates a number greater than 0 and less than 0x80000000, but one that is only 7 hex digits long:

    // For W14, we'll check/set paraId, textId
    if (p.getParaId() == null) {
        // Values MUST be greater than 0 and less than 0x80000000

        String uuid = java.util.UUID.randomUUID().toString();
        // That's 32 digits, but 8'll do nicely
        /*
         * 8 can create a number too large - using 7
         * Bob Fleischman - July 24, 2014
         */

        uuid = uuid.replace("-", "").substring(0, 7);

        p.setParaId(uuid);
        p.setTextId(uuid);
    }
 


So what docx4j generates is less than 0x80000000, but has 7 digits, not 8.
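
To make the two constraints concrete (exactly 8 hex digits, and a value greater than 0 and less than 0x80000000), here is a small checker. It is only an illustrative sketch; `ParaIdCheck` and `isValid` are hypothetical names, not docx4j API.

```java
// Hypothetical helper, not part of docx4j.
class ParaIdCheck {
    /** True if id is exactly 8 hex digits and 0 < value < 0x80000000. */
    static boolean isValid(String id) {
        if (id == null || !id.matches("[0-9A-Fa-f]{8}")) {
            return false; // wrong length or non-hex characters
        }
        long value = Long.parseLong(id, 16);
        return value > 0 && value < 0x80000000L;
    }
}
```

A 7-digit value such as "1A2B3C4" passes the range requirement but fails the length check here, which is the situation described above.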

We've had no reports of any version of Word for which this is a problem, but you're right, it should be 8 digits. A simple fix would be to prepend "1" to the generated number, and this fix could be in docx4j v3.2.2.
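
As a rough sketch of what that could look like (not the actual docx4j change; `ParaIdGenerator` and its methods are hypothetical names), the first method prepends "1" to the 7 UUID digits, and the second shows an alternative that draws a random int in range and zero-pads it to 8 hex digits:

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of docx4j.
class ParaIdGenerator {
    /** The proposed fix: 7 hex digits from a UUID with "1" prepended. */
    static String fromUuid() {
        String seven = UUID.randomUUID().toString().replace("-", "").substring(0, 7);
        // Result lies between 0x10000000 and 0x1FFFFFFF, so it is always in range.
        return "1" + seven;
    }

    /** Alternative: a random int in the allowed range, zero-padded to 8 hex digits. */
    static String random() {
        // nextInt's upper bound is exclusive, so this yields 1 .. 0x7FFFFFFE.
        int value = ThreadLocalRandom.current().nextInt(1, Integer.MAX_VALUE);
        return String.format("%08X", value);
    }
}
```

Prepending "1" uses only a fraction of the allowed range, but it keeps the change to the existing code minimal.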

Thoughts?

Re: docx4j generates invalid paraId and textId

Posted: Fri May 15, 2015 11:29 pm
by markusglue
Thanks for your clarifications, Jason.

It's good to know those IDs should not cause a corrupt file. Docx files with 7-digit paraIds and textIds now open in Word 2010 without a corruption message.

Indeed, we very probably had another issue, related to caching while downloading the docx file, which somehow even generated an invalid zip file...

Regards
Markus

Re: docx4j generates invalid paraId and textId

Posted: Mon May 25, 2015 11:23 am
by jason