
Corrupt word doc - perhaps recent word update?

Posted: Thu Feb 20, 2020 8:39 am
by stewmorg
Hi there,

I'm seeing an increasing number of cases where files from some versions of Word/Windows, when processed through docx4j (code sample below), come out corrupt. I've attached the main document.xml, which contains lots of extra namespaces. I think these could be causing the problem, but I'm not 100% sure.

To replicate this, all I need to do is something like:

Code:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(doc)
HashMap<PartName, Part> parts = wordMLPackage.parts.parts

parts.each { PartName name, Part part ->
    // Only the main document part is processed; the header/footer check is commented out.
    boolean process = part instanceof MainDocumentPart // || part instanceof HeaderPart || part instanceof FooterPart
    if (process) {
        processPart((ContentAccessor) part, placeholders)
    }
}

try {
    wordMLPackage.save(doc)
} catch (Docx4JException e) {
    // The catch block was not shown in the original post; rethrowing is a placeholder.
    throw e
}

private void processPart(ContentAccessor documentPart, List<DocuSignPlaceHolder> placeholders)
{
    // Collect every Text element in the part.
    ClassFinder finder = new ClassFinder(Text)
    new TraversalUtil(documentPart.getContent(), finder)
    // other code removed for simplicity.
}


I'm seeing this on the latest version 8.1.4 and it only started appearing for us in the last week or so.

Re: Corrupt word doc - perhaps recent word update?

Posted: Thu Feb 20, 2020 10:32 pm
by Lightenix
Hi,

I am having the same issue. As I found out, the problem is with two new namespaces declared at the top of document.xml, called w16cex and w16.
Steps to reproduce the problem:
1) A document (EmptyDoc_w2007.docx) was created with Word 2007 and saved.
This document did not have the namespaces w16cex and w16.
2) The document was reopened with Word 2019, modified (pressed space and backspace) and saved as EmptyDoc_w2019.docx.
Result: new namespace declarations appeared in the <w:document> tag:

xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex"
xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml"
w16cex and w16 were also added to mc:Ignorable="w14 w15 w16se w16cid w16 w16cex wp14"

So far both documents are valid.
3) The document was then loaded and saved again with WordprocessingMLPackage:

File file = new File("c:\\1\\EmptyDoc_w2019.docx");
File file2 = new File("c:\\1\\EmptyDoc_w2019_modified.docx");
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
List<SectionWrapper> sections = wordMLPackage.getDocumentModel().getSections();
wordMLPackage.save(file2);
if (true) {
return;
}
The output document is no longer valid: mc:Ignorable still lists w16 and w16cex, but those namespaces are no longer declared.
If I open document.xml and remove w16 and w16cex from mc:Ignorable, the error goes away.
It seems that docx4j drops those namespace declarations when it saves, but leaves the prefixes in mc:Ignorable.
When docx4j does not output a namespace, would it be possible to remove its prefix from mc:Ignorable as well? Otherwise it seems that whenever Microsoft adds a new namespace, docx4j will produce an invalid document until it recognizes that namespace.
The same problem appeared once before, when w16cid was added: https://www.docx4java.org/forums/docx-java-f6/corruption-issue-using-getdocumentmodel-getsections-t2627.html
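
To make the check concrete, here is a rough sketch of how one could detect the mismatch in a saved document.xml. It uses only the JDK's DOM parser, not docx4j, and the class and method names (IgnorableCheck, undeclaredIgnorablePrefixes, declaredIgnorableOnly) are made up for illustration. It assumes mc:Ignorable sits on the root element, as in the documents above. It only reports the problem; it does not rewrite the file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
final class IgnorableCheck {
    static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Parses the XML with namespace awareness and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Prefixes listed in mc:Ignorable that have no xmlns declaration on the root.
    static List<String> undeclaredIgnorablePrefixes(String xml) {
        Element root = parseRoot(xml);
        List<String> missing = new ArrayList<>();
        for (String prefix : root.getAttributeNS(MC_NS, "Ignorable").trim().split("\\s+")) {
            if (!prefix.isEmpty() && root.lookupNamespaceURI(prefix) == null) {
                missing.add(prefix);
            }
        }
        return missing;
    }

    // The mc:Ignorable value with every undeclared prefix left out.
    static String declaredIgnorableOnly(String xml) {
        Element root = parseRoot(xml);
        List<String> kept = new ArrayList<>();
        for (String prefix : root.getAttributeNS(MC_NS, "Ignorable").trim().split("\\s+")) {
            if (!prefix.isEmpty() && root.lookupNamespaceURI(prefix) != null) {
                kept.add(prefix);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Cut-down root element: w14 is declared, w16 and w16cex are not.
        String xml = "<w:document"
                + " xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
                + " xmlns:mc=\"http://schemas.openxmlformats.org/markup-compatibility/2006\""
                + " xmlns:w14=\"http://schemas.microsoft.com/office/word/2010/wordml\""
                + " mc:Ignorable=\"w14 w16 w16cex\"/>";
        System.out.println(undeclaredIgnorablePrefixes(xml)); // prints [w16, w16cex]
        System.out.println(declaredIgnorableOnly(xml));       // prints w14
    }
}
```

Running something like this over the document.xml extracted from the saved .docx should list w16 and w16cex for the broken output, and an empty list for the file as saved by Word 2019.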

Best regards

Re: Corrupt word doc - perhaps recent word update?

Posted: Fri Feb 21, 2020 6:35 am
by jason