
compare 2 docx files WITHOUT merge them

Posted: Fri Apr 20, 2012 1:58 am
by MaxiSbr
Hello all,

I'm trying to compare two docx files. I need to compare their content and everything else; basically, to determine whether the two files are equal or not.
I found several examples that show how to compare and merge two docx files into a new one. This doesn't work for me. I need, for example, a list of differences or simply a boolean that tells me whether both documents are equal. This is because I'm creating JUnit tests, and a new document with the differences highlighted is useless to me.

Could someone help me with this? Is there a way to do this?

Thanks in advance,
Maxi

Re: compare 2 docx files WITHOUT merge them

Posted: Fri Apr 20, 2012 8:22 am
by jason
Try org.docx4j.openpackaging.parts.relationships.AlteredParts
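For readers who only want the underlying idea: a docx is a ZIP package, so a rough list of differing parts can be built with the JDK alone. The following is a stdlib-only sketch and not the AlteredParts code; the class and method names (PackagePartDiffSketch, readParts, diff) are made up for illustration, and it compares raw entry bytes, which is probably stricter than what docx4j does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the ZIP entries that differ between two packages.
final class PackagePartDiffSketch {

    // Reads every non-directory entry of a ZIP package into a name -> bytes map.
    static Map<String, byte[]> readParts(byte[] zipBytes) {
        Map<String, byte[]> parts = new TreeMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                parts.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    // Returns one line per difference, sorted by entry name; an empty list means
    // both packages hold the same entries with byte-identical content.
    static List<String> diff(byte[] zipA, byte[] zipB) {
        Map<String, byte[]> a = readParts(zipA);
        Map<String, byte[]> b = readParts(zipB);
        TreeSet<String> names = new TreeSet<>(a.keySet());
        names.addAll(b.keySet());
        List<String> differences = new ArrayList<>();
        for (String name : names) {
            if (!b.containsKey(name)) {
                differences.add("deleted: " + name);
            } else if (!a.containsKey(name)) {
                differences.add("added: " + name);
            } else if (!Arrays.equals(a.get(name), b.get(name))) {
                differences.add("modified: " + name);
            }
        }
        return differences;
    }
}
```

In a JUnit test the two byte arrays could come from Files.readAllBytes on the expected and the generated file, and an empty result from diff would stand in for the "equal or not" boolean.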

Re: compare 2 docx files WITHOUT merge them

Posted: Fri Apr 20, 2012 8:33 am
by MaxiSbr
Many thanks Jason. I will research how to use it. I'd appreciate it if you could add an example.

Thanks!

Re: compare 2 docx files WITHOUT merge them

Posted: Sat Apr 21, 2012 1:18 am
by MaxiSbr
Hi Jason,

I was able to implement my logic according to your advice. It's just what I needed.
But now I have another problem. The situation is this:

I have a certified document that will be used to compare against documents generated by another process. That process must always generate the same file from a given source.
BUT... some attribute values are "random" and don't affect the document. The specific example is the w:id attribute of the w:bookmarkStart and w:bookmarkEnd tags. :(

So, I need some mechanism to indicate which tags or attributes should be excluded from the comparison process.
Is there any way to do this using the AlteredParts class?

Thanks a lot,
Maxi

Re: compare 2 docx files WITHOUT merge them

Posted: Tue Apr 24, 2012 5:10 am
by jason
Note that the AlteredParts stuff relies on the method isContentEqual in JaxbXmlPart<E>. That method marshalls the parts being compared to ByteArrayOutputStreams, and compares those.

(So note, the comparison will yield a false negative if JAXB marshalls attributes in different order.)
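To illustrate the attribute-order caveat without docx4j: the sketch below compares two XML strings once as raw bytes and once as DOM trees, where attributes are unordered. It uses only the JDK's DOM parser; the class name AttributeOrderSketch and its methods are invented for this example and are not part of docx4j.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper contrasting a byte-level and an XML-aware comparison.
final class AttributeOrderSketch {

    // Byte-level comparison: any change in attribute order makes it fail.
    static boolean bytesEqual(String xmlA, String xmlB) {
        return Arrays.equals(xmlA.getBytes(StandardCharsets.UTF_8),
                             xmlB.getBytes(StandardCharsets.UTF_8));
    }

    // DOM comparison: isEqualNode ignores the order in which attributes appear.
    static boolean domEqual(String xmlA, String xmlB) {
        return parse(xmlA).getDocumentElement()
                .isEqualNode(parse(xmlB).getDocumentElement());
    }

    // Parses a string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

For two w:bookmarkStart elements that differ only in whether w:id or w:name comes first, bytesEqual reports a difference while domEqual treats them as the same element.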

You could replace that method with something that does an XML-aware diff (maybe xmlunit.sourceforge.net?). Before doing the XML-aware diff, you could pre-process to remove the attributes you don't care about.
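A minimal sketch of that pre-processing idea, using the JDK's DOM API in place of xmlunit: strip w:id from w:bookmarkStart and w:bookmarkEnd in both trees, then compare what is left. The class IgnoreAttributesSketch and its methods are hypothetical, nothing here is docx4j API, and the hard-coded element list is only an assumption about what should be ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: XML-aware comparison that ignores bookmark ids.
final class IgnoreAttributesSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // True if both XML strings are equal once bookmark w:id values are removed.
    static boolean equalIgnoringBookmarkIds(String xmlA, String xmlB) {
        Document a = parse(xmlA);
        Document b = parse(xmlB);
        stripBookmarkIds(a);
        stripBookmarkIds(b);
        return a.getDocumentElement().isEqualNode(b.getDocumentElement());
    }

    // Removes w:id from every w:bookmarkStart and w:bookmarkEnd element.
    static void stripBookmarkIds(Document doc) {
        for (String tag : new String[] {"bookmarkStart", "bookmarkEnd"}) {
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                ((Element) nodes.item(i)).removeAttributeNS(W_NS, "id");
            }
        }
    }

    // Parses a string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

One trade-off to keep in mind: w:id is what pairs a w:bookmarkEnd with its w:bookmarkStart, so dropping it also hides a change in how bookmarks are paired. If that matters, renumbering the ids in document order before comparing might be a better fit than removing them.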

This is probably something which is of interest to other docx4j users, so please consider contributing back your implementation. thanks!

Re: compare 2 docx files WITHOUT merge them

Posted: Tue Apr 24, 2012 5:46 am
by MaxiSbr
Thanks Jason for your advice. I will check this and see how it could be improved. With luck, I'll share the results with everyone.

Re: compare 2 docx files WITHOUT merge them

Posted: Tue Apr 24, 2012 2:05 pm
by jason
If it's going to become part of docx4j, it would be preferable, if it makes sense, to adapt/extend src/diffx rather than introduce another jar (e.g. xmlunit).