compare and merge 2 ms word docx or doc.

PostPosted: Mon May 18, 2009 12:00 pm
by bihag
I want to compare 2 MS Word documents and merge them into a temp document that shows the merge result, like the compare functionality in Office 2007, where you give it 2 different documents and it shows a comparison of both.

Please send me sample code or the classes I have to use to fulfill this requirement.


appreciate your help ...

Re: compare and merge 2 ms word docx or doc.

PostPosted: Mon May 18, 2009 6:08 pm
by jason
ok first, docx4j deals mainly with docx documents, so if one of the documents you want to compare is an old binary .doc, you will need to convert it to docx first. There is proof of concept code for doing this in org.docx4j.convert.in, but that code is far from fully featured. Your best bet for doing this might be the b2xtranslator project (on sourceforge). That is C# ... it'd be good to port it to Java.

So now, assuming you are comparing 2 docx, we have code for doing a compare of 2 paragraphs (or 2 sdt's) in the org.docx4j.diff package. This works pretty well, and produces a result which has tracked changes.

Whilst you could try using the underlying library to compare two main document parts, ymmv. I think you would be better off using LCS (longest common subsequence) or something similar to find which paragraphs correspond, and then using org.docx4j.diff on those. The source of org.eclipse.compare has been sitting (unused afaik) in the docx4j source tree for a while now; you might try starting with that.
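
To make the LCS suggestion concrete, here is a minimal sketch that lines up two lists of paragraphs. It is not docx4j code: the class name ParagraphAligner and its align method are hypothetical, and it assumes each paragraph has already been reduced to a plain string (its text, say) so that equality is a simple equals() call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of docx4j): aligns two paragraph lists with LCS. */
final class ParagraphAligner {

    /**
     * Returns index pairs {left, right}. A value of -1 means the paragraph
     * on the other side has no counterpart (it was deleted or inserted).
     */
    static List<int[]> align(List<String> left, List<String> right) {
        int n = left.size();
        int m = right.size();

        // lcs[i][j] = length of the LCS of left[i..] and right[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = left.get(i).equals(right.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the start, emitting matches and gaps.
        List<int[]> pairs = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (left.get(i).equals(right.get(j))) {
                pairs.add(new int[] {i++, j++});      // unchanged paragraph
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                pairs.add(new int[] {i++, -1});       // only in the left document
            } else {
                pairs.add(new int[] {-1, j++});       // only in the right document
            }
        }
        while (i < n) pairs.add(new int[] {i++, -1});
        while (j < m) pairs.add(new int[] {-1, j++});
        return pairs;
    }
}
```

With this sketch an edited paragraph shows up as a left-only entry next to a right-only entry. Such neighbouring pairs are the likely candidates to hand to a paragraph-level compare, while exact matches can be copied through unchanged.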

Finally, to produce a valid resulting document, you need to ensure ids point to the correct relationship (eg for images, hyperlinks), and styles are defined etc. A good explanation of what you need to do can be found at
http://blogs.msdn.com/ericwhite/archive ... l-sdk.aspx

cheers .. Jason

Re: compare and merge 2 ms word docx or doc.

PostPosted: Tue May 19, 2009 5:12 am
by bihag
appreciate your help ...

Re: compare and merge 2 ms word docx or doc.

PostPosted: Wed May 20, 2009 1:32 pm
by bihag
thanks jason,

As per your suggestion I tried comparing docx files with docx4j's org.docx4j.diff.ParagraphDifferencer class, but it compares the Word XML.

What I did: I checked out the docx4j project from svn, configured it on my system, and ran the ParagraphDifferencer class. It gives its output as XML.

Please guide me on how to compare two Word docx documents with the API.

I want to just send two documents and get a compared/merged document as output...

Please send me some sample code or a Java class where I can compare/merge 2 docs with the docx4j API.

The main thing is that our web application runs on a Linux server. In that case I think the Open XML SDK (Microsoft) won't work ... but I am not sure about Eclipse's compare API. Will the eclipse.compare API work on Linux?

Thanks a lot in advance for your help.

Re: compare and merge 2 ms word docx or doc.

PostPosted: Wed May 20, 2009 5:58 pm
by jason
As per your suggestion I tried comparing docx files with docx4j's org.docx4j.diff.ParagraphDifferencer class, but it compares the Word XML.

What I did: I checked out the docx4j project from svn, configured it on my system, and ran the ParagraphDifferencer class. It gives its output as XML.


Well, yes, that's the whole point of that class.

If you don't want to compare the actual documents (ie their underlying xml), and instead, would be happy with a crude comparison of their text content, you could extract the text first and then perform LCS on that. But your result in this case will be a series of unformatted paragraphs.
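
For the crude text-only route, a sketch along these lines pulls the paragraph text out of a docx using nothing but the JDK. It deliberately does not use docx4j, and the class DocxTextExtractor is hypothetical. It assumes the main document part lives at word/document.xml (the usual location, though strictly it is determined by the package relationships), and it simply concatenates the w:t runs of every w:p, so formatting, tabs and line breaks are lost and paragraphs nested inside text boxes would be counted more than once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical JDK-only helper: crude paragraph text extraction from a docx. */
final class DocxTextExtractor {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads the docx (a zip) and returns the text of each paragraph. */
    static List<String> paragraphs(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main document part is at the conventional path.
                if (entry.getName().equals("word/document.xml")) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return paragraphsFromXml(buf.toByteArray());
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    /** Returns the concatenated w:t text of every w:p in the given XML. */
    static List<String> paragraphsFromXml(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        List<String> out = new ArrayList<>();
        NodeList paras = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            NodeList texts = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < texts.getLength(); k++) {
                sb.append(texts.item(k).getTextContent());
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

The two resulting lists of strings can then be run through LCS. As noted above, whatever you build from that is a series of unformatted paragraphs, not a tracked-changes document.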

Perhaps you could clarify your requirements?

Re: compare and merge 2 ms word docx or doc.

PostPosted: Thu May 21, 2009 5:28 am
by bihag
Hi,

Comparing the doc XML is OK for our requirement, but then I have to create a docx from the newly generated XML ...

The problem is that we are using a Linux server, where Office won't be available ... is there any class available that will create a document from the generated XML?

My requirement is ...

We have a web application running on a Linux server, where we will have many documents stored by date. If a client wants to compare a newer document with an older one, we have to handle it programmatically, from Java code.
Program: the documents are available on the server side; we have to compare 2 documents and generate a new document that shows the differences to the client.

Re: compare and merge 2 ms word docx or doc.

PostPosted: Thu May 21, 2009 6:21 am
by jason
The XML produced by the diff is docx WordprocessingML.

So you can use docx4j to insert it into an existing docx, or use docx4j to create a new one. See the samples for basic examples of how to use docx4j.

Bear in mind though my earlier remark about styles, relationship id's etc, which will likely come into play depending on your 2 source documents.

Re: compare and merge 2 ms word docx or doc.

PostPosted: Thu May 21, 2009 7:49 am
by bihag
thanks a lot ...

but will it work if the server is Linux and Office 2007 is not installed on the server?

Sorry for asking basic questions, I am new to Java development ...

Re: compare and merge 2 ms word docx or doc.

PostPosted: Thu May 21, 2009 5:20 pm
by jason
but will it work if the server is Linux and Office 2007 is not installed on the server?


Yes, sure, absolutely :-)

Re: compare and merge 2 ms word docx or doc.

PostPosted: Fri May 22, 2009 5:37 am
by bihag
Hi,

Thanks ...

I am able to convert XML to docx ... but when I try to get the XML from a docx, it throws a ClassCastException.

here is my code

String inputfilepath = "d:/compareDoc/new4.docx";
// Open a document from the file system
// 1. Load the Package
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

// 2. Fetch the document part
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

// This line is throwing ClassCastException: org.docx4j.wml.P cannot be cast to org.docx4j.wml.Document
org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document)documentPart.getJaxbElement();

The last line throws ClassCastException: org.docx4j.wml.P cannot be cast to org.docx4j.wml.Document

If the document is available in the project path it works fine, but when I give the full path it throws the exception.

Please provide me with a solution, or any other code where I can get the XML from a docx file.


Thank you ... if I get the XML, then with the org.docx4j.diff.ParagraphDifferencer class I can find the diff and generate a new document.


-------------- And

For some documents I get the full XML from the code given above, but when I compare the generated XML (from those docx files) an exception is thrown. At that point I am using only the ParagraphDifferencer class.

My code is ...

String BASE_DIR = "D://compareDoc//";

String paraL = BASE_DIR + "new3.xml";
String paraR = BASE_DIR + "new4.xml";
P pl = loadParagraph(paraL);
P pr = loadParagraph(paraR);

protected static org.docx4j.wml.P loadParagraph(String filename) throws Exception {

    java.io.File f = new java.io.File(filename);
    java.io.InputStream is = new java.io.FileInputStream(f);
    JAXBContext jc = org.docx4j.jaxb.Context.jc;

    Unmarshaller u = jc.createUnmarshaller();

    //u.setSchema(org.docx4j.jaxb.WmlSchema.schema);
    u.setEventHandler(new org.docx4j.jaxb.JaxbValidationEventHandler());

    return (org.docx4j.wml.P)u.unmarshal( is ); // Here it's throwing exception
}

The stack trace for the exception is ...
java.lang.ClassCastException: org.docx4j.wml.Document cannot be cast to org.docx4j.wml.P
at org.docx4j.diff.ParagraphDifferencer.loadParagraph(ParagraphDifferencer.java:724)

Re: compare and merge 2 ms word docx or doc.

PostPosted: Fri May 22, 2009 5:51 pm
by jason
Bihag, I'm not really in a position to help with what really are elementary questions about using Java (rather than issues with docx4j).

But ...

If the document is available in the project path it works fine, but when I give the full path it throws the exception.

Please provide me with a solution, or any other code where I can get the XML from a docx file.


You need to look up how to use File.

java.lang.ClassCastException: org.docx4j.wml.Document cannot be cast to org.docx4j.wml.P
at org.docx4j.diff.ParagraphDifferencer.loadParagraph(ParagraphDifferencer.java:724)


You need to feed that method org.docx4j.wml.P, not org.docx4j.wml.Document.

As I said before, the differencer accepts single paragraphs or sdts, not entire main document parts.

Re: compare and merge 2 ms word docx or doc.

PostPosted: Mon May 25, 2009 5:35 am
by bihag
ok thanks a lot ... will try to find out solution ...

Re: compare and merge 2 ms word docx or doc.

PostPosted: Sun Jun 21, 2009 7:52 am
by jason
There is now code to make it easy to compare 2 documents (ie w:document/w:body). Previously, what you got out of the box was just the ability to compare 2 paragraphs, or 2 content controls.

The following sample shows how to use it : org/docx4j/samples/CompareDocuments.java

The sample compares the 2 input docx, and displays the result as a pdf.

The org.docx4j.diff package uses code which is in src/diffx, so you'll need to add that. I've put this in a separate directory tree, since that code is by and large 3rd party code. It has no dependencies on docx4j, and is useful on its own (for example, I also use it in a .NET application courtesy of IKVM).