Plutext

Posted: **Tue Sep 06, 2011 6:10 pm**

When I compare two documents of around 10 KB each, I get the error below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.topologi.diffx.algorithm.MatrixInt.setup(MatrixInt.java:123)
at com.topologi.diffx.algorithm.DiffXFitopsy.length(DiffXFitopsy.java:188)
at com.topologi.diffx.algorithm.DiffXFitopsy.process(DiffXFitopsy.java:238)
at com.topologi.diffx.Main.diff(Main.java:323)
at com.topologi.diffx.Main.diff(Main.java:310)
at com.topologi.diffx.Main.diff(Main.java:228)
at com.topologi.diffx.Docx4jDriver.diff(Docx4jDriver.java:170)
at org.docx4j.diff.Differencer.diffWorker(Differencer.java:320)
at org.docx4j.diff.Differencer.diff(Differencer.java:298)
at CompareDocuments.main(CompareDocuments.java:117)

Sample Docx file is attached below.

102.docx (8.65 KiB)

Please help me resolve this issue. I increased the JVM memory to 1400, but it still fails.

Posted: **Wed Sep 07, 2011 2:25 am**

Works fine for me when I run the CompareDocuments sample on less contrived data, i.e. something other than a single 14-page paragraph. See the attachments for the two input documents I produced from yours, and the output PDF.

See the notes in the Docx4jDriver source for limitations.
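To give a feel for why one huge paragraph is a problem: the stack trace ends in `MatrixInt.setup`, which suggests the algorithm allocates a matrix before it starts comparing. The sketch below is only a back-of-the-envelope estimate, not diffx code. It assumes a dense `int` matrix with one cell per pair of events from the two inputs (plus one extra row and column), and the event counts are made up for illustration.

```java
/** Back-of-the-envelope heap estimate; a hypothetical helper, not diffx code. */
final class DiffMatrixEstimate {

    private DiffMatrixEstimate() {}

    /**
     * Bytes needed for a dense int matrix with (eventsA + 1) x (eventsB + 1) cells,
     * ignoring array headers. The matrix shape is an assumption, not read from diffx.
     */
    static long estimateBytes(long eventsA, long eventsB) {
        return (eventsA + 1) * (eventsB + 1) * Integer.BYTES;
    }

    public static void main(String[] args) {
        long[] sizes = {1_000, 5_000, 20_000}; // hypothetical event counts
        for (long n : sizes) {
            double mib = estimateBytes(n, n) / (1024.0 * 1024.0);
            System.out.printf("%,d x %,d events -> about %.0f MiB%n", n, n, mib);
        }
    }
}
```

Under that assumption the cost grows with the product of the two input sizes, so comparing many small paragraphs is far cheaper than comparing one very large one.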

Posted: **Wed Sep 07, 2011 2:57 am**

Thanks Jason,

Where can I find the limitations info? Can you please share it?
Thanks!

Posted: **Wed Sep 07, 2011 8:08 pm**

Hi Jason

I created the docx files using the code below, and this time the paragraphs were created properly.

Code:

    String str = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" ><w:r><w:t>Text Here.....</w:t></w:r></w:p>";
    wordMLPackage.getMainDocumentPart().addObject(org.docx4j.XmlUtils.unmarshalString(str));

But the comparison still gave the same error as before.
I have attached the two docx files I generated, along with their XML.

Please help me with this, Jason.

docxandxmlfiles.zip (29.08 KiB)

Posted: **Wed Sep 07, 2011 11:29 pm**

As per the comments in the Docx4jDriver class, there are heuristics to make the diff problem more tractable.

One of these is to do a paragraph-level diff first, using something like:

    // 1. Get the body of each document
    Body newerBody = ((Document) newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
    Body olderBody = ((Document) olderPackage.getMainDocumentPart().getJaxbElement()).getBody();

    // 2. Do the differencing
    java.io.StringWriter sw = new java.io.StringWriter();
    Docx4jDriver.diff(
            XmlUtils.marshaltoW3CDomDocument(newerBody).getDocumentElement(),
            XmlUtils.marshaltoW3CDomDocument(olderBody).getDocumentElement(),
            sw);

    // 3. Get the result
    String contentStr = sw.toString();
    System.out.println("Result: \n\n " + contentStr);
    Body newBody = (Body) org.docx4j.XmlUtils.unmarshalString(contentStr);

However, in your case, this doesn't help, since every paragraph is actually different (in the trailing whitespace).

So either you need to make the trailing whitespace the same, or alter the eclipse.compare coarse-grained divide-and-conquer algorithm to ignore it.
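For the first option, here is a minimal sketch of stripping trailing whitespace from paragraph text before the documents are built. It works on plain strings only; `TrailingWhitespace` and its methods are hypothetical helpers for illustration, not part of docx4j, and the sample paragraphs are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of docx4j: strips trailing whitespace from paragraph text. */
final class TrailingWhitespace {

    private TrailingWhitespace() {}

    /** Removes spaces, tabs and line breaks from the end of one paragraph's text. */
    static String strip(String text) {
        int end = text.length();
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    /** Applies strip() to every paragraph, preserving order. */
    static List<String> stripAll(List<String> paragraphs) {
        List<String> result = new ArrayList<>(paragraphs.size());
        for (String p : paragraphs) {
            result.add(strip(p));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up paragraphs that differ only in trailing whitespace.
        List<String> older = stripAll(List.of("First line.  ", "Second line.\t"));
        List<String> newer = stripAll(List.of("First line.", "Second line.   "));
        System.out.println(older.equals(newer)); // prints "true"
    }
}
```

Once both documents are built from text normalized this way, paragraphs that differ only in trailing whitespace should look identical to the paragraph-level pass.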

cheers .. Jason

Posted: **Thu Sep 08, 2011 1:50 am**

Hi Jason

Thanks for your reply, but in my case the content for the docx file is read from an HTML textarea. In my Java code I read this text, split it at each \n, and create the paragraph text using

Code:

    String str = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" ><w:r><w:t>Text Here.....</w:t></w:r></w:p>";
    wordMLPackage.getMainDocumentPart().addObject(org.docx4j.XmlUtils.unmarshalString(str));

The diff seems to work when the docx file has up to 300 lines; above that it crashes with the error.
Can you suggest how I can create the docx file from the textarea contents, please?
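One possible way to turn textarea text into paragraph XML is sketched below. It splits on line breaks, strips trailing whitespace from each line, escapes XML special characters, and wraps each line in the same `<w:p>` markup used above. The class and method names are hypothetical, and the sketch stops at producing strings, so it does not call docx4j itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds one w:p XML string per textarea line. */
final class TextAreaParagraphs {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private TextAreaParagraphs() {}

    /** Escapes the characters that are not allowed literally in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Drops whitespace at the end of a line so equal lines produce equal paragraphs. */
    static String stripTrailing(String s) {
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    /** Splits on \n or \r\n (browsers submit textarea line breaks as \r\n). */
    static List<String> toParagraphXml(String textAreaValue) {
        List<String> paragraphs = new ArrayList<>();
        for (String line : textAreaValue.split("\\r?\\n")) {
            String text = escapeXml(stripTrailing(line));
            paragraphs.add("<w:p xmlns:w=\"" + W_NS + "\"><w:r><w:t>"
                    + text + "</w:t></w:r></w:p>");
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        for (String p : toParagraphXml("First line   \r\nSecond & last line\t")) {
            System.out.println(p);
        }
    }
}
```

Each resulting string could then be passed to `org.docx4j.XmlUtils.unmarshalString` and added with `addObject`, as in the snippet above.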

Posted: **Mon Sep 12, 2011 5:43 pm**

Hi Jason

Can you please help me? I tried what I described in my previous reply.

Posted: **Mon Sep 12, 2011 8:40 pm**

I have already told you what to do: preprocess the paragraphs to make the trailing whitespace the same.

diffx - OutOfMemory
