
diffx - OutOfMemory

Posted: Tue Sep 06, 2011 6:10 pm
by suncity65
When I compare two documents of around 10 KB each, I get the error below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.topologi.diffx.algorithm.MatrixInt.setup(MatrixInt.java:123)
at com.topologi.diffx.algorithm.DiffXFitopsy.length(DiffXFitopsy.java:188)
at com.topologi.diffx.algorithm.DiffXFitopsy.process(DiffXFitopsy.java:238)
at com.topologi.diffx.Main.diff(Main.java:323)
at com.topologi.diffx.Main.diff(Main.java:310)
at com.topologi.diffx.Main.diff(Main.java:228)
at com.topologi.diffx.Docx4jDriver.diff(Docx4jDriver.java:170)
at org.docx4j.diff.Differencer.diffWorker(Differencer.java:320)
at org.docx4j.diff.Differencer.diff(Differencer.java:298)
at CompareDocuments.main(CompareDocuments.java:117)

Sample Docx file is attached below.
102.docx
(8.65 KiB) Downloaded 299 times


Please help me resolve this issue. I increased the JVM memory to 1400, but the error persists.

Re: diffx - OutOfMemory

Posted: Wed Sep 07, 2011 2:25 am
by jason
It works fine for me when I run the CompareDocuments sample on less contrived data, i.e. something other than a single 14-page paragraph. See the attachments for the two input documents I produced from yours, and the output PDF.

See the notes in the Docx4jDriver source for limitations.

Re: diffx - OutOfMemory

Posted: Wed Sep 07, 2011 2:57 am
by suncity65
Thanks, Jason.

Where can I find the information on the limitations? Could you please share it?
Thanks!

Re: diffx - OutOfMemory

Posted: Wed Sep 07, 2011 8:08 pm
by suncity65
Hi Jason,

I created the docx files using the code below, and this time the paragraphs were created properly.
Code:
String str = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" ><w:r><w:t>Text Here.....</w:t></w:r></w:p>";
wordMLPackage.getMainDocumentPart().addObject(org.docx4j.XmlUtils.unmarshalString(str));


However, the comparison still gave the same error as before.
I have attached the two docx files I generated, along with their XML.

Please help me with this, Jason.

docxandxmlfiles.zip
(29.08 KiB) Downloaded 259 times

Re: diffx - OutOfMemory

Posted: Wed Sep 07, 2011 11:29 pm
by jason
As per the comments in the Docx4jDriver class, there are heuristics to make the diff problem more tractable.

One of these is to do a paragraph level diff first, using something like:

// 1. Get the body of each document
Body newerBody = ((Document) newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
Body olderBody = ((Document) olderPackage.getMainDocumentPart().getJaxbElement()).getBody();

// 2. Do the differencing
java.io.StringWriter sw = new java.io.StringWriter();
Docx4jDriver.diff(XmlUtils.marshaltoW3CDomDocument(newerBody).getDocumentElement(),
        XmlUtils.marshaltoW3CDomDocument(olderBody).getDocumentElement(),
        sw);

// 3. Get the result
String contentStr = sw.toString();
System.out.println("Result: \n\n " + contentStr);
Body newBody = (Body) org.docx4j.XmlUtils.unmarshalString(contentStr);
 


However, in your case this doesn't help, since every paragraph is actually different (in its trailing whitespace).

So you either need to make the trailing whitespace the same, or alter the eclipse.compare coarse-grained divide-and-conquer algorithm to ignore it.

cheers .. Jason

Re: diffx - OutOfMemory

Posted: Thu Sep 08, 2011 1:50 am
by suncity65
Hi Jason,

Thanks for your reply. In my case, the content for the docx file is read from an HTML textarea. In my Java code I read this text, split it at each \n, and create the paragraphs using:

Code:
String str = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" ><w:r><w:t>Text Here.....</w:t></w:r></w:p>";
wordMLPackage.getMainDocumentPart().addObject(org.docx4j.XmlUtils.unmarshalString(str));


The diff seems to work if the docx file has up to about 300 lines; above that, it crashes with the error.
Can you please suggest how I can create the docx file from the textarea contents?

Re: diffx - OutOfMemory

Posted: Mon Sep 12, 2011 5:43 pm
by suncity65
Hi Jason,

Can you please help me? I tried what I described in my previous reply.

Re: diffx - OutOfMemory

Posted: Mon Sep 12, 2011 8:40 pm
by jason
I have already told you what to do: preprocess the paragraphs to make the trailing white space the same.
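
For illustration, the preprocessing could look roughly like the sketch below. It is not taken from docx4j: the class ParagraphNormalizer and its methods are hypothetical names, and it assumes the textarea content arrives as a single String with \n or \r\n line breaks. Each line has its trailing whitespace stripped and is XML-escaped before being wrapped in the same w:p markup used earlier in the thread, so that unchanged lines produce identical paragraphs in both documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of docx4j: normalizes textarea lines before
// they are turned into WordprocessingML paragraphs.
public class ParagraphNormalizer {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Splits the text into lines and strips trailing whitespace from each line. */
    public static List<String> toNormalizedLines(String textAreaContent) {
        List<String> lines = new ArrayList<>();
        // Accept \r\n or \n as the separator; the -1 limit keeps trailing empty lines.
        for (String line : textAreaContent.split("\\r?\\n", -1)) {
            lines.add(line.replaceAll("\\s+$", ""));
        }
        return lines;
    }

    /** Escapes the characters that are not allowed in XML element content. */
    public static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps one normalized line in the w:p markup used earlier in the thread. */
    public static String toParagraphXml(String line) {
        return "<w:p xmlns:w=\"" + W_NS + "\"><w:r><w:t>"
                + escapeXml(line) + "</w:t></w:r></w:p>";
    }

    public static void main(String[] args) {
        String textAreaContent = "First line  \r\nSecond line\t\r\nThird line";
        for (String line : toNormalizedLines(textAreaContent)) {
            // In the poster's code, this string would be passed to
            // org.docx4j.XmlUtils.unmarshalString(...) and then addObject(...).
            System.out.println(toParagraphXml(line));
        }
    }
}
```

Applying the same normalization to both the older and the newer text means paragraphs that differ only in trailing whitespace become identical, which is what the paragraph-level pass needs in order to narrow down the work left for the fine-grained diff.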