file size differences

Posted: Fri May 18, 2012 7:16 pm
by Binko
Well,

thank you.

It is not a great problem. I have written a workaround, though it would still be better to use the titles rather than the descriptions to identify the charts, and to keep some other text in the descriptions, namely a description ;).

But loading and saving through WordprocessingMLPackage seems more interesting. I think the files should be identical there, but I see a difference of 200 K, and that's quite a lot.

Re: file size differences

Posted: Sat May 19, 2012 10:11 am
by jason
Thanks for emailing me your docm.

I don't think there is anything to worry about. In summary, 3 things explain the file size differences:

1. differences in zip implementation (Microsoft versus Java)
2. namespaces
3. mc:AlternateContent handling

I will explain these in turn, but first, my methodology. I used docx4j to open and save your document, so I had INPUT.docm and OUTPUT.docm to compare. I renamed them to .zip, and opened each in a zip tool (t-zip). It was then easy to see the size of each file.

I encourage you to do the same. You may find some things that my quick analysis missed.
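If you would rather script the comparison than click through a zip tool, something like the following should do it. It is only a sketch using the standard java.util.zip classes; the class and method names (ZipSizeReport, totals) are made up for this post and are not part of docx4j. Note that ZipFile does not care about the file extension, so no rename to .zip is needed, and that entry names inside the package use forward slashes (word/...) even if your zip tool displays backslashes.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for this thread, not part of docx4j.
final class ZipSizeReport {

    /**
     * Sums the sizes of all entries whose name starts with the given prefix.
     * Returns { uncompressed bytes, compressed bytes }.
     */
    static long[] totals(String zipPath, String prefix) throws IOException {
        long actual = 0;
        long packed = 0;
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                if (e.isDirectory() || !e.getName().startsWith(prefix)) {
                    continue;
                }
                actual += e.getSize();            // size of the part once unpacked
                packed += e.getCompressedSize();  // size it occupies in the package
            }
        }
        return new long[] { actual, packed };
    }

    public static void main(String[] args) throws IOException {
        // Pass the files to compare, e.g. INPUT.docm OUTPUT.docm
        for (String path : args) {
            long[] t = totals(path, "word/");
            System.out.printf("%s word/**: actual %d bytes, packed %d bytes%n",
                    path, t[0], t[1]);
        }
    }
}
```

Running it with INPUT.docm and OUTPUT.docm as arguments prints one line per file, which makes it easy to see whether the difference is in the unpacked parts or in how they were packed.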

1. differences in zip implementation (Microsoft versus Java)


For IN word\**, actual size is ~609K, packed size ~390K
For OUT word\**, actual size is ~623K, packed size ~365K

So you can see that the OUT produced by docx4j is actually bigger, although it is packed more efficiently (Java zip implementation). It is probably bigger because of namespaces, see 2 below.
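To get a feel for how much the packed size can depend on the compressor alone, here is a small illustration with java.util.zip.Deflater. It makes no claim about which settings Word or docx4j actually use; it only shows that the same bytes come out at different sizes depending on the level. The class name DeflateLevels and the sample data are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only; says nothing about the settings Word or docx4j really use.
final class DeflateLevels {

    /** Returns the number of bytes produced when deflating data at the given level (0-9). */
    static int deflatedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buf = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    public static void main(String[] args) {
        // Synthetic, repetitive XML-like input; not taken from the document in this thread.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("<w:p><w:r><w:t>paragraph ").append(i).append("</w:t></w:r></w:p>");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println("raw:              " + data.length);
        System.out.println("BEST_SPEED:       " + deflatedSize(data, Deflater.BEST_SPEED));
        System.out.println("BEST_COMPRESSION: " + deflatedSize(data, Deflater.BEST_COMPRESSION));
    }
}
```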

2. namespaces

docx4j (JAXB) always writes all namespaces in the relevant JAXB context, which makes the file a bit bigger, for example:

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:ns6="http://schemas.openxmlformats.org/schemaLibrary/2006/main" xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:ns8="http://schemas.openxmlformats.org/drawingml/2006/chartDrawing" xmlns:dgm="http://schemas.openxmlformats.org/drawingml/2006/diagram" xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture" xmlns:ns11="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:dsp="http://schemas.microsoft.com/office/drawing/2008/diagram" xmlns:ns13="urn:schemas-microsoft-com:office:excel" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:ns17="urn:schemas-microsoft-com:office:powerpoint" xmlns:odx="http://opendope.org/xpaths" xmlns:odc="http://opendope.org/conditions" xmlns:odq="http://opendope.org/questions" xmlns:odi="http://opendope.org/components" xmlns:odgm="http://opendope.org/SmartArt/DataHierarchy" xmlns:ns24="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns:ns25="http://schemas.openxmlformats.org/drawingml/2006/compatibility" xmlns:ns26="http://schemas.openxmlformats.org/drawingml/2006/lockedCanvas">


as compared with:

<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
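If you want to put a number on the namespace overhead for a given part, a rough way is to count the declarations on its root element. The sketch below uses the StAX API that ships with the JDK; the class name RootNamespaces is invented for this post, and the character count is only an estimate based on the `xmlns:prefix="uri"` form, not a measurement of the actual serialized bytes.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for this thread, not part of docx4j.
final class RootNamespaces {

    /**
     * Inspects the root element of the given XML.
     * Returns { number of xmlns declarations, approximate characters they occupy }.
     */
    static int[] measure(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                int count = reader.getNamespaceCount();
                int chars = 0;
                for (int i = 0; i < count; i++) {
                    String prefix = reader.getNamespacePrefix(i); // null for the default namespace
                    String uri = reader.getNamespaceURI(i);
                    // ' xmlns' + optional ':prefix' + '="' + uri + '"'
                    chars += 6 + (prefix == null ? 0 : 1 + prefix.length())
                            + 2 + (uri == null ? 0 : uri.length()) + 1;
                }
                return new int[] { count, chars };
            }
            return new int[] { 0, 0 };
        } finally {
            reader.close();
        }
    }
}
```

Feeding it the document.xml from the input and from the output package gives a per-part comparison. Since the same set of declarations is written on every part in the relevant JAXB context, the overhead adds up across parts.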


3. mc:AlternateContent handling

Where docx4j encounters Word 2010 Microsoft specific extensions which aren't part of the ECMA/ISO base standard, it falls back to

In this situation, you'll see something like:

19.05.2012 08:21:01 *INFO * Part: /word/charts/chart1.xml (Part.java, line 150)
19.05.2012 08:21:01 *INFO * JaxbXmlPart: encountered unexpected content; pre-processing (JaxbXmlPart.java, line 243)
19.05.2012 08:21:01 *WARN * XSLTUtils: Found some mc:AlternateContent (XSLTUtils.java, line 16)
19.05.2012 08:21:01 *WARN * XSLTUtils: Selecting c:style (XSLTUtils.java, line 16)
19.05.2012 08:21:01 *DEBUG* RelationshipsPart: Loading part /word/charts/chart1.xml (RelationshipsPart.java, line 376)


The mc:AlternateContent will be used and the Word 2010 stuff dropped. In effect, the docx becomes a Word 2007 docx. In the case of the document you provided, this affects the 4 charts (though they still all end up bigger, because of the namespaces).
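To check in advance which parts of a package contain mc:AlternateContent, you can scan the part XML for elements in the markup-compatibility namespace (the same xmlns:mc URI shown in the Word-produced root element above). This is a sketch with the JDK's StAX parser, not the way docx4j does it internally (the log above shows docx4j pre-processing via XSLTUtils). The class name AlternateContentScan and the sample XML in the usage note are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for this thread, not part of docx4j.
final class AlternateContentScan {

    static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /**
     * Counts markup-compatibility elements in the given part XML.
     * Returns { AlternateContent, Choice, Fallback } counts.
     */
    static int[] count(String xml) throws XMLStreamException {
        int[] counts = new int[3];
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT
                        || !MC_NS.equals(reader.getNamespaceURI())) {
                    continue;
                }
                switch (reader.getLocalName()) {
                    case "AlternateContent": counts[0]++; break;
                    case "Choice":           counts[1]++; break;
                    case "Fallback":         counts[2]++; break;
                    default: break; // some other element in the mc namespace
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }
}
```

For example, a made-up chart fragment such as `<c:chartSpace xmlns:c="urn:c" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"><mc:AlternateContent><mc:Choice Requires="c14"><c:x/></mc:Choice><mc:Fallback><c:style val="2"/></mc:Fallback></mc:AlternateContent></c:chartSpace>` yields one of each. Any part with a non-zero count is a candidate for the pre-processing shown in the log.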

Re: file size differences

Posted: Sat May 19, 2012 8:29 pm
by Binko
Well,

actually, my problem was not so much with the functionality; it was that the processes did not terminate normally.

I am working in the Eclipse IDE, and with the code that I sent you the processes do not terminate. I tried it in an RCP application and in a normal Java project, and in both cases a javaw.exe process was created but not terminated when the program ended.

Then I checked where this comes from and found the vbaData.xml and styleWithEffects.xml discrepancies. When I commented out these two, the processes terminated. Otherwise I got this ClassNotFound DocumentBuilderImpl exception.

So where can I get the latest version of docx4j? I downloaded it via svn from the link I wrote. Is that the latest development version?

Re: file size differences

Posted: Sat May 19, 2012 9:39 pm
by jason
As per my other reply, get the latest version from GitHub, not svn. See further http://www.docx4java.org/blog/2012/05/d ... n-eclipse/