
Word 2010 considerations

PostPosted: Fri Jul 01, 2011 11:35 am
by mimo_2020
Thanks for this great library. Do you know if 2.7.0 will support MS Office 2010 version of docx? Any progress on the OpenDoPE Word Add-In (ie will it be promoted from pre-alpha to alpha)?

Re: Feedback on docx4j 2.7.0 release candidate?

PostPosted: Sun Jul 03, 2011 2:42 pm
by jason
mimo_2020 wrote:Do you know if 2.7.0 will support MS Office 2010 version of docx?


Support will remain the same, i.e. Word 2010 documents should open fine, but any Word 2010-specific features will be silently dropped.

Is this behaviour causing issues for you?

Better support is non-trivial, requiring either customising JAXB or changes to the OpenXML schemas. I'd like to do one or the other, but I'm not sure this should be the highest priority.

mimo_2020 wrote:Any progress on the OpenDoPE Word Add-In (ie will it be promoted from pre-alpha to alpha)?


The OpenDoPE Word Add-In evolves independently of docx4j. I'm planning to publish it to Codeplex as its own project under the GPL.

Re: Feedback on docx4j 2.7.0 release candidate?

PostPosted: Tue Jul 05, 2011 9:42 pm
by tinne
Two issues I think bother most:
  • The alternate content issue. Word 2010, even if set to Word 2007 compatibility mode for a file, chooses to add alternate content for every drawing included. Manually removing the primary choices for the "fallback" ones makes the relevant document parts docx4j-processable, but every "save as" restores the alternate content sections, and jaxb throws the whole subtrees away. I see no proper solution unless to add them to the schemas some day and regenerate the jaxb classes, but knowing you edited them I guess this could be a major issue. I started to write my own schema for the mc-namespace (according to a ms forum, "none is necessary because you need to manipulate them programmatically either way") and added element and attribute reference to my copies of wml.xsd etc., but I'm far from complete.

  • Custom XML and Word 2010 bibliographies. If you create a new document in Word 2010, it brings a CustomXML part containing an empty bibliography as a CustomXML part, which is unknown to the OpenDoPE-Word-Plug-in and - as long as there are no bibliography references (didn't test with) - without relationships to any document part and thus ignored by the docx4j part loader. However, sometimes the Word-Plug-in gets confused and presents the bibliography part instead of the CustomXML part I'm interested in, and I did not find out why and how to mend this, except for manually removing the whole bibliography part and its relationship files from the docx archive. Therefore, it could be a leap forward to first add all Word 2010 namespaces to the namespace mapper and explicitly rule out CustomXML parts adhering to the bibliography namespace from being identified as normal docx4j CustomXML parts.

Re: Feedback on docx4j 2.7.0 release candidate?

PostPosted: Wed Jul 06, 2011 12:38 am
by jason
tinne wrote:Manually removing the primary choices for the "fallback" ones makes the relevant document parts docx4j-processable, but every "save as" restores the alternate content sections, and jaxb throws the whole subtrees away.


Do you have Word 2010 documents which docx4j fails to open, or corrupts on saving? This would be a high priority issue.

As far as I'm aware, docx4j can open all Word 2010 docx; it is just that it (jaxb) drops the Word 2010 specific content.

See the Microsoft document [MS-DOCX] Word Extensions ... for what that is.

I'd be interested to know which of the mc:AlternateContent/mc:Fallback and other stuff people prioritise most highly. Surely not glowing text!?

tinne wrote: I see no proper solution unless to add them to the schemas some day and regenerate the jaxb classes,


Either that (which is do-able, but would take a few days at least, since it's not just WML but DrawingML etc., and it runs the risk of introducing bugs), or implement a "preprocessing model for markup consumption", or possibly add support for NVDL to JAXB.

tinne wrote:Custom XML and Word 2010 bibliographies. If you create a new document in Word 2010, it brings a CustomXML part containing an empty bibliography as a CustomXML part, which is unknown to the OpenDoPE-Word-Plug-in and - as long as there are no bibliography references (didn't test with) - without relationships to any document part and thus ignored by the docx4j part loader. However, sometimes the Word-Plug-in gets confused and presents the bibliography part instead of the CustomXML part I'm interested in, and I did not find out why and how to mend this, except for manually removing the whole bibliography part and its relationship files from the docx archive.


This needs to be fixed in the add-in. There are also "Office Well Defined Custom XML Parts" to be considered...

tinne wrote:Therefore, it could be a leap forward to first add all Word 2010 namespaces to the namespace mapper and explicitly rule out CustomXML parts adhering to the bibliography namespace from being identified as normal docx4j CustomXML parts.


2.7.0 has org.docx4j.openpackaging.parts.WordprocessingML.BibliographyPart; it ought to be recognised as such when the docx is loaded.

There is work to do re other 2010 namespaces; I'm not inclined to delay 2.7.0 for this.

Re: Is it possible to update xmlns of a docx

PostPosted: Fri Jul 08, 2011 1:20 am
by docx4j_user
I am facing much the same problem.

I am doing simple text substitution, but after saving the document I observed that some of the namespace declarations were missing and some prefixes had been altered compared to the original document.

Thanks.

Re: Is it possible to update xmlns of a docx

PostPosted: Fri Jul 08, 2011 7:17 pm
by docx4j_user
I have attached two files here. The original has one piece of text and one shape, but text substitution removed the oval shape. I tried the same thing with a complex document which has tables, checkboxes etc., and in that document the oval shape was preserved and the output was as expected.

The original namespaces were:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 wp14">
<w:body>

But after text substitution they were converted to:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"
xmlns:ns7="http://schemas.openxmlformats.org/schemaLibrary/2006/main"
xmlns:ns8="http://schemas.openxmlformats.org/drawingml/2006/chart"
xmlns:ns9="http://schemas.openxmlformats.org/drawingml/2006/chartDrawing"
xmlns:ns10="http://schemas.openxmlformats.org/drawingml/2006/diagram"
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
xmlns:ns12="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:ns15="http://opendope.org/xpaths"
xmlns:ns16="http://opendope.org/conditions"
xmlns:ns17="http://opendope.org/questions"
xmlns:ns18="http://opendope.org/components"
xmlns:ns19="http://schemas.openxmlformats.org/drawingml/2006/compatibility"
xmlns:ns20="http://schemas.openxmlformats.org/drawingml/2006/lockedCanvas">
<w:body>

For text substitution I have tried two approaches:
1) searching for the placeholders and substituting values
Code:
List<Object> jaxbNodesViaXPath = documentMainPart.getJAXBNodesViaXPath("//w:t", false);
String xml = XmlUtils.marshaltoString(wmlDocumentEl, false);
Document document = (Document) XmlUtils.unmarshallFromTemplate(xml, tagValues);

2) iterating over
Code:
wmlDocumentEl.getBody().getEGBlockLevelElts()
for paragraph and table nodes and replacing text using
Code:
textNode.setValue("RED")


I used Office 2010 to create the original document, and tried both docx4j 2.6.0 and a 2.7.0 snapshot.

Re: Is it possible to update xmlns of a docx

PostPosted: Fri Jul 08, 2011 7:59 pm
by jason
Thank you for posting the example.

The problem here is that content wrapped in mc:AlternateContent is dropped.

        <mc:AlternateContent>
          <mc:Choice Requires="wps">
            <w:drawing>
              <wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251659264" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1">
                <wp:simplePos x="0" y="0"/>
                <wp:positionH relativeFrom="column">
                  <wp:posOffset>584200</wp:posOffset>
                </wp:positionH>
                <wp:positionV relativeFrom="paragraph">
                  <wp:posOffset>127000</wp:posOffset>
                </wp:positionV>
                <wp:extent cx="584200" cy="374650"/>
                <wp:effectExtent l="0" t="0" r="25400" b="25400"/>
                <wp:wrapNone/>
                <wp:docPr id="1" name="Oval 1"/>
                <wp:cNvGraphicFramePr/>
                <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
                  <a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
                    <wps:wsp>
                      <wps:cNvSpPr/>
                      <wps:spPr>
                        <a:xfrm>
                          <a:off x="0" y="0"/>
                          <a:ext cx="584200" cy="374650"/>
                        </a:xfrm>
                        <a:prstGeom prst="ellipse">
                          <a:avLst/>
                        </a:prstGeom>
                      </wps:spPr>
                      <wps:style>
                        <a:lnRef idx="2">
                          <a:schemeClr val="accent1">
                            <a:shade val="50000"/>
                          </a:schemeClr>
                        </a:lnRef>
                        <a:fillRef idx="1">
                          <a:schemeClr val="accent1"/>
                        </a:fillRef>
                        <a:effectRef idx="0">
                          <a:schemeClr val="accent1"/>
                        </a:effectRef>
                        <a:fontRef idx="minor">
                          <a:schemeClr val="lt1"/>
                        </a:fontRef>
                      </wps:style>
                      <wps:bodyPr rot="0" spcFirstLastPara="0" vertOverflow="overflow" horzOverflow="overflow" vert="horz" wrap="square" lIns="91440" tIns="45720" rIns="91440" bIns="45720" numCol="1" spcCol="0" rtlCol="0" fromWordArt="0" anchor="ctr" anchorCtr="0" forceAA="0" compatLnSpc="1">
                        <a:prstTxWarp prst="textNoShape">
                          <a:avLst/>
                        </a:prstTxWarp>
                        <a:noAutofit/>
                      </wps:bodyPr>
                    </wps:wsp>
                  </a:graphicData>
                </a:graphic>
              </wp:anchor>
            </w:drawing>
          </mc:Choice>
          <mc:Fallback>
            <w:pict>
              <v:oval id="Oval 1" o:spid="_x0000_s1026" style="position:absolute;margin-left:46pt;margin-top:10pt;width:46pt;height:29.5pt;z-index:251659264;visibility:visible;mso-wrap-style:square;mso-wrap-distance-left:9pt;mso-wrap-distance-top:0;mso-wrap-distance-right:9pt;mso-wrap-distance-bottom:0;mso-position-horizontal:absolute;mso-position-horizontal-relative:text;mso-position-vertical:absolute;mso-position-vertical-relative:text;v-text-anchor:middle" o:gfxdata="UEsDBBQAB..AAAABAAEAPMAAADXBQAAAAA= " fillcolor="#4f81bd [3204]" strokecolor="#243f60 [1604]" strokeweight="2pt"/>
            </w:pict>
          </mc:Fallback>
        </mc:AlternateContent>
 


To fix this, docx4j needs better support for Word 2010 content.

A "preprocessing model for markup consumption" would be a good quick fix for this: when reading document.xml (for example), if JAXB warns us of unexpected content, then run document.xml through a preprocessor (for example, an XSLT) which extracts the relevant content from the mc:AlternateContent element, then feed the result to JAXB again.

In this case, we can't go with the w:drawing, since it uses a Word 2010 namespace which neither Word 2007 nor docx4j supports (xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"), so we'd keep the fallback w:pict.

The alternative to a preprocessing model is preserving the mc:AlternateContent element, which is a larger undertaking.
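
As a rough illustration of the choice described above, here is a minimal sketch using only the JDK's DOM API. The class AlternateContentChooser and its methods are hypothetical and not part of docx4j. It assumes the usual Markup Compatibility rule: keep the first mc:Choice whose Requires prefixes all resolve to namespaces the consumer understands, otherwise keep mc:Fallback.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration only; not a docx4j class. */
final class AlternateContentChooser {
    static final String MC_NS =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /**
     * Returns the mc:Choice or mc:Fallback child that a consumer understanding
     * only the given namespace URIs should keep, or null if there is neither.
     */
    static Element selectBranch(Element alternateContent, Set<String> understood) {
        Element fallback = null;
        for (Node n = alternateContent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !MC_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            Element e = (Element) n;
            if ("Choice".equals(e.getLocalName()) && requiresSatisfied(e, understood)) {
                return e; // first Choice we fully understand wins
            }
            if ("Fallback".equals(e.getLocalName())) {
                fallback = e;
            }
        }
        return fallback;
    }

    /** True if every prefix listed in Requires maps to an understood namespace. */
    static boolean requiresSatisfied(Element choice, Set<String> understood) {
        String requires = choice.getAttribute("Requires").trim();
        if (requires.isEmpty()) {
            return false;
        }
        for (String prefix : requires.split("\\s+")) {
            String uri = choice.lookupNamespaceURI(prefix);
            if (uri == null || !understood.contains(uri)) {
                return false;
            }
        }
        return true;
    }

    /** Parses a namespace-aware XML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

With an empty set of understood namespaces (the Word 2007 / docx4j situation for wps), selectBranch returns the mc:Fallback element, so the w:pict would be kept.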

Re: Word 2010 considerations

PostPosted: Sat Jul 09, 2011 7:58 pm
by jason
http://dev.plutext.org/trac/docx4j/changeset/1605 (and following) is a mc:AlternateContent preprocessor which always selects the content of the mc:Fallback element. It's a proof of concept, as opposed to a fully compliant preprocessor; it could however be fleshed out to cover additional cases (for which we need test docx files).

@docx4j_user, this works for your sample docx, in that the oval appears in the output.

The idea is that we try to unmarshal a JaxbXmlPart in the usual way, but if we get an unmarshal exception, then rather than just ignoring it (as docx4j 2.7.0 does, allowing the offending content to be dropped), we run the part through an XSLT which preprocesses the content. After doing that, we ought to be able to unmarshal the part.

In effect, this process means that we fall back to Word 2007 compatible content.
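
To make the idea concrete, here is a minimal sketch of such a retry using only the JDK's XSLT support. It is not the code from the changeset above. The class FallbackPreprocessor, the embedded stylesheet, and the unmarshaller parameter (a stand-in for the real JAXB call, assumed to signal failure with an unchecked exception) are all assumptions for illustration. Like the proof of concept, it always takes mc:Fallback and assumes one is present.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Function;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch; not the docx4j implementation. */
final class FallbackPreprocessor {

    // Identity transform, except that each mc:AlternateContent element is
    // replaced by the children of its mc:Fallback.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:mc='http://schemas.openxmlformats.org/markup-compatibility/2006'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='mc:AlternateContent'>"
        + "<xsl:apply-templates select='mc:Fallback/node()'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Returns the part XML with every mc:AlternateContent reduced to its fallback. */
    static String preprocess(String partXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(partXml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("preprocessing failed", e);
        }
    }

    /**
     * Tries the unmarshaller on the raw part first; if that fails, retries
     * once on the preprocessed XML.
     */
    static <T> T unmarshalWithFallback(String partXml, Function<String, T> unmarshaller) {
        try {
            return unmarshaller.apply(partXml);
        } catch (RuntimeException firstAttemptFailed) {
            return unmarshaller.apply(preprocess(partXml));
        }
    }
}
```

In real use the unmarshaller would wrap a JAXB Unmarshaller and rethrow its checked JAXBException as an unchecked one; that wrapping is left out here.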

The DII2009 presentation on Word 2010 dated 18 Sept 2009 is a useful overview. Slide 5 describes the Word supported extension methods:
1. ignorable markup for new elements
2. ignorable markup for new attributes on existing elements
3. alternate content block
4. preservation of unknown parts

docx4j should handle these as follows:
1. ignorable markup for new elements: ordinary JAXB processing will drop these, but now, only after preprocessing ie in a second attempt at unmarshalling
2. ignorable markup for new attributes on existing elements: same as 1 above.
3. alternate content block: new to current svn, handled in the preprocessing xslt, which currently always uses the fallback (and assumes it is present)
4. preservation of unknown parts: docx4j should handle these

As noted above, this is not a complete implementation of the Markup Compatibility spec by any means; the main thing it does is process mc:AlternateContent. The preprocessor xslt could be extended to handle the compatibility-rule attribute.
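
As one possible starting point for handling a compatibility-rule attribute such as mc:Ignorable, the sketch below resolves the prefixes listed on the root element to namespace URIs, using only the JDK's DOM API. The class IgnorableNamespaces is hypothetical and not part of docx4j. It only looks at the root element, although the attribute may appear on other elements too.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration only; not a docx4j class. */
final class IgnorableNamespaces {
    static final String MC_NS =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /**
     * Returns the namespace URIs named by the root element's mc:Ignorable
     * attribute, in declaration order. Prefixes that do not resolve are skipped.
     */
    static Set<String> of(String partXml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Element root = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(partXml)))
                    .getDocumentElement();

            Set<String> uris = new LinkedHashSet<>();
            String ignorable = root.getAttributeNS(MC_NS, "Ignorable").trim();
            if (!ignorable.isEmpty()) {
                for (String prefix : ignorable.split("\\s+")) {
                    String uri = root.lookupNamespaceURI(prefix);
                    if (uri != null) {
                        uris.add(uri);
                    }
                }
            }
            return uris;
        } catch (Exception e) {
            throw new IllegalStateException("could not read mc:Ignorable", e);
        }
    }
}
```

For the document.xml posted earlier in this thread (mc:Ignorable="w14 wp14"), this would yield the two Word 2010 namespaces bound to w14 and wp14. A preprocessor could then drop elements and attributes in those namespaces before unmarshalling.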

For the specific features covered in the DII presentation:
- new numbering format: docx4j will use any mc:Fallback in the numbering part
- text effects (eg glow, reflection) will be dropped (same as Word 2007)
- new stylesWithEffects part: should be round tripped
- enhanced graphics: docx4j will use the mc:Fallback (provided it is specified, and if not, the first Choice element present, in which cases you'd expect some Word 2010 enhancements to be dropped from whatever is there)
- MathML: new w:contentPart element is dropped (same as Word 2007)

A similar analysis needs to be done for pptx and xlsx.

Re: Word 2010 considerations

PostPosted: Fri Jul 15, 2011 7:47 pm
by docx4j_user
Thanks Jason,

I tried with nightly build @ 1619 and the conversion is happening ok now.

But there are still mismatches in namespaces, and I am not aware of the usage of each namespace. Could you verify and let me know whether these mismatches will cause any problem when using the document later on?

I have attached the new generated document.xml

Re: Word 2010 considerations

PostPosted: Fri Jul 15, 2011 11:43 pm
by jason
docx4j_user wrote:But there are still mismatches in namespaces, and as I am not aware of the usages of each namespace. I want you to verify and let me know that will these mismatch create any problem while using the document later on.


Perhaps you missed my earlier reply at docx-java-f6/word-2010-considerations-t795.html