How to process docx to replace bold tags

Posted: Mon Aug 29, 2016 2:33 pm
by gizwhu
I am converting docx to HTML, and currently bold text is not converted correctly.

The following is in document.xml before conversion:

Code:
<w:tr>
   <w:trPr/>
   <w:tc>
      <w:tcPr>
         <w:tcW w:w="9016" w:type="dxa"/>
         <w:gridSpan w:val="4"/>
      </w:tcPr>
      <w:p>
         <w:pPr>
            <w:ind w:left="0" w:firstLine="0" w:right="95"/>
            <w:rPr>
               <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
               <w:szCs w:val="24"/>
            </w:rPr>
         </w:pPr>
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
               <w:b w:val="on"/>
               <w:bCs w:val="on"/>
               <w:szCs w:val="24"/>
               <w:lang w:val="en-SG" w:eastAsia="en-GB"/>
            </w:rPr>
            <w:t xml:space="preserve">BOLD TEXT</w:t>
         </w:r>
      </w:p>
   </w:tc>
</w:tr>


Apparently the following works:
Code:
<w:tr>
   <w:trPr/>
   <w:tc>
      <w:tcPr>
         <w:tcW w:w="9016" w:type="dxa"/>
         <w:gridSpan w:val="4"/>
      </w:tcPr>
      <w:p>
         <w:pPr>
            <w:ind w:left="0" w:firstLine="0" w:right="95"/>
            <w:rPr>
               <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
               <w:szCs w:val="24"/>
            </w:rPr>
         </w:pPr>
         <w:r>
            <w:rPr>
               <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
               <w:b/>
               <w:bCs/>
               <w:szCs w:val="24"/>
               <w:lang w:val="en-SG" w:eastAsia="en-GB"/>
            </w:rPr>
            <w:t xml:space="preserve">BOLD TEXT</w:t>
         </w:r>
      </w:p>
   </w:tc>
</w:tr>


My question is: how do I process the docx to find every <w:b w:val="on"/> and replace it with <w:b/>?

Re: How to process docx to replace bold tags

Posted: Thu Sep 01, 2016 2:48 pm
by jason
You have:

Code:
<w:b w:val="on"/>
<w:bCs w:val="on"/>


If you look at https://github.com/plutext/docx4j/blob/ ... t.java#L44 you'll see that the bold property is of type BooleanDefaultTrue:

https://github.com/plutext/docx4j/blob/ ... tTrue.java

Now have a look at https://github.com/plutext/docx4j/blob/ ... l.xsd#L293

Code:
        <xsd:complexType name="BooleanDefaultTrue">
                <!--
               
                Replaces CT_OnOff, for more intuitive field generation by JAXB.
               
                http://www.w3.org/TR/2001/REC-xmlschema ... 2/#boolean
                says ·boolean· can have the following legal literals {true, false, 1, 0}.
               
                Don't use this for elements where Word 2007 passes 'on' or 'off' (which type="xsd:boolean" default="true" allows).
               
                Note that a third party processor could choose to use 'on' or 'off', even where
                Word doesn't.  This would cause trouble.
               
                Hopefully the result of standardisation will be to formally get rid
                of 'on' and 'off', and use xsd:boolean instead.  
               
                Equivalent type="xsd:boolean" default="true" formulation:                      
               
                <xsd:attribute name="legacy" use="optional" type="xsd:boolean"
                default="true">
                <xsd:annotation>
                <xsd:documentation>Use Legacy Numbering
                Properties</xsd:documentation>
                </xsd:annotation>
                </xsd:attribute>
               
               
                -->
                <xsd:attribute name="val" type="xsd:boolean" default="true">
                        <xsd:annotation>
                                <xsd:documentation>True/False Value (default is
                                        true)</xsd:documentation>
                        </xsd:annotation>
                </xsd:attribute>
        </xsd:complexType>


Note especially this part of the comment:

Note that a third party processor could choose to use 'on' or 'off', even where Word doesn't. This would cause trouble.
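
To make the mismatch concrete, here is a minimal Java sketch of the two ways a parser could treat the attribute. The class and method names are made up for illustration and are not part of docx4j or JAXB. The strict method accepts only the four xsd:boolean literals listed in the schema comment; the lenient one also accepts the legacy 'on' and 'off'.

```java
public class OnOffDemo {

    /** Strict xsd:boolean: only true, false, 1 and 0 are legal literals. */
    public static boolean parseXsdBoolean(String literal) {
        switch (literal.trim()) {
            case "true":
            case "1":
                return true;
            case "false":
            case "0":
                return false;
            default:
                throw new IllegalArgumentException(
                        "Not a legal xsd:boolean literal: " + literal);
        }
    }

    /** Lenient variant that additionally accepts the legacy 'on' and 'off'. */
    public static boolean parseOnOff(String literal) {
        String v = literal.trim();
        if (v.equals("on")) {
            return true;
        }
        if (v.equals("off")) {
            return false;
        }
        return parseXsdBoolean(v);
    }

    public static void main(String[] args) {
        // The lenient parser treats the third-party value as bold.
        System.out.println(parseOnOff("on"));
        // The strict parser rejects the same value.
        try {
            parseXsdBoolean("on");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A docx that uses 'on' therefore needs either a lenient reader or a clean-up step before a strict one sees it.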


It seems that Sun/Oracle JAXB in Java 1.8.0_05 quietly converts an illegal value such as "on" to "false".

MOXy, on the other hand, detects it as an issue:

Code:
Exception Description: The object [on], of class [class java.lang.String], from mapping [org.eclipse.persistence.oxm.m
INFO org.docx4j.jaxb.JaxbValidationEventHandler .handleEvent line 134 - continuing (with possible element/attribute loss)
WARN org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware .unmarshal line 479 -
Exception Description: The object [on], of class [class java.lang.String], from mapping [org.eclipse.persistence.oxm.mappings.XMLDirectMapping[val-->@ns0:val]] with descriptor [XMLDescriptor(org.docx4j.wml.BooleanDefaultTrue --> [DatabaseTable(ns3:webExtensionCreated), DatabaseTable(ns3:chartTrackingRefBased), DatabaseTable(ns3:collapsed), DatabaseTable(ns3:webExtensionLinked)])], could not be converted to [class java.lang.Boolean].
Local Exception Stack:
Exception [EclipseLink-3002] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.ConversionException
Exception Description: The object [on], of class [class java.lang.String], from mapping [org.eclipse.persistence.oxm.mappings.XMLDirectMapping[val-->@ns0:val]] with descriptor [XMLDescriptor(org.docx4j.wml.BooleanDefaultTrue --> [DatabaseTable(ns3:webExtensionCreated), DatabaseTable(ns3:chartTrackingRefBased), DatabaseTable(ns3:collapsed), DatabaseTable(ns3:webExtensionLinked)])], could not be converted to [class java.lang.Boolean].
   at org.eclipse.persistence.exceptions.ConversionException.couldNotBeConverted(ConversionException.java:75)
   at org.eclipse.persistence.internal.helper.ConversionManager.convertObjectToBoolean(ConversionManager.java:276)
:
   at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:895)
   at org.eclipse.persistence.oxm.XMLUnmarshaller.unmarshal(XMLUnmarshaller.java:659)
   at org.eclipse.persistence.jaxb.JAXBUnmarshaller.unmarshal(JAXBUnmarshaller.java:585)
   at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:445)
   at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:346)
   at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:159)
   at org.docx4j.openpackaging.parts.JaxbXmlPart.getJaxbElement(JaxbXmlPart.java:181)
   at org.docx4j.samples.OpenMainDocumentAndTraverse.main(OpenMainDocumentAndTraverse.java:92)
DEBUG org.docx4j.utils.ResourceUtils .getResourceViaProperty line 47 - docx4j.jaxb.JaxbValidationEventHandler resolved to custom-preprocessor.xslt
DEBUG org.docx4j.utils.ResourceUtils .getResource line 70 - Attempting to load: custom-preprocessor.xslt
WARN org.docx4j.utils.ResourceUtils .getResource line 81 - Couldn't get resource: custom-preprocessor.xslt
WARN org.docx4j.utils.ResourceUtils .getResourceViaProperty line 52 - custom-preprocessor.xslt: custom-preprocessor.xslt not found via classloader.
WARN org.docx4j.utils.ResourceUtils .getResourceViaProperty line 55 - Property docx4j.jaxb.JaxbValidationEventHandler resolved to missing resource custom-preprocessor.xslt; using org/docx4j/jaxb/mc-preprocessor.xslt
DEBUG org.docx4j.utils.ResourceUtils .getResource line 70 - Attempting to load: org/docx4j/jaxb/mc-preprocessor.xslt


So if you use MOXy, custom-preprocessor.xslt or mc-preprocessor.xslt could be used to fix the XML.

What software was used to create this docx?

Re: How to process docx to replace bold tags

Posted: Thu Sep 01, 2016 5:18 pm
by gizwhu
This docx is generated by a document template generator called "WindWard".

In my case, I did notice that JAXB quietly converts the value to false. Unfortunately, I need this value to be true.
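
One option that does not depend on the JAXB implementation would be to rewrite word/document.xml inside the docx (which is a zip file) before docx4j loads it. The following is only a rough sketch using the JDK's zip classes; the class name DocxBoldFixer is made up. It assumes the part is UTF-8 encoded, that the namespace prefix is "w", and that the tags are written exactly as in my sample above. It is a text-level shortcut, not an XML-aware fix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxBoldFixer {

    /** Rewrites on/off on w:b and w:bCs only; everything else is left untouched. */
    public static String fixBold(String xml) {
        return xml
                .replaceAll("<w:(b|bCs) w:val=\"on\"\\s*/>", "<w:$1/>")
                .replaceAll("<w:(b|bCs) w:val=\"off\"\\s*/>", "<w:$1 w:val=\"false\"/>");
    }

    /** Copies every zip entry, patching word/document.xml on the way through. */
    public static void fixDocx(InputStream in, OutputStream out) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in);
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // readAllBytes (Java 9+) stops at the end of the current entry.
                byte[] data = zin.readAllBytes();
                if (entry.getName().equals("word/document.xml")) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = fixBold(xml).getBytes(StandardCharsets.UTF_8);
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxBoldFixer input.docx output.docx
        try (InputStream in = Files.newInputStream(Path.of(args[0]));
             OutputStream out = Files.newOutputStream(Path.of(args[1]))) {
            fixDocx(in, out);
        }
    }
}
```

Headers, footers and styles.xml live in other parts, so the entry-name check would need widening if they carry the same values.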

Re: How to process docx to replace bold tags

Posted: Sun Sep 04, 2016 11:22 am
by jason
The most expedient way to handle this is to use MOXy as your JAXB implementation, which is just a matter of adding:

Code:
<dependency>
   <groupId>org.docx4j</groupId>
   <artifactId>docx4j-MOXy-JAXBContext</artifactId>
   <version>3.3.0</version>
</dependency>
<dependency>
   <groupId>org.eclipse.persistence</groupId>
   <artifactId>org.eclipse.persistence.moxy</artifactId>
   <version>2.5.2</version>
</dependency>
<dependency>
   <groupId>javax.mail</groupId>
   <artifactId>mail</artifactId>
   <version>1.4.7</version>
</dependency>


Then add a rule to org/docx4j/jaxb/mc-preprocessor.xslt to change "on" to 1 or true. It's better to copy mc-preprocessor.xslt to, say, mc-preprocessor-custom.xslt and make the change there. Tell docx4j about the copy via the docx4j property "docx4j.jaxb.JaxbValidationEventHandler", which is read here:

Code:
public static Templates getMcPreprocessor() throws IOException, TransformerConfigurationException {

   if (mcPreprocessorXslt == null) {

      Source xsltSource = new StreamSource(
            ResourceUtils.getResourceViaProperty("docx4j.jaxb.JaxbValidationEventHandler",
                  "org/docx4j/jaxb/mc-preprocessor.xslt"));
      mcPreprocessorXslt = XmlUtils.getTransformerTemplate(xsltSource);
   }

   return mcPreprocessorXslt;
}
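
As a rough idea of what such a rule might look like, here is a standalone sketch that runs a small stylesheet through the JDK's built-in XSLT processor. The stylesheet below is an assumption written for this example, not the contents of mc-preprocessor.xslt, and the class name OnOffXsltDemo is made up. It is deliberately limited to w:b and w:bCs; in a real copy of the preprocessor only the two on/off templates would be added.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class OnOffXsltDemo {

    // Identity transform plus two rules that rewrite on/off on bold elements.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"w:b/@w:val[.='on'] | w:bCs/@w:val[.='on']\">"
        + "<xsl:attribute name='w:val'>true</xsl:attribute>"
        + "</xsl:template>"
        + "<xsl:template match=\"w:b/@w:val[.='off'] | w:bCs/@w:val[.='off']\">"
        + "<xsl:attribute name='w:val'>false</xsl:attribute>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to an XML string and returns the result. */
    public static String normalise(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String sample =
              "<w:rPr xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
            + "<w:b w:val='on'/><w:bCs w:val='on'/>"
            + "</w:rPr>";
        System.out.println(normalise(sample));
    }
}
```

Other toggle properties (italic, strike and so on) could be added to the match patterns in the same way if the generator writes on/off for those too.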

Re: How to process docx to replace bold tags

Posted: Sun Sep 11, 2016 8:46 pm
by gizwhu
How do I "tell docx4j about this via docx4j property 'docx4j.jaxb.JaxbValidationEventHandler'"?

I am using the following: https://github.com/plutext/docx-html-editor

Re: How to process docx to replace bold tags

Posted: Mon Sep 12, 2016 11:09 am
by jason
Include it in a docx4j.properties file on your classpath; an example is at https://github.com/plutext/docx4j/blob/ ... rties#L109
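
For illustration, the entry might look like the string in the sketch below. The stylesheet name mc-preprocessor-custom.xslt is the hypothetical copy suggested earlier in the thread, and the class name is made up. The sketch uses only java.util.Properties to show how the key resolves, with the same fallback as getMcPreprocessor() above, and then checks whether both files are visible on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class Docx4jPropertiesSketch {

    static final String KEY = "docx4j.jaxb.JaxbValidationEventHandler";

    // What the relevant line of docx4j.properties might look like (assumed file name).
    static final String EXAMPLE = KEY + "=mc-preprocessor-custom.xslt\n";

    /** Returns the configured stylesheet, or the default one if the key is absent. */
    public static String configuredPreprocessor(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props.getProperty(KEY, "org/docx4j/jaxb/mc-preprocessor.xslt");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(configuredPreprocessor(EXAMPLE));

        // Both lookups print null if the file is not on the classpath.
        ClassLoader cl = Docx4jPropertiesSketch.class.getClassLoader();
        System.out.println(cl.getResource("docx4j.properties"));
        System.out.println(cl.getResource("mc-preprocessor-custom.xslt"));
    }
}
```

If either lookup prints null, the file is most likely in the wrong place, for example outside src/main/resources in a Maven project.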