Xhtml->DocX - Remove additional formatting

Posted: Fri Mar 16, 2018 1:11 am
by candide
Hi,

First of all, thank you for creating this incredibly useful library.

I'm trying to convert XHTML to Docx in a way that preserves the heading styles of my template DocX component. It's almost working: I do get a valid Microsoft Word document with my HTML adequately converted.

My only problem is that, while the headings (h1, h2, h3) are correctly mapped to my original template document's heading style, they don't look like the original ones. Indeed they all have additional "+ Times New Roman, black" specifications attached to their style definitions.

Looking at the XML output, I noticed that, indeed, some paragraphs have an additional "style" definition that looks like this:
Code:
<w:rPr>
  <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
  <w:b/>
  <w:color w:val="000000"/>
</w:rPr>


I have tried to filter those items with a custom XSLT, like so:

Code:
// Compile the stylesheet, which is loaded from the classpath
TransformerFactory factory = TransformerFactory.newInstance();
InputStream docxTransformer = DocxExportController.class.getResourceAsStream("/noFormatting.xslt");
Templates templates = factory.newTemplates(new StreamSource(docxTransformer));

// Run it over the main document part; the output is written to sw as text
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
wordMLPackage.getMainDocumentPart().transform(templates, null, result);


using the following XSLT:
Code:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:WX="http://schemas.microsoft.com/office/word/2003/auxHint"
            xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"
            xmlns:msxsl="urn:schemas-microsoft-com:xslt"
            xmlns:ext="http://www.xmllab.net/wordml2html/ext"
            xmlns:java="http://xml.apache.org/xalan/java"
            xmlns:xml="http://www.w3.org/XML/1998/namespace"
            version="1.0"
            exclude-result-prefixes="java msxsl ext o v WX aml w10">


   <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no" indent="yes" />
   <!-- doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" -->

   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>


   <xsl:template match="w:rPr"></xsl:template>



</xsl:stylesheet>
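
To see the effect of the stylesheet in isolation, here is a minimal sketch that uses only the JDK's XSLT support, with no docx4j involved. The class name, the trimmed-down inline stylesheet, and the sample paragraph are assumptions made up for illustration; the inline stylesheet keeps just the identity template and the empty w:rPr template from the one above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StripRunPropertiesDemo {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Same idea as the stylesheet above: identity copy plus an empty w:rPr template.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='w:rPr'/>"
        + "</xsl:stylesheet>";

    // Hypothetical sample: one heading paragraph carrying direct run formatting.
    static final String SAMPLE =
        "<w:document xmlns:w='" + W_NS + "'><w:body><w:p>"
        + "<w:pPr><w:pStyle w:val='Heading1'/></w:pPr>"
        + "<w:r><w:rPr>"
        + "<w:rFonts w:ascii='Times New Roman' w:hAnsi='Times New Roman'/>"
        + "<w:b/><w:color w:val='000000'/>"
        + "</w:rPr><w:t>Title</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    // Applies the stylesheet to an XML string and returns the serialized result.
    static String stripRunProperties(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The w:pStyle reference and the text survive; the w:rPr block does not.
        System.out.println(stripRunProperties(SAMPLE));
    }
}
```

Note that the empty template matches every w:rPr in the part, so intentional direct formatting (bold or italic runs in body text, for example) is removed along with the unwanted heading overrides.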


The result of that output indeed has the w:rPr nodes stripped out. But then, I can't seem to find a way to "re-serialize" that output to a valid Docx document. Here's my last best attempt:

Code:
         WordprocessingMLPackage pkg = new WordprocessingMLPackage();
         NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
         pkg.addTargetPart( ndp );
//         Object unmarshalled = XmlUtils.unmarshalString(sw.toString());
         org.docx4j.convert.in.FlatOpcXmlImporter xmlPackage =
               new org.docx4j.convert.in.FlatOpcXmlImporter(new StringBufferInputStream(sw.toString()));
         WordprocessingMLPackage transformedWordMLPackage = (WordprocessingMLPackage)xmlPackage.get();



This will trigger an exception:

Code:
unexpected element (uri:"http://schemas.openxmlformats.org/wordprocessingml/2006/main", local:"document"). Expected elements are <{http://schemas.microsoft.com/office/2006/xmlPackage}package>,<{http://schemas.microsoft.com/office/2006/xmlPackage}xmlData>
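
Taken at face value, the message seems to say that the importer was expecting a root element in the xmlPackage namespace (package or xmlData), whereas the transformed string starts with a w:document element, i.e. a single document part rather than a whole package. The following is a small JDK-only sketch for checking which kind of root an XML string has; the class name, method name, and returned labels are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RootElementCheck {

    // Namespace URIs copied from the exception message above.
    static final String PKG_NS =
        "http://schemas.microsoft.com/office/2006/xmlPackage";
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns a hypothetical label describing the root element of the XML.
    static String describeRoot(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for getNamespaceURI/getLocalName
        Element root = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();

        String ns = root.getNamespaceURI();
        String name = root.getLocalName();
        if (PKG_NS.equals(ns) && "package".equals(name)) {
            return "flat-opc-package";
        }
        if (W_NS.equals(ns) && "document".equals(name)) {
            return "main-document-part";
        }
        return "other";
    }

    public static void main(String[] args) throws Exception {
        String part = "<w:document xmlns:w='" + W_NS + "'><w:body/></w:document>";
        String pkg = "<pkg:package xmlns:pkg='" + PKG_NS + "'/>";
        System.out.println(describeRoot(part));
        System.out.println(describeRoot(pkg));
    }
}
```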


I'm not exactly sure what is going on here, or if I'm taking the right approach. Can something be configured at the API level to disable those pesky style attributes? Is running an XSLT on a Docx a valid method? If yes, are there documented examples of valid XSLT programs for postprocessing? What does the above error mean?

Thank you in advance for your help,

Best Regards,

Candide

Re: Xhtml->DocX - Remove additional formatting

Posted: Fri Mar 16, 2018 6:20 am
by jason
Your last attempt doesn't make much sense, so I'll ignore that for now.

But the custom XSLT approach should be fine.

As per the Javadoc:

Code:
     * If you do want to replace the content in this part, convert your result to
     * an element or input stream, then invoke unmarshal on it, then setContents.
     * (Unmarshal takes care of any unexpected content, sidestepping the issue of
     *  whether to do that before the transform (where reading the part directly),
     *  or after).


You just need to change:

Code:
StreamResult result = new StreamResult(sw);
wordMLPackage.getMainDocumentPart().transform(templates, null, result);


to

Code:
DOMResult result = new DOMResult();

MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
mdp.transform(templates, null, result);

// replace the contents in the WordprocessingMLPackage
org.w3c.dom.Document domDoc = (org.w3c.dom.Document)result.getNode();
mdp.setContents(mdp.unmarshal(domDoc.getDocumentElement()));
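
The DOMResult half of that change can be tried without docx4j. The sketch below runs the same kind of stylesheet against a hypothetical sample string using only JDK classes and returns the root element of the resulting DOM; the class name, sample XML, and inline stylesheet are assumptions for illustration. The returned element is what would be handed to mdp.unmarshal(...) and then mdp.setContents(...) in the snippet above; those docx4j calls are left out here so the example runs on the JDK alone.

```java
import java.io.StringReader;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomResultDemo {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Identity copy plus an empty template that drops w:rPr elements.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='w:rPr'/>"
        + "</xsl:stylesheet>";

    // Hypothetical sample standing in for the main document part's XML.
    static final String SAMPLE =
        "<w:document xmlns:w='" + W_NS + "'><w:body><w:p>"
        + "<w:r><w:rPr><w:b/></w:rPr><w:t>Title</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    // Transforms into a DOM tree instead of a string and returns its root.
    static Element transformToElement(String xml) throws Exception {
        Templates templates = TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(XSLT)));

        // With no target node supplied, the transformer creates a new Document.
        DOMResult result = new DOMResult();
        templates.newTransformer()
            .transform(new StreamSource(new StringReader(xml)), result);

        Document domDoc = (Document) result.getNode();
        return domDoc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = transformToElement(SAMPLE);
        System.out.println(root.getLocalName());
        System.out.println(root.getElementsByTagNameNS(W_NS, "rPr").getLength());
    }
}
```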

Re: Xhtml->DocX - Remove additional formatting

Posted: Fri Mar 16, 2018 8:16 pm
by candide
Awesome!
Works great, thanks a ton, Jason.

Re: Xhtml->DocX - Remove additional formatting

Posted: Wed Apr 20, 2022 6:19 pm
by mithilesh.jha
@candide, can you please attach the full code to convert XHTML to DOCX?

Re: Xhtml->DocX - Remove additional formatting

Posted: Thu Apr 21, 2022 7:31 am
by jason