
XHTMLImporter and headings

Posted: Tue Mar 11, 2014 9:05 pm
by willi.firulais
Hello,

Do you have any hints on which properties (or anything else) to set to get headings transformed?

XHTMLImporter is a really great approach for getting .docx documents out of .xhtml markup.

I've played around with XHTMLImporter and noticed that the h1, h2 and h3 tags are rendered as bold text in the Word document.
But I would have expected:

h1 .. heading1
h2 .. heading2
h3 .. heading3

Thx in advance for any hint,
Willi

P.S. With this line of code, the Heading1 style works in general in the transformed Word document:

wordMLPackage.getMainDocumentPart().addStyledParagraphOfText("Heading1", "As is Heading1");

Re: XHTMLImporter and headings

Posted: Thu Mar 13, 2014 8:27 pm
by jason
willi.firulais wrote: I would have expected:

h1 .. heading1
h2 .. heading2
h3 .. heading3


I'll look to add an option to do this in the next week or so.

Re: XHTMLImporter and headings

Posted: Thu Mar 13, 2014 11:32 pm
by willi.firulais
jason wrote:I'll look to add an option to do this in the next week or so.


From your statement I assume that there is currently no mapping.
But it's really great to hear that such a feature will be in XHTMLImporter shortly.

Thx a lot,
Willi

Re: XHTMLImporter and headings

Posted: Fri Mar 14, 2014 9:02 pm
by willi.firulais
As a simple workaround I've used HtmlCleaner to add a CSS class to the h1 tag so that XHTMLImporter can match the class with the WordML style name.

The disadvantage of this workaround is that all CSS style information inherited from, e.g., the HTML page is applied to the paragraph. If a style is defined in Word for headings (e.g. because a Word template is loaded before the transformation), it would be great, as an enhancement request, if the Word style superseded the CSS style.

Code: Select all
<h1>My Chapter</h1>

gets transformed to
Code: Select all
<h1 class="Heading1">My Chapter</h1>


h1.class=Heading1
or
h1.class=berschrift1 .. note that the heading style in German Word is named "berschrift1"

Code: Select all
       
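import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.CleanerTransformations;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.TagTransformation;

HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();

// Rewrite every <h1> as <h1 class="Heading1">, preserving its other attributes
CleanerTransformations transformations = new CleanerTransformations();
TagTransformation tt = new TagTransformation("h1", "h1", true);
tt.addAttributeTransformation("class", "Heading1");
transformations.addTransformation(tt);
props.setCleanerTransformations(transformations);

// xhtml holds the input markup as a String
TagNode tagNode = cleaner.clean(xhtml);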
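
For readers who would rather not add HtmlCleaner, the same pre-processing idea can be sketched with the JDK's own DOM classes. This is not the approach used in the post above, just a dependency-free variant. The class name `HeadingClassTagger` and the choice to map h1..h6 to "Heading1".."Heading6" are assumptions made for illustration, and the sketch expects well-formed XHTML without a DOCTYPE (a DOCTYPE may cause the parser to try to fetch the DTD).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or HtmlCleaner.
final class HeadingClassTagger {

    /** Adds class="Heading<n>" to every h1..h6 that has no class attribute yet. */
    static String tag(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            for (int level = 1; level <= 6; level++) {
                NodeList headings = doc.getElementsByTagName("h" + level);
                for (int i = 0; i < headings.getLength(); i++) {
                    Element h = (Element) headings.item(i);
                    if (!h.hasAttribute("class")) {
                        // Assumed naming scheme; use the style ID of your target docx
                        h.setAttribute("class", "Heading" + level);
                    }
                }
            }
            // Serialize the modified DOM back to a string
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not tag headings", e);
        }
    }
}
```

The returned string can then be handed to the importer in place of the original markup. Headings that already carry a class are left alone, so hand-written classes keep priority.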

Re: XHTMLImporter and headings

Posted: Fri Mar 14, 2014 10:25 pm
by jason
There is:

/**
 * CLASS_TO_STYLE_ONLY: a Word style matching a class attribute will
 * be used, and nothing else
 *
 * CLASS_PLUS_OTHER: a Word style matching a class attribute will
 * be used; other css will be translated to direct formatting
 *
 * IGNORE_CLASS: css will be translated to direct formatting
 *
 */

public enum FormattingOption {

        CLASS_TO_STYLE_ONLY, CLASS_PLUS_OTHER, IGNORE_CLASS;
}
 


In XHTMLImporterImpl, there is setParagraphFormatting. The default is CLASS_PLUS_OTHER, but it sounds like you want CLASS_TO_STYLE_ONLY (in which case you can disregard what follows).

CLASS_PLUS_OTHER

If you were to use CLASS_PLUS_OTHER, it can be useful to have CSS on your HTML which matches your target docx. This prevents unwanted default CSS values from taking effect.

You can use HtmlCssHelper.createCssForStyles to generate that.

For the StyleTree arg, you can do:

StyleTree styleTree = wordMLPackage.getMainDocumentPart().getStyleTree();

Note that the styles which Word shows in its user interface aren't necessarily defined in the styles part of the docx. Typically, Word only writes an actual definition in the styles part if the style is actually being used in the document.

Of the styles which are actually defined, docx4j typically builds a StyleTree from that subset which are actually used somewhere in the document:

        /**
         * Build a StyleTree for stylesInUse.
         *
         * @param stylesInUse styles actually in use in the main document part, headers/footers, footnotes/endnotes
         * @param allStyles styles defined in the style definitions part
         */

        public StyleTree(Set<String> stylesInUse, Map<String, Style> allStyles)
 


Your first step then is to ensure the styles you are interested in are actually defined in styles.xml.

After that, you could define a Set<String> stylesInUse which specifies all defined styles (ie the keys in Map<String, Style> allStyles) and use that to construct StyleTree.

For XHTML import purposes I guess it could be useful to add a constructor:

public StyleTree(Map<String, Style> allStyles)
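
Until such a constructor exists, the "all defined styles" set described above can be derived from the map's keys. The following is a minimal sketch of just that step, written against a generic map so it runs without docx4j. The class and method names are hypothetical, and how you obtain `allStyles` from your package is left out.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of docx4j.
final class AllStylesInUse {

    /**
     * Returns a copy of the map's keys, so that every defined style counts as "in use".
     * A copy (rather than the live keySet view) stays stable if the map changes later.
     */
    static <S> Set<String> allStyleKeys(Map<String, S> allStyles) {
        return new LinkedHashSet<>(allStyles.keySet());
    }
}
```

With docx4j on the classpath, the result would be passed as the first argument of the two-argument constructor quoted above, i.e. `new StyleTree(AllStylesInUse.allStyleKeys(allStyles), allStyles)`.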

Re: XHTMLImporter and headings

Posted: Mon Mar 17, 2014 11:08 am
by jason
willi.firulais wrote: As a simple workaround I've used HtmlCleaner to add a CSS class to the h1 tag so that XHTMLImporter can match the class with the WordML style name.


I'm considering something similar .. a mapping of element names to Word styles.

In the CLASS_TO_STYLE_ONLY and CLASS_PLUS_OTHER cases, the mapping would be used only if there was no class value (or there was no Word style whose name equals the class value?), i.e. @class trumps element name.

In the IGNORE_CLASS case, the mapping of element names to Word styles could always be used. If you don't want that, just make the map empty.
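
The precedence rule proposed here can be written down as a small function. This is only a sketch of the rule as described in this post, not docx4j's implementation. The class `HeadingStyleResolver` and its nested enum are local stand-ins, and the sketch settles the open question above by assuming that a class value wins only when a Word style of that name exists.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposed rule, not docx4j code.
final class HeadingStyleResolver {

    /** Local stand-in for the FormattingOption enum quoted earlier in the thread. */
    enum FormattingOption { CLASS_TO_STYLE_ONLY, CLASS_PLUS_OTHER, IGNORE_CLASS }

    /**
     * Picks the Word style for an element, or returns null if none applies.
     * Assumption: a class value only wins when a Word style with that name exists.
     */
    static String resolve(FormattingOption option, String classVal, String elementName,
                          Set<String> wordStyles, Map<String, String> elementToStyle) {
        boolean classMatches = classVal != null && wordStyles.contains(classVal);
        if (option != FormattingOption.IGNORE_CLASS && classMatches) {
            return classVal; // @class trumps element name
        }
        // Fall back to the element mapping; an empty map disables it
        return elementToStyle.get(elementName);
    }
}
```

For example, with the mapping h1 to "Heading1" and a known style "Title", an `<h1 class="Title">` resolves to "Title" under CLASS_PLUS_OTHER but to "Heading1" under IGNORE_CLASS.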

willi.firulais wrote: The disadvantage of this workaround is that all CSS style information inherited from, e.g., the HTML page is applied to the paragraph. If a style is defined in Word for headings (e.g. because a Word template is loaded before the transformation), it would be great, as an enhancement request, if the Word style superseded the CSS style.


Since the approach I describe above would happen after XHTML renderer has parsed the xhtml + css, it wouldn't alter the css computed by XHTML renderer.

Re: XHTMLImporter and headings

Posted: Mon Mar 17, 2014 11:25 pm
by willi.firulais
Hello,

It's great to hear that from you. It sounds really great that there will be a mapping of element names to Word styles.

The CLASS_TO_STYLE_ONLY, CLASS_PLUS_OTHER and IGNORE_CLASS options should be settable per mapping entry.
This mapping should only be used if someone wants to customize the default behaviour (as you have described).

E.g. (some kind of pseudo-JSON, to express what I have in mind):

{
mapping: [
{class: "Standard", style: "Standard", FormattingOption: CLASS_PLUS_OTHER},
{class: "Head1", style: "berschrift1", FormattingOption: CLASS_TO_STYLE_ONLY}
]
}

Thx, Willi

Re: XHTMLImporter and headings

Posted: Wed Mar 19, 2014 8:25 am
by jason
Hi, what's the rationale / use case for making FormattingOption settable per mapping entry?

Certainly it can be done, but unless there's a good reason, it may be better to keep it simple...

At present, FormattingOption can be set independently for paragraph, run and table level styles.

Re: XHTMLImporter and headings

Posted: Mon Aug 04, 2014 8:07 pm
by jason
There is support for mapping e.g. h1 to the "Heading 1" style in the 3.2.0 beta.

To enable it, you'll need a properties file with the content:

https://github.com/plutext/docx4j-Impor ... properties

Re: XHTMLImporter and headings

Posted: Fri Jun 09, 2017 12:48 am
by csekol
Hi,

I tried this feature, but it didn't work for me.

I tested with this simple xhtml:
Code: Select all
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
        PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>Heading</title>
</head>
<body>
<h1>level 1</h1>
</body>
</html>


and with this content in docx4j-ImportXHTML.properties:
Code: Select all
docx4j-ImportXHTML.Element.Heading.MapToStyle=true
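
When a setting like this appears to have no effect, one thing worth ruling out is that the file is not being found or the key is mistyped. The sketch below is a plain-JDK diagnostic, not docx4j's own loading code. The class name is hypothetical, the classpath-root location is an assumption, and treating a missing key as false is this sketch's choice rather than a statement about docx4j's default.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical diagnostic helper, not part of docx4j.
final class HeadingMappingFlag {

    static final String FILE = "docx4j-ImportXHTML.properties";
    static final String KEY = "docx4j-ImportXHTML.Element.Heading.MapToStyle";

    /** Parses properties text and reports whether the heading flag is set to "true". */
    static boolean isEnabled(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A missing key is treated as false here (sketch's choice)
        return Boolean.parseBoolean(props.getProperty(KEY, "false").trim());
    }

    /** Returns true if the file is visible at the classpath root (assumed location). */
    static boolean isOnClasspath() {
        try (InputStream in = HeadingMappingFlag.class.getClassLoader().getResourceAsStream(FILE)) {
            return in != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If `isOnClasspath()` returns false in the same runtime that performs the import, the property cannot take effect regardless of its value.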


The result still did not have the heading style:
Code: Select all
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:ns10="http://schemas.openxmlformats.org/schemaLibrary/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:ns13="http://schemas.openxmlformats.org/drawingml/2006/chartDrawing" xmlns:dgm="http://schemas.openxmlformats.org/drawingml/2006/diagram" xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:dsp="http://schemas.microsoft.com/office/drawing/2008/diagram" xmlns:ns18="urn:schemas-microsoft-com:office:excel" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:ns22="urn:schemas-microsoft-com:office:powerpoint" xmlns:ns24="http://schemas.microsoft.com/office/2006/coverPageProps" xmlns:odx="http://opendope.org/xpaths" xmlns:odc="http://opendope.org/conditions" xmlns:odq="http://opendope.org/questions" xmlns:oda="http://opendope.org/answers" xmlns:odi="http://opendope.org/components" xmlns:odgm="http://opendope.org/SmartArt/DataHierarchy" xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns:ns32="http://schemas.openxmlformats.org/drawingml/2006/compatibility" 
xmlns:ns33="http://schemas.openxmlformats.org/drawingml/2006/lockedCanvas">
    <w:body>
        <w:p>
            <w:pPr>
                <w:spacing w:after="0"/>
                <w:ind w:left="0"/>
                <w:jc w:val="left"/>
            </w:pPr>
            <w:r>
                <w:rPr>
                    <w:rFonts w:hAnsi="Times New Roman" w:ascii="Times New Roman"/>
                    <w:b/>
                    <w:i w:val="false"/>
                    <w:color w:val="000000"/>
                </w:rPr>
                <w:t>level 1</w:t>
            </w:r>
        </w:p>
        <w:sectPr>
            <w:headerReference w:type="default" r:id="rId4"/>
            <w:footerReference w:type="default" r:id="rId5"/>
            <w:pgSz w:code="9" w:h="16839" w:w="11907"/>
            <w:pgMar w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
        </w:sectPr>
    </w:body>
</w:document>


After some investigation I found that isHeading() and handleHeadingElement() in XHTMLImporterImpl use getLocalName(), which returns null here, whereas getTagName() returns the tag name.

Is getLocalName() OK here? Shouldn't it be getTagName() instead?
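
The difference between the two methods can be reproduced with the JDK's DOM parser alone: an element built without namespace support reports null from getLocalName(), while getTagName() always returns the name. Whether this is what happens inside XHTMLImporterImpl depends on how its DOM is built, which is not shown in this thread, so treat the sketch as an illustration of the DOM behaviour only. `LocalNameDemo` and `nameOf` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of docx4j.
final class LocalNameDemo {

    /** Parses the XML and returns its root element. */
    static Element parseRoot(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware); // the JDK default is false
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Name lookup that also works for elements built without namespace support. */
    static String nameOf(Element e) {
        String local = e.getLocalName();
        return local != null ? local : e.getTagName();
    }
}
```

A fallback like `nameOf` works with either kind of DOM, as long as the elements carry no namespace prefix (with a prefix, getTagName() would return the prefixed name).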

Thanks,
László