
Docx to html wrong OL style output

Posted: Tue Dec 18, 2018 7:09 am
by CarlosCB1986
When converting docx to HTML, an ordered list numbered with upper-case Roman numerals (I., II.) is converted to a bulleted list (UL) in the HTML.

Adding the line below gives me an OL element, but with the wrong format: the list is numbered with Arabic numerals (1., 2.).
SdtWriter.registerTagHandler("HTML_ELEMENT", new SdtToListSdtTagHandler());

If I also add the line below, the list looks almost right: the items are numbered with upper-case Roman numerals, but the font is wrong and the items are plain paragraphs rather than members of an OL:
htmlSettings.getFeatures().remove(ConversionFeatures.PP_HTML_COLLECT_LISTS);

Neither of these options meets my needs. Is this the expected behavior, or is there a way to convert a numbered list from docx to HTML as a real OL element that keeps the original numbering style? If this is not the expected behavior, I can provide sample docx files and the code I used.

Re: Docx to html wrong OL style output

Posted: Tue Dec 18, 2018 8:15 pm
by jason
Please post a short xhtml test case.

Re: Docx to html wrong OL style output

Posted: Tue Dec 18, 2018 10:09 pm
by CarlosCB1986
The input docx and the output HTML are attached as a zip file.

Here is the code I used to export the docx:
public String exportToHtml(WordprocessingMLPackage wordMLPackage) throws Docx4JException {
    HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
    htmlSettings.setWmlPackage(wordMLPackage);

    // Required to properly handle UL and OL elements
    // Ref: docx-java-f6/can-tohtml-emit-nested-list-tags-t2329.html
    SdtWriter.registerTagHandler("HTML_ELEMENT", new SdtToListSdtTagHandler());
    // Ref: docx-java-f6/html-not-containing-the-list-styles-t2547.html
    // htmlSettings.getFeatures().remove(ConversionFeatures.PP_HTML_COLLECT_LISTS);

    htmlSettings.setImageHandler(new ServiceImageHandler());

    ByteArrayOutputStream os = new ByteArrayOutputStream();
    Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
    return new String(os.toByteArray());
}

As you can see, the OL element is not styled correctly. To preserve the numbering style, the generated HTML should look something like this: <ol style="list-style-type: upper-roman;"><li>First</li><li>Second</li></ol>

Re: Docx to html wrong OL style output

Posted: Fri Dec 21, 2018 7:22 am
by jason
Have a look at the sample code at https://github.com/plutext/docx4j/blob/ ... .java#L132:

        // list numbering:  depending on whether you want list numbering hardcoded, or done using <li>.
        if (nestLists) {
                SdtWriter.registerTagHandler("HTML_ELEMENT", new SdtToListSdtTagHandler());
        } else {
                htmlSettings.getFeatures().remove(ConversionFeatures.PP_HTML_COLLECT_LISTS);
        }
 

Where nestLists=true, there is a two-step process:

Step 1: pre-process the docx to detect paragraphs which form a "list" (i.e. according to their w:numPr) and group these in a content control (see ListsToContentControls).

Step 2: convert the content control to <ol> or <ul> as appropriate. Step 1 writes a w:tag on the content control, which is used to determine which.
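
As a rough illustration of the two steps above, here is a toy model in plain Java. It does not use docx4j at all: Para, Block, groupLists and toHtml are hypothetical names invented for this sketch, and the real ListsToContentControls works on the docx object model rather than on simple strings. It only shows the idea of grouping consecutive numbered paragraphs under a tag, then choosing <ol> or <ul> from that tag.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: Para and Block are hypothetical stand-ins, not docx4j classes.
final class ListGroupingSketch {

    /** A paragraph; numId is null when the paragraph has no w:numPr. */
    static final class Para {
        final String text;
        final Integer numId;
        final boolean bullet;

        Para(String text, Integer numId, boolean bullet) {
            this.text = text;
            this.numId = numId;
            this.bullet = bullet;
        }
    }

    /** Either a plain paragraph (tag == null) or a tagged run of list items. */
    static final class Block {
        final String tag;
        final List<String> texts = new ArrayList<>();

        Block(String tag) {
            this.tag = tag;
        }
    }

    /** Step 1: wrap each run of consecutive paragraphs sharing a numId in one tagged block. */
    static List<Block> groupLists(List<Para> paras) {
        List<Block> blocks = new ArrayList<>();
        Block open = null;
        Integer openNumId = null;
        for (Para p : paras) {
            if (p.numId == null) {
                // A non-list paragraph closes any open list.
                open = null;
                openNumId = null;
                Block plain = new Block(null);
                plain.texts.add(p.text);
                blocks.add(plain);
                continue;
            }
            if (open == null || !p.numId.equals(openNumId)) {
                // Start a new list and record which HTML element it should become.
                open = new Block(p.bullet ? "HTML_ELEMENT=UL" : "HTML_ELEMENT=OL");
                openNumId = p.numId;
                blocks.add(open);
            }
            open.texts.add(p.text);
        }
        return blocks;
    }

    /** Step 2: use the tag to decide between <ol> and <ul>. No HTML escaping is done here. */
    static String toHtml(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (b.tag == null) {
                sb.append("<p>").append(b.texts.get(0)).append("</p>");
                continue;
            }
            String el = "HTML_ELEMENT=UL".equals(b.tag) ? "ul" : "ol";
            sb.append('<').append(el).append('>');
            for (String t : b.texts) {
                sb.append("<li>").append(t).append("</li>");
            }
            sb.append("</").append(el).append('>');
        }
        return sb.toString();
    }
}
```

Note that the tag in this model carries no information about the numbering format, which is why the resulting <ol> falls back to the browser default of 1. 2. 3.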

To get the result you want, the list needs the appropriate CSS (list-style-type) or, in HTML 5, the type attribute.

Currently ListsToContentControls sets w:tag to either HTML_ELEMENT=OL or HTML_ELEMENT=UL. If it instead set the tag to, for example, HTML_ELEMENT=OL&numFmt=upperRoman (taken from Word's numbering), then step 2 could apply numFmt=upperRoman appropriately.
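
To make that suggestion concrete, here is a minimal sketch of what step 2 might do with such an extended tag. It is plain Java with no docx4j dependency: ListTagSketch, parseTag, cssListStyleType and openingTag are hypothetical names, and the numFmt key is the proposed extension described above, not something docx4j currently writes. The mapping covers only a handful of Word numFmt values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of docx4j: shows how an extended w:tag could drive the CSS.
final class ListTagSketch {

    /** Splits a tag such as "HTML_ELEMENT=OL&numFmt=upperRoman" into key/value pairs. */
    static Map<String, String> parseTag(String tag) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : tag.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                out.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return out;
    }

    /** Maps a Word numFmt value to a CSS list-style-type; returns null when there is no mapping. */
    static String cssListStyleType(String numFmt) {
        if (numFmt == null) {
            return null;
        }
        switch (numFmt) {
            case "decimal":     return "decimal";
            case "upperRoman":  return "upper-roman";
            case "lowerRoman":  return "lower-roman";
            case "upperLetter": return "upper-alpha";
            case "lowerLetter": return "lower-alpha";
            default:            return null;
        }
    }

    /** Builds the opening list tag, adding a style attribute only when numFmt is known. */
    static String openingTag(String tag) {
        Map<String, String> kv = parseTag(tag);
        String el = "UL".equals(kv.get("HTML_ELEMENT")) ? "ul" : "ol";
        String css = cssListStyleType(kv.get("numFmt"));
        if (css == null) {
            return "<" + el + ">";
        }
        return "<" + el + " style=\"list-style-type: " + css + ";\">";
    }
}
```

With this sketch, openingTag("HTML_ELEMENT=OL&numFmt=upperRoman") yields <ol style="list-style-type: upper-roman;">, while a tag without numFmt still yields a bare <ol>, so existing tags would keep working.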

For completeness: where nestLists=false, a Word paragraph with numbering or bullets is converted to plain static HTML (i.e. the number is converted to text). This is by design. nestLists=false is less suited to HTML editing scenarios.