Can toHTML emit nested list tags?

Posted: Fri Jan 15, 2016 7:46 am
by pmonks
G'day Word document lovers!

I'm using org.docx4j.Docx4J.toHTML() to convert a .docx file to HTML, but have become a tad saddened by how lists in the source .docx file are converted, especially nested lists. They come in as a single "flat" list, with inline styling that attempts to use indentation to reconstruct the nesting (*yuck!*). HTML directly supports nested lists (via nesting of <ol> and/or <ul> tags), and I'm a little surprised that the toHTML method doesn't utilise those. I am well aware that there are corner cases in the .docx format that don't directly map to HTML's model (e.g. nested lists where the very first element is at a deep nesting level - HTML can't represent that), but I'm willing to live with that.

My questions are:
1. Are there any conversion options or anything that will result in nested <ol> and/or <ul> tags? I've looked and experimented a bit, but haven't found anything yet.
2. If this isn't directly supported by org.docx4j.Docx4J.toHTML(), what's the best way for me to roll this myself? I'd prefer to use Java, but if XSLT is more suitable I'd consider that too.
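
For the roll-your-own route in question 2, here is a minimal Java sketch of the core transformation, independent of docx4j. It assumes each list paragraph has already been reduced to a 0-based nesting level plus its text; the `Item` class and `toNestedHtml` method are hypothetical names for illustration, not docx4j API. It always emits <ul>; choosing <ol> would need the numbering format from the document. A skipped level (such as a list whose first item is deeply nested) is handled by wrapping it in an otherwise empty <li>, which is only one possible choice.

```java
import java.util.Arrays;
import java.util.List;

class NestedListSketch {

    /** Hypothetical stand-in for one list paragraph: 0-based nesting level plus its text. */
    static final class Item {
        final int level;
        final String text;
        Item(int level, String text) { this.level = level; this.text = text; }
    }

    /** Minimal escaping so item text cannot break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toNestedHtml(List<Item> items) {
        StringBuilder html = new StringBuilder();
        int depth = -1; // level of the innermost open <ul>; -1 means no list is open
        for (Item item : items) {
            if (item.level > depth) {
                // Going deeper: open one <ul> per level; a skipped level gets a wrapper <li>.
                while (depth < item.level) {
                    html.append("<ul>");
                    depth++;
                    if (depth < item.level) html.append("<li>");
                }
            } else {
                // Same level or shallower: close deeper lists, then the previous sibling.
                while (depth > item.level) {
                    html.append("</li></ul>");
                    depth--;
                }
                html.append("</li>");
            }
            html.append("<li>").append(escape(item.text));
        }
        // Close whatever is still open.
        while (depth >= 0) {
            html.append("</li></ul>");
            depth--;
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item(0, "1.1"), new Item(1, "1.1.1"), new Item(0, "1.2"));
        System.out.println(toNestedHtml(items));
        // prints <ul><li>1.1<ul><li>1.1.1</li></ul></li><li>1.2</li></ul>
    }
}
```

Nested lists are placed inside the preceding <li>, which is what HTML's content model for <ul> and <ol> expects.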

Thanks in advance!
Peter

Re: Can toHTML emit nested list tags?

Posted: Sat Jan 16, 2016 4:36 pm
by jason
See org.docx4j.convert.out.html.ListsToContentControls:

Code:
/**
* Create list items in OL or UL (as appropriate).
*
* We can't just use a LinkedList (stack) of list contexts,
* which we push and pop, since we have to write complete
* XML elements (as opposed to opening and closing tags).
*
* So this means either extending org.docx4j.model.structure.jaxb
* beyond sections, or some other approach, like wrapping
* list items in a content control.  Let's try that.
*
* That's like org.docx4j.convert.out.common.preprocess.Containerization
*
* So we have a 2 step process:
*
* 1.  insert the content controls
*
* 2.  use an SdtWriter to turn these into UL or OL.
*
* This class does step 1. 
*
* Step 2 is implemented by SdtToListSdtTagHandler;  it will only be used if you invoke
* SdtWriter.registerTagHandler("HTML_ELEMENT", new SdtToListSdtTagHandler())
*
* @author jharrop
*
*/
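
The two-step idea in that Javadoc can be sketched in plain Java without docx4j. This is only an illustration under stated assumptions: `ListContainer` is a hypothetical stand-in for a content control tagged with the list type, `collect` plays the role of step 1 (wrapping list items in containers), and `render` plays the role of step 2 (writing each container as one complete element). Item text is assumed to be HTML-escaped already.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TwoStepListSketch {

    /** Hypothetical stand-in for a content control wrapping one list level. */
    static final class ListContainer {
        final String tag; // "ul" or "ol"
        final List<Object> children = new ArrayList<>(); // String item text, or a nested ListContainer
        ListContainer(String tag) { this.tag = tag; }
    }

    /** Step 1: wrap a flat run of (level, text) items in nested containers. */
    static ListContainer collect(int[] levels, String[] texts, String tag) {
        ListContainer root = new ListContainer(tag);
        Deque<ListContainer> open = new ArrayDeque<>();
        open.push(root); // invariant: open.size() - 1 is the current level
        for (int i = 0; i < levels.length; i++) {
            while (open.size() - 1 > levels[i]) {
                open.pop(); // leaving deeper levels
            }
            while (open.size() - 1 < levels[i]) {
                ListContainer child = new ListContainer(tag); // entering a deeper level
                open.peek().children.add(child);
                open.push(child);
            }
            open.peek().children.add(texts[i]);
        }
        return root;
    }

    /** Step 2: each container is written as one complete element. */
    static String render(ListContainer c) {
        StringBuilder sb = new StringBuilder("<" + c.tag + ">");
        boolean liOpen = false;
        for (Object child : c.children) {
            if (child instanceof String) {
                if (liOpen) sb.append("</li>");
                sb.append("<li>").append(child);
                liOpen = true;
            } else {
                // A nested list belongs inside the preceding item (or a wrapper item if none).
                if (!liOpen) {
                    sb.append("<li>");
                    liOpen = true;
                }
                sb.append(render((ListContainer) child));
            }
        }
        if (liOpen) sb.append("</li>");
        return sb.append("</").append(c.tag).append(">").toString();
    }
}
```

Because the tree exists before any output is written, the writer never has to emit an opening tag in one callback and its closing tag in another.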


You'll need to ensure that the feature org.docx4j.convert.out.ConversionFeatures.PP_HTML_COLLECT_LISTS ("pp.html.collectlists") is turned on. Its documentation says:

In HTML the conversion process can create lists (OL, UL). This step prepares for that, by inserting content controls around list items.


But post-3.2.1, that's the default:

Code:
DEFAULT_HTML_FEATURES = {
      PP_COMMON_DEEP_COPY,
      PP_COMMON_MOVE_BOOKMARKS,
      PP_COMMON_MOVE_PAGEBREAK,
      PP_COMMON_CONTAINERIZATION,
      PP_HTML_COLLECT_LISTS, // post 3.2.1; implemented via XSLT only; requires SdtToListSdtTagHandler to be configured
      PP_COMMON_COMBINE_FIELDS,
      PP_COMMON_DUMMY_PAGE_NUMBERING,
      PP_COMMON_DUMMY_CREATE_SECTIONS,
      PP_COMMON_TABLE_PARAGRAPH_STYLE_FIX // 3.0.2
   };


So probably you just need:

Code:

SdtWriter.registerTagHandler("HTML_ELEMENT", new SdtToListSdtTagHandler());


The result is that content controls (with tags such as HTML_ELEMENT=OL) are inserted to represent the lists, and these are then converted to, e.g., an <ol> element.

There's also a unit test, ListsToContentControlsTest.

Re: Can toHTML emit nested list tags?

Posted: Sun Jan 17, 2016 1:35 pm
by pmonks
Thanks Jason. The output from docx4j v3.2.2 with that tag handler configured is a lot better, although I'm still seeing some issues.

My source document has nested bulleted lists as follows:
Code:
* 1.1
    * 1.1.1
        * 1.1.1.1
        * 1.1.1.2
    * 1.1.2
    * 1.1.3
    * 1.1.4
* 1.2
    * 1.2.1
        * 1.2.1.1
        * 1.2.1.2
    * 1.2.2
* 1.3


The emitted HTML renders as follows:

Code:
* 1.1
    * 1.1.1
        * 1.1.2
            * 1.1.1.1
                * 1.1.1.2
                * 1.1.3
                * 1.1.4
            * 1.2
            * 1.2.1
                * 1.2.1.1
                    * 1.2.1.2
                    * 1.2.2
                * 1.3


Note that not only are the nesting levels out of whack, some of the entries in the list are out of order as well (specifically the list item with the text "1.1.1.1").
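
The mismatch can be made concrete with a small check. This sketch relies on two assumptions specific to the listings above: each nesting level is indented by four spaces, and each label encodes its intended depth ("1.1" is top level, "1.1.1" is one level down, and so on). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class OutlineCheckSketch {

    /** Depth implied by indentation, assuming four spaces per level. */
    static int indentDepth(String line) {
        int spaces = 0;
        while (spaces < line.length() && line.charAt(spaces) == ' ') {
            spaces++;
        }
        return spaces / 4;
    }

    /** Label after the "* " bullet marker, e.g. "1.1.2". */
    static String label(String line) {
        return line.trim().substring(2).trim();
    }

    /** Depth the label itself implies: "1.1" -> 0, "1.1.1" -> 1, "1.1.1.1" -> 2. */
    static int labelDepth(String label) {
        return label.split("\\.").length - 2;
    }

    /** Returns the labels whose indentation depth disagrees with their label depth. */
    static List<String> misnested(String outline) {
        List<String> bad = new ArrayList<>();
        for (String line : outline.split("\n")) {
            if (line.trim().isEmpty()) continue;
            String l = label(line);
            if (indentDepth(line) != labelDepth(l)) bad.add(l);
        }
        return bad;
    }

    public static void main(String[] args) {
        // First four lines of the emitted structure shown above.
        String emitted = String.join("\n",
                "* 1.1",
                "    * 1.1.1",
                "        * 1.1.2",
                "            * 1.1.1.1");
        System.out.println(misnested(emitted)); // prints [1.1.2, 1.1.1.1]
    }
}
```

Comparing the order of the labels in the two listings separately would catch the out-of-order entry as well.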

If it helps, the source .docx document was constructed from scratch, using a "blank document" and no template, in Microsoft Word for Mac 2016. I would be happy to provide it if needed.

Thanks in advance!
Peter

Re: Can toHTML emit nested list tags?

Posted: Mon Jan 18, 2016 6:04 pm
by jason
Sounds like there is something wrong in org.docx4j.convert.out.html.ListsToContentControls (i.e. step 1), which could be verified by saving the docx output of that step and inspecting it.

Would definitely need the sample docx to pursue this, although time is an issue for me at the moment, so you might want to look into it yourself...

Re: Can toHTML emit nested list tags?

Posted: Wed Jan 20, 2016 4:12 am
by pmonks
Thanks Jason. I'll raise a bug report and attach some test documents to it.

Re: Can toHTML emit nested list tags?

Posted: Sat Jan 30, 2016 3:55 am
by pmonks
I've raised a bug report here: https://github.com/plutext/docx4j/issues/175

Cheers,
Peter

Re: Can toHTML emit nested list tags?

Posted: Fri Feb 10, 2017 8:40 pm
by jason
Thanks for that, now fixed. Fix is in http://www.docx4java.org/docx4j/docx4j- ... 170210.jar