How to convert bullet lists in docx to html?

Posted: Wed Sep 17, 2014 8:19 am
by david.zhaowl
I want to convert some content in a docx into HTML. The content includes some indented text and some bulleted and numbered lists.

I used code from the DocxToXhtmlAndBack sample, but the formatting of the bullets and lists seems to be lost. Below is the XHTML I got:
Code:
<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type" /><style><!--/*paged media */ div.header {display: none }div.footer {display: none } /*@media print { */@page { size: A4; margin: 10%; @top-center {content: element(header) } @bottom-center {content: element(footer) } }/*element styles*/ .del  {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;}
/* TABLE STYLES */

/* PARAGRAPH STYLES */
.DocDefaults {display:block;margin-bottom: 4mm;line-height: 115%;font-size: 11.0pt;}
.Normal {display:block;}
.ListParagraph {display:block;position: relative; margin-left: 0.5in;}

/* CHARACTER STYLES */ span.DefaultParagraphFont {display:inline;}
--></style><script type="text/javascript"><!--function toggleDiv(divid){if(document.getElementById(divid).style.display == 'none'){document.getElementById(divid).style.display = 'block';}else{document.getElementById(divid).style.display = 'none';}}
--></script></head><body>
 
  <!-- userBodyTop goes here -->
 
 
 
  <div class="document">
 
  <p class="Normal DocDefaults " style="position: relative; margin-left: 0.51in;"><span class="DefaultParagraphFont " style="color: #000000;font-size: 8.0pt;">Indentation text</span></p>
 
  <p class="Normal DocDefaults " style="position: relative; margin-left: 17mm;"> </p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">Bullet1</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">Bullet2</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">Bullet3</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="" style="">Sub bullet1</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="" style="">Sub bullet2</span></p>
 
  <p class="ListParagraph Normal DocDefaults " style="position: relative; margin-left: 1.5in;"> </p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">num1</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">num2</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">num3</span></p>
 
  <p class="ListParagraph Normal DocDefaults "><span class="DefaultParagraphFont " style="color: #010101;font-size: 8.0pt;">num4</span></p></div>
 
 
 
 
 
 
 
  <!-- userBodyTail goes here -->
 
  </body></html>


The only formatting preserved is the indentation. Is there any way to keep the bullet and list formatting during the conversion, using pure HTML rather than CSS? I want to display the converted file in an online editor that handles HTML. Thanks.

Re: How to convert bullet lists in docx to html?

Posted: Wed Sep 17, 2014 9:20 am
by jason
Bullets should be preserved in docx to html.

Please post sample input docx. What version of docx4j are you using?

The XHTML output is what it is, though you are of course free to modify it to suit your purposes. In org.docx4j.convert.out, you'll see there are 2 different ways of generating XHTML.

One uses docx2xhtml-core.xslt, plus Xalan extension functions which use Java to do the hard stuff.

The other (the classes with Visitor in their name) is all Java (ie no XSLT). You'll probably find this second approach easier to modify.

By the way, we use the XHTML output in CKEditor, and save the result as a docx, preserving the original formatting, so this can be done!

Someone mentioned the other day that they use CKEditor's Word cleansing option to good effect. I don't do that myself.
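
As one illustration of modifying the output to suit your purposes, here is a minimal sketch (plain Java, no docx4j calls) that post-processes XHTML like the output posted above and wraps runs of consecutive ListParagraph paragraphs in <ul>/<li> elements. The class and method names are hypothetical. It assumes every non-empty ListParagraph is a bullet item (the posted output carries nothing that distinguishes bulleted from numbered lists), it does not handle nesting, and a regex is used only for brevity; a DOM-based rewrite would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j.
final class ListParagraphGrouper {

    // One <p> whose class attribute starts with "ListParagraph"; group 1 is its inner markup.
    private static final Pattern LIST_P = Pattern.compile(
            "<p class=\"ListParagraph[^\"]*\"[^>]*>(.*?)</p>", Pattern.DOTALL);

    // True if the string holds only whitespace or non-breaking spaces.
    private static boolean isBlank(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isWhitespace(c) && c != '\u00a0') {
                return false;
            }
        }
        return true;
    }

    // Wraps each run of adjacent, non-empty ListParagraph paragraphs in one <ul>.
    static String group(String xhtml) {
        StringBuilder out = new StringBuilder();
        Matcher m = LIST_P.matcher(xhtml);
        int pos = 0;
        boolean inList = false;
        while (m.find()) {
            String gap = xhtml.substring(pos, m.start());
            boolean spacer = isBlank(m.group(1)); // empty paragraph used as a separator
            // Other markup between items, or a spacer paragraph, ends the current list.
            if (inList && (spacer || !isBlank(gap))) {
                out.append("</ul>");
                inList = false;
            }
            out.append(gap);
            if (spacer) {
                out.append(m.group()); // keep the spacer paragraph unchanged
            } else {
                if (!inList) {
                    out.append("<ul>");
                    inList = true;
                }
                out.append("<li>").append(m.group(1)).append("</li>");
            }
            pos = m.end();
        }
        if (inList) {
            out.append("</ul>");
        }
        out.append(xhtml.substring(pos));
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<p class=\"Normal DocDefaults \">Intro</p>\n"
                + "<p class=\"ListParagraph Normal DocDefaults \"><span>Bullet1</span></p>\n"
                + "<p class=\"ListParagraph Normal DocDefaults \"><span>Bullet2</span></p>";
        System.out.println(group(sample));
    }
}
```

If the converter already emits the bullet glyphs or numbers as text, a step like this would duplicate them, so it only makes sense for output that looks like the sample above.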

Re: How to convert bullet lists in docx to html?

Posted: Thu Sep 18, 2014 1:43 am
by david.zhaowl
Thanks for your reply, Jason. I have attached the Word file; I was trying to convert the content of the description column in the table.

I have read the two CKEditor posts on StackOverflow. I'm feeding this data to some 3rd party software which uses some kind of rich text editor, but I don't know whether it's CKEditor. My requirement is that users can modify the content both in the Word file and in the online editor, with consistent formatting.

Re: How to convert bullet lists in docx to html?

Posted: Thu Sep 18, 2014 2:17 am
by david.zhaowl
I switched to v3.2.0 yesterday.

Here's my code:
Code:
private String convertTcToXhtml(Tc tc) throws Docx4JException, JAXBException {

        List<Object> paragraphs = getAllElementFromObject(tc, P.class);
        if (paragraphs == null || paragraphs.size() == 0) {
            return null;
        }

        StringBuilder sb = new StringBuilder();

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

        NumberingDefinitionsPart ndp = importWordPackage.getMainDocumentPart().getNumberingDefinitionsPart();
        wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
        ndp.unmarshalDefaultNumbering();

        wordMLPackage.getMainDocumentPart().getContent().addAll(paragraphs);

        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setWmlPackage(wordMLPackage);

        OutputStream os = new ByteArrayOutputStream();
       
        //Jason, is this how to set the way to generate XHTML?
        Docx4jProperties.setProperty("org.docx4j.convert.out.html.HTMLExporterVisitorGenerator", true);
       
        Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_BIND_REMOVE_XML);

        String xhtmlString = ((ByteArrayOutputStream) os).toString();

        LOGGER.info("converTcToXhml str: " + xhtmlString);

        sb.append(xhtmlString);
        return sb.toString();
    }


importWordPackage is the package for the whole Word file; there's a lot of content in it besides the table. I created a new package that contains only the "description" content, and then used that to generate the XHTML. I don't know how to specify which way to generate the XHTML. The result remains the same.

I looked at the log file and found some errors that may be related. The "ListParagraph" error appears multiple times:

Code:
...
2014-09-17_11:02:06.338 ERROR org.docx4j.model.PropertyResolver - Couldn't find style: ListParagraph
...
2014-09-17_11:02:06.615 WARN  org.docx4j.fonts.RunFontSelector - No mapping from null
2014-09-17_11:02:06.633 WARN  org.docx4j.fonts.RunFontSelector - No mapping from null
2014-09-17_11:02:06.639 WARN  o.d.model.listnumbering.Emulator - Couldn't find list 28
2014-09-17_11:02:06.641 ERROR org.docx4j.model.PropertyResolver - Couldn't find style: ListParagraph
2014-09-17_11:02:06.642 WARN  org.docx4j.fonts.RunFontSelector - No mapping from null
2014-09-17_11:02:06.644 ERROR org.docx4j.model.styles.StyleTree - Null node passed
...


Thanks for help.

Re: How to convert bullet lists in docx to html?

Posted: Thu Sep 18, 2014 2:33 pm
by jason
When I convert input.docx to html using docx4j's ConvertOutHtml sample, I see all the bullets and the numbering.

So, the only issue is creating a docx which contains just the contents of your tc.

I think you are close.

First, comment out ndp.unmarshalDefaultNumbering(); you don't need or want that. You want the existing numbering definitions in your ndp.

Since you are moving a part from one pkg to another, you need to ensure its contents have been fetched before you move it (via addTargetPart). ndp.getXml() would be sufficient.

Second, for the HTML bit, just see the ConvertOutHtml sample.
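
A minimal sketch of the first point, assuming docx4j 3.x on the classpath. The class and method names are hypothetical, and getJaxbElement() is used here only as one accessor assumed to force the part's contents to be loaded, in place of the getXml() call mentioned above. The returned value is not what gets added to the new package; the call just makes sure the part is populated before the part itself is moved.

```java
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart;

// Hypothetical helper illustrating "fetch the contents, then move the part".
final class MoveNumberingSketch {

    static void moveNumbering(WordprocessingMLPackage source,
                              WordprocessingMLPackage target) throws Docx4JException {
        NumberingDefinitionsPart ndp =
                source.getMainDocumentPart().getNumberingDefinitionsPart();

        // Assumption: touching the contents makes docx4j load them from the
        // source package, so they are still available after the move.
        ndp.getJaxbElement();

        // Attach the (now populated) part to the new package.
        // Note: no call to ndp.unmarshalDefaultNumbering() here.
        target.getMainDocumentPart().addTargetPart(ndp);
    }
}
```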

Re: How to convert bullet lists in docx to html?

Posted: Fri Sep 19, 2014 9:00 am
by david.zhaowl
Hi Jason, I'm unclear about two things and hope you could explain a little more:

1. Do I need to specify "userCSS" like the ConvertOutHtml sample does? Is the error "ERROR org.docx4j.model.PropertyResolver - Couldn't find style: ListParagraph" caused by an incorrect CSS setting?

2. Is the following line correct? I'm not sure it's working, because the result remains the same.
Code:
Docx4jProperties.setProperty("org.docx4j.convert.out.html.HTMLExporterVisitorGenerator", true);

Thanks.

Re: How to convert bullet lists in docx to html?

Posted: Fri Sep 19, 2014 9:49 am
by david.zhaowl
I removed "ndp.unmarshalDefaultNumbering()". The code is now:
Code:
        NumberingDefinitionsPart ndp = importWordPackage.getMainDocumentPart().getNumberingDefinitionsPart();
        wordMLPackage.getMainDocumentPart().addTargetPart(ndp);


Then I get this exception:
Code:
2014-09-18_18:42:47.802 ERROR org.docx4j.model.PropertyResolver - Couldn't find style: ListParagraph
2014-09-18_18:42:47.803 WARN  o.d.openpackaging.parts.JaxbXmlPart - No PartStore defined for this package (it was probably created, not loaded).
2014-09-18_18:42:47.803 WARN  o.d.openpackaging.parts.JaxbXmlPart - /word/numbering.xml: did you initialise its contents to something?
2014-09-18_18:42:47.808 ERROR o.d.c.o.c.preprocess.PartialDeepCopy - null
java.lang.NullPointerException: null
   at org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart.fontsInUse(MainDocumentPart.java:297) ~[docx4j-3.2.0.jar:na]
   at org.docx4j.openpackaging.packages.WordprocessingMLPackage.setFontMapper(WordprocessingMLPackage.java:319) ~[docx4j-3.2.0.jar:na]
   at org.docx4j.convert.out.common.preprocess.PartialDeepCopy.process(PartialDeepCopy.java:94) ~[docx4j-3.2.0.jar:na]
   at org.docx4j.convert.out.common.Preprocess.process(Preprocess.java:76) [docx4j-3.2.0.jar:na]
   at org.docx4j.convert.out.common.Preprocess.process(Preprocess.java:134) [docx4j-3.2.0.jar:na]
   at org.docx4j.convert.out.common.AbstractWmlExporter.preprocess(AbstractWmlExporter.java:51) [docx4j-3.2.0.jar:na]
   at org.docx4j.convert.out.common.AbstractWmlExporter.preprocess(AbstractWmlExporter.java:32) [docx4j-3.2.0.jar:na]
   at org.docx4j.convert.out.common.AbstractExporter.export(AbstractExporter.java:63) [docx4j-3.2.0.jar:na]
   at org.docx4j.Docx4J.toHTML(Docx4J.java:505) [docx4j-3.2.0.jar:na]
   at com.ciena.prism.almtools.wiet.wordprocessing.WordProcessor.convertTcToXhtml(WordProcessor.java:1488) [classes/:na]


Do I need to deep-copy all the objects in the NumberingDefinitionsPart?

Re: How to convert bullet lists in docx to html?

Posted: Fri Sep 19, 2014 4:08 pm
by jason
Please see my previous post, viz.:

jason wrote: Since you are moving a part from one pkg to another, you need to ensure its contents have been fetched before you move it (via addTargetPart). ndp.getXml() would be sufficient.


Regarding "Couldn't find style: ListParagraph", you'll need to copy that style to your new docx.

A better approach than copying the NDP, and modifying the styles part, would be to clone the pkg (there's a method to do that), then delete all its contents (ie in its main document part), then add the contents from your tc.

That way, styles and numbering definitions will be in place, plus any rels your tc contents might have (eg images, comments)
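
The clone-then-replace approach could be sketched roughly as follows, assuming docx4j 3.x. The class and method names are hypothetical; clone() on the package is assumed to be the method referred to above, and the cast assumes the source is a WordprocessingMLPackage. The Docx4J.toHTML call and the FLAG_EXPORT_PREFER_XSL flag are the ones used elsewhere in this thread. Headers and footers are not touched here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.Tc;

// Hypothetical helper: convert the contents of one table cell to XHTML.
final class TcToXhtmlSketch {

    static String convert(WordprocessingMLPackage source, Tc tc) throws Exception {
        // Clone the whole package so styles, numbering definitions and
        // relationships (images, comments, ...) are already in place.
        WordprocessingMLPackage copy = (WordprocessingMLPackage) source.clone();

        // Drop the cloned body content, then add the cell's block-level content.
        copy.getMainDocumentPart().getContent().clear();
        copy.getMainDocumentPart().getContent().addAll(tc.getContent());

        HTMLSettings settings = Docx4J.createHTMLSettings();
        settings.setWmlPackage(copy);

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);

        // The generated XHTML declares utf-8, so decode it explicitly.
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because nothing is copied part by part, there is no need to move the NumberingDefinitionsPart or the ListParagraph style separately.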

Where did you get the idea for property "org.docx4j.convert.out.html.HTMLExporterVisitorGenerator" from? I don't know of it.

Re: How to convert bullet lists in docx to html?

Posted: Sat Sep 20, 2014 1:12 am
by david.zhaowl
I got the idea from here.

The XHTML output is what it is, though you are of course free to modify it to suit your purposes. In org.docx4j.convert.out, you'll see there are 2 different ways of generating XHTML.

One uses docx2xhtml-core.xslt, plus Xalan extension functions which use Java to do the hard stuff.

The other (the classes with Visitor in their name) is all Java (ie no XSLT). You'll probably find this second approach easier to modify.


ndp.getXml returns a string, so it can't be added through addTargetPart().

I'll try cloning the package and see if that works. Thanks for the reply.

Re: How to convert bullet lists in docx to html?

Posted: Sat Sep 20, 2014 1:35 am
by david.zhaowl
OK. You're talking about:
Code:
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_NONXSL);

Re: How to convert bullet lists in docx to html?

Posted: Sat Sep 20, 2014 8:47 am
by david.zhaowl
Thanks for your help, Jason. It works. I deleted all the content, header and footer in the previous WordMLPackage, and copied the tc content into it.

However, I noticed that when using
Code:
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_NONXSL);

bullets and lists aren't displayed correctly: in the generated HTML, the <p> tags don't enclose the <span> tags, so the text of each bullet and list item ends up in a separate paragraph. This could be a bug.
In the end I used:
Code:
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);