
Convert docx to XHTML introduces invalid characters

Posted: Mon Jun 29, 2015 2:09 am
by david.zhaowl
Hi Jason,

I'm trying to convert part of a docx document into XHTML. This has been working fine, but sometimes, when I start the application on a different Linux machine with the same JDK version, the conversion introduces invalid characters in place of the bullets. I attached a photo showing the error. Here's my code:
Code:
    /*
     * Convert the description in table cell back into html code to be saved into database
     */
    private String convertTcToXhtml(Tc tc, WordprocessingMLPackage emptyImportMLPackage)
            throws Docx4JException {

        List<Object> paragraphs = getAllElementFromObject(tc, P.class);
        if (paragraphs == null || paragraphs.size() == 0) {
            return null;
        }

        StringBuilder sb = new StringBuilder();

        /* clear all content of the clone wordMLPackage */
        emptyImportMLPackage.getMainDocumentPart().getContent()
                .removeAll(emptyImportMLPackage.getMainDocumentPart().getContent());
        removeHeaderFooterFromWord(emptyImportMLPackage);

        /* Add content of the description to new wordMLPackage. */
        emptyImportMLPackage.getMainDocumentPart().getContent().addAll(paragraphs);

        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setWmlPackage(emptyImportMLPackage);

        OutputStream os = new ByteArrayOutputStream();

        Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);

        Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);

        String xhtmlString = ((ByteArrayOutputStream) os).toString();

        sb.append(xhtmlString);
        /* delete useless header */
        sb.delete(0, sb.indexOf("<style>"));
        sb.insert(0, "<html><head>");
        LOGGER.debug("clean converTcToXhml : " + sb.toString());

        /* replace java bullet dot with html bullet dot. */
        String descString = sb.toString().replaceAll("\u2022", "•");

        return descString;
    }


I have this line
Code:
String descString = sb.toString().replaceAll("\u2022", "•");
to replace the bullet character with the HTML bullet code; otherwise the bullet shows up as a double quote in the editor. The editor's syntax is proprietary to the vendor and not available to me. I'm not sure whether you have come across this problem before.
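
The replacement step described above could also be generalized so that any non-ASCII character, not only the bullet, is written as a numeric character reference before the markup is handed to the editor. This is only a minimal sketch: the class name `NonAsciiEscaper` and the method `escapeNonAscii` are hypothetical, and whether the vendor's editor accepts numeric references is an assumption. As a side note, `String.replaceAll` treats its first argument as a regular expression; for a literal single-character swap, `String.replace` is enough.

```java
public class NonAsciiEscaper {

    /**
     * Returns the input with every code point above 127 rewritten as a
     * decimal numeric character reference (for example U+2022 -> &#8226;).
     * ASCII characters are copied through unchanged.
     */
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "&#8226; Test Objective"
        System.out.println(escapeNonAscii("\u2022 Test Objective"));
    }
}
```

Note that this only helps if the string still contains the real bullet character at this point; it cannot recover characters that were already lost earlier in the pipeline.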

I'm running JDK 1.8.0_20; docx4j and docx4j-ImportXHTML are both version 3.2.0.

Thanks in advance for any help.

I found that the invalid characters are actually already present right after this line:
Code:
String xhtmlString = ((ByteArrayOutputStream) os).toString();
Here is the log output:
2015-06-28_12:36:13.816 INFO c.c.p.a.w.w.WordProcessor - clean converTcToXhml : <html><head><style><!--/*paged media */ div.header {display: none }div.footer {display: none } /*@media print { */@page { size: A4; margin: 10%; @top-center {content: element(header) } @bottom-center {content: element(footer) } }/*element styles*/ .del {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;}
/* TABLE STYLES */

/* PARAGRAPH STYLES */
.DocDefaults {display:block;margin-top: 0in;margin-bottom: 0in;line-height: 100%;font-size: 10.0pt;}
.Normal {display:block;font-size: 12.0pt;}
.ListBullet2 {display:block;font-size: 11.0pt;}
.ListBullet {display:block;font-size: 11.0pt;}

/* CHARACTER STYLES */ span.DefaultParagraphFont {display:inline;}
--></style><script type="text/javascript"><!--function toggleDiv(divid){if(document.getElementById(divid).style.display == 'none'){document.getElementById(divid).style.display = 'block';}else{document.getElementById(divid).style.display = 'none';}}
--></script></head><body>

<!-- userBodyTop goes here -->



<div class="document">

<p class="Normal DocDefaults "><span class="DefaultParagraphFont " style="font-size: 8.0pt;">Test Case Description</span></p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.25in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Test Objective</span></p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">To verify that the insertion and removals of the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> to and from t</span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">he 6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> slots are as expected.</span></span></p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Also to verify that the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack is properly keyed so that it cannot be inserted into the </span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> slots upside down or into slots where allocated for SPs in 14 and 32-slot shelves.</span></span></p>

<p class="Normal DocDefaults "><span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;"> </span></p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.25in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Test Configuration</span></p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;"> </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">Circuit Pack is available for inspection and removals</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">.</span></p>

<p class="Normal DocDefaults ">??</p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.25in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Test Procedure </span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Turn the power the targeted </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> shelf off so that no power related issues occur during this test.</span></span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Attempt to insert the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack into any empty supported slot with neighboring circuit packs already inserted. Verify that:</span></span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">The circuit pack slide smoothly into the slot without rubbing against the neighboring circuit packs or the shelf wall (if it is slot 1).</span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">That excessive force is not required to seat the card and close the latches.</span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Attempt to remove the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack from the slot. Verify that:</span></span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Verify that the latches can be opened without excessive force and that the circuit pack can be removed without rubbing against other circuit packs or the shelf wall (if it is slot 1).</span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Verify that no physical damage has occurred to the backplane of the shelf or to the backplane connector of the circuit pack.</span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Attempt to remove the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack from the slot. Verify that:</span></span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Verify that the latches can be opened without excessive force and that the circuit pack can be removed without rubbing against other circuit packs or the shelf wall (if it is slot 1).</span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Verify that no physical damage has occurred to the backplane of the shelf or to the backplane connector of the circuit pack.</span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Repeat the above steps for all of the supported </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">slots on the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;"> </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">shelf (If it is not done in every slot, please record the tested slots and record it in the Quality Center results).</span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Choose one of the supported slots on the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> shelf and attempt to insert the </span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack upside down. </span></span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Verify that the keying on the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack and </span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> shelf slot does not allow the circuit pack to be seated upside down.</span></span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 1in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Verify that no physical damage has occurred to the backplane of the shelf or to the backplane connector of the circuit pack due to the attempted insertion.</span></p>

<p class="ListBullet Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Inspect the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circu</span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">it Pack and verify that it</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="">???s slot-keyed to prevent being seated into slots 15 and 16</span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> (41 &amp; 42 for 32 sl</span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">ot shelf</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">)</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">. Attempt to insert the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack into an empty slot which is designated for the SP card (15 or 16). Try to insert the card both right-side up and upside down.</span></span></p>

<p class="ListBullet2 Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Verify that the keying on the </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> Circuit Pack and </span></span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">6500</span><span class="DefaultParagraphFont " style="font-size: 8.0pt;"><span class="" style="white-space:pre-wrap;"> shelf slot does not allow the circuit pack to be seated in the SP slot. </span></span></p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.5in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Verify that no physical damage has occurred to the backplane of the shelf or to the backplane connector of the circuit pack due to the attempted insertion</span></p>

<p class="Normal DocDefaults ">??</p>

<p class="Normal DocDefaults " style="position: relative; margin-left: 0.25in;text-indent: -0.25in;">??? <span class="DefaultParagraphFont " style="font-size: 8.0pt;">Expected Results</span></p>

<p class="Normal DocDefaults "><span class="DefaultParagraphFont " style="font-size: 8.0pt;;white-space:pre-wrap;">Verify </span><span class="DefaultParagraphFont " style="font-size: 8.0pt;">STM-0J<span class="" style="white-space:pre-wrap;"> CP insert</span>ion and removal is as expected.</span></p>

<p class="Normal DocDefaults ">??</p></div>

<div class="footnotes">

<p class="Normal DocDefaults ">??</p></div>





<!-- userBodyTail goes here -->

</body></html>

In the log, the "???" and "??" sequences are the invalid characters. Any clue how that could happen?
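
One possible explanation, offered as an assumption rather than a confirmed diagnosis: `ByteArrayOutputStream.toString()` without an argument decodes the bytes using the JVM's default charset, and that default can differ between machines (for example, depending on the locale settings). If the converter wrote UTF-8 and the default charset on the failing machine is US-ASCII, each byte of a multi-byte character would decode to a replacement character. That would match the pattern in the log: three characters for the 3-byte bullet U+2022 and two for a 2-byte character such as the non-breaking space U+00A0. The sketch below reproduces the effect without docx4j; the class name `CharsetDecodeDemo` and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeDemo {

    /** Stand-in for the converter output: the text written to the stream as UTF-8 bytes (assumption). */
    static ByteArrayOutputStream utf8Stream(String text) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        os.write(bytes, 0, bytes.length);
        return os;
    }

    /** Decodes the stream content with an explicit charset instead of the platform default. */
    static String decode(ByteArrayOutputStream os, Charset charset) {
        return new String(os.toByteArray(), charset);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = utf8Stream("\u2022 Test Objective");

        // The no-argument os.toString() uses this charset.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Decoding with the charset the bytes were written in keeps the bullet.
        System.out.println("as UTF-8:    " + decode(os, StandardCharsets.UTF_8));

        // Decoding the same bytes as US-ASCII turns each of the bullet's
        // three bytes into the replacement character U+FFFD.
        System.out.println("as US-ASCII: " + decode(os, StandardCharsets.US_ASCII));
    }
}
```

If this is the cause, passing an explicit charset, for example `os.toString("UTF-8")` instead of the no-argument `toString()`, might make the result independent of the machine's locale. I have not verified which encoding docx4j 3.2.0 actually writes here, so the charset name would need to be checked against the real output.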

I've uploaded the docx file I was trying to import; the relevant part is the "Description" content in the form. Thanks.