HtmlExporterNG2, carriage returns, and the apostrophe char

Posted: Fri Jun 01, 2012 1:40 am
by jallen
Hi Jason,

First off let me say docx4j is awesome. OK with that out of the way here is my question.

I am using the HtmlExporterNG2 class to export to HTML. Everything is working well, except that the output eliminates all carriage returns and the apostrophe character is being replaced with ’.
I'm not sure whether there are tweaks I need to make in the XSLT, or whether there is some other issue.

My code is pretty basic.
Code:
try {
    AbstractHtmlExporter exporter = new HtmlExporterNG2();

    HtmlSettings htmlSettings = null;
    OutputStream os = new ByteArrayOutputStream();

    javax.xml.transform.stream.StreamResult result = new javax.xml.transform.stream.StreamResult(os);
    exporter.html(wordMLPackage, result, htmlSettings);
    sReturn = os.toString();

} catch (Exception e) {
    // Pass the exception along so the stack trace is not lost
    log.error("getHTML", e);
}


Here is the exported html:
<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"><head><style><!-- /*paged media */ div.header {display: none } div.footer {display: none } /*@media print { */ @page { size: A4; margin: 10%; @top-center { content: element(header) } @bottom-center { content: element(footer) } } /*font definitions*/ /*element styles*/ .del {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;} /* Word style definitions */ /* TABLE STYLES */ /* PARAGRAPH STYLES */ .DocDefaults {display:block;space-after: 4mm;line-height: 115%;font-family: Calibri;font-size: 11.0pt;} .Normal {display:block;} /* CHARACTER STYLES */ .DefaultParagraphFont {display:inline;} /* TABLE CELL STYLES */ --></style><script type="text/javascript"> function toggleDiv(divid){ if(document.getElementById(divid).style.display == 'none'){ document.getElementById(divid).style.display = 'block'; }else{ document.getElementById(divid).style.display = 'none'; } } </script></head><body> <!-- userBodyTop goes here --> <div class="document"> <p class="Normal DocDefaults " style="position: relative; margin-left: 4in;text-indent: 0.5in;space-after: 0in;"><span style="font-weight: bold;">Nelson</span><span style="font-weight: bold;"><span style="white-space:pre-wrap;"> </span></span><span style="font-weight: bold;">Adams</span></p> <p class="Normal DocDefaults " style="position: relative; margin-left: 4in;text-indent: 0.5in;space-after: 0in;"><span style="font-weight: bold;">152 Main Street</span><span style="font-weight: bold;"><span style="white-space:pre-wrap;"> </span></span></p> <p class="Normal DocDefaults " style="position: relative; margin-left: 4in;text-indent: 0.5in;"><span style="font-weight: bold;">Saratoga Springs, NY 12866</span><span style="font-weight: bold;">.</span></p> <p 
class="Normal DocDefaults "><span style="white-space:pre-wrap;">Dear </span>Nelson<span style="white-space:pre-wrap;"> </span>,</p> <p class="Normal DocDefaults ">Hello.</p> <p class="Normal DocDefaults "><span style="white-space:pre-wrap;">How are things going with you? Thank you for contacting me. </span><span style="font-weight: bold;">You’re the best!</span></p> <p class="Normal DocDefaults " /> <p class="Normal DocDefaults " /> <p class="Normal DocDefaults ">Sincerely,</p> <p class="Normal DocDefaults "><span style="font-weight: bold;color: #C00000;font-family: Segoe Script;font-size: 16.0pt;">Ben Franklin</span></p></div> <!-- userBodyTail goes here --> </body></html>

I attached the sample docx.
Any ideas on what I'm doing wrong?

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Fri Jun 01, 2012 11:16 am
by jason
You can see the empty paragraph there in your HTML; it is just that the browser ignores it :-(

To fix this, if the paragraph is empty, I now insert a non-breaking space into it, which causes the browser to display it. See https://github.com/plutext/docx4j/commi ... e788052390

If you know of a better approach, please let me know.
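
For anyone who cannot pick up that change yet, the same idea can be approximated by post-processing the exported string. This is only a sketch, not the code from the commit: the class name `EmptyParagraphFiller` and the regular expression are assumptions made for illustration, and it only handles self-closed `<p ... />` elements like the ones in the output posted above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j.
final class EmptyParagraphFiller {

    // Matches a self-closed paragraph such as <p class="Normal DocDefaults " />.
    // Group 1 captures the attributes (if any) so they can be kept.
    private static final Pattern EMPTY_P =
            Pattern.compile("<p(\\s[^<>]*?)?\\s*/>");

    // Rewrites each empty paragraph so that it contains a non-breaking
    // space (&#160;), which browsers render as a blank line.
    static String fillEmptyParagraphs(String html) {
        return EMPTY_P.matcher(html).replaceAll("<p$1>&#160;</p>");
    }
}
```

Paragraphs that already have content are left untouched, since only the self-closed form matches.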

Regarding the apostrophe, it shows up OK in my browser. Obviously something is wrong somewhere, but I need to be able to reproduce it. So what does your environment look like? What does the code which emits the HTML look like?

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Fri Jun 01, 2012 11:39 am
by jallen
OK, that makes sense. So to get the latest change, do I just download the nightly build?

The environment is Java running as a managed bean in a JSF framework. To be specific, it is a Domino XPages server running on a Windows box. Let me see if I can put together some simplified code to reproduce the issue for you. I'll get something posted tomorrow.

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Sat Jun 02, 2012 2:26 am
by jallen
I figured out what is going on here. It has to do with Microsoft Word "Smart Quotes" displaying weirdly in HTML. If I disable smart quotes in Word, there is no issue.

Please see:
http://www.kevinkorb.com/post/37
http://ezinearticles.com/?Microsoft-Word-Smart-Quotes-and-Internet-Article-Writers-Dont-Mix&id=15624

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Sat Jun 02, 2012 2:29 am
by jallen
What would be the best way for me to replace the smart quotes in the HTML conversion?

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Sun Jun 03, 2012 12:57 pm
by jason
With https://github.com/plutext/docx4j/commi ... 8f41c620dc smart quotes should work. I'll make a 'nightly' build in the next 36 hours.

Are you using IE 9? I've found that it does not honour a UTF-8 encoding specified in an XML declaration. (In the browser, check View > Encoding; if it is something other than UTF-8, change it.)

The above commit excludes the XML declaration, so IE seems to use UTF-8 properly.
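
With plain JAXP, the XML declaration can be suppressed through a standard output property. The following is a minimal sketch of that general technique, not the code from the commit; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class, not part of docx4j.
final class OmitXmlDeclarationSketch {

    // Runs an identity transform over the given XML and serializes it
    // as UTF-8 without the leading <?xml ...?> declaration.
    static String serialize(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            ByteArrayOutputStream os = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(os));

            // Decode with the same charset the serializer was told to use.
            return os.toString("UTF-8");
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }
}
```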

You mentioned you are using docx4j with a Domino XPages server. Could I trouble you to share some notes about how you configured it, for the benefit of the poster in docx-java-f6/trouble-with-lotus-domino-xpages-t1100.html and others? Thanks :-)

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Mon Jun 04, 2012 7:09 pm
by jason
jason wrote: I'll make a 'nightly' build in the next 36 hours.


OK, please try http://www.docx4java.org/docx4j/docx4j- ... 120604.jar

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Tue Jun 12, 2012 11:10 pm
by jallen
Jason,

I apparently misspoke when I said the smart quote issue was resolved. I never changed my settings back in Word. I am still having issues when smart quotes are used. I am also having issues with bulleted lists displaying weird characters. I have tested it in IE, Firefox, and Chrome. I have confirmed that all browsers are in fact using UTF-8 encoding.

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Tue Jun 12, 2012 11:18 pm
by jason
Please attach a test docx exhibiting the issue.

Since I thought this was fixed in that nightly, it may be worth double checking you don't have an old docx4j jar on your classpath somewhere?
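
One way to check which jar a class is actually loaded from is to ask the class for its code source. This is a generic Java sketch; the helper name `JarLocator` is made up for illustration.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of docx4j.
final class JarLocator {

    // Returns the location (jar or directory URL) the given class was
    // loaded from, or a placeholder when the JVM does not report one
    // (for example for classes loaded by the bootstrap class loader).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source reported)";
        }
        return source.getLocation().toString();
    }
}
```

Logging `JarLocator.locationOf(HtmlExporterNG2.class)` from the code that does the export should show which docx4j jar is really in use.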

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Tue Jun 12, 2012 11:28 pm
by jallen
I attached a sample file. I double checked and I definitely have the latest build in the classpath. The build fixed the other issue with the carriage returns not working, so I am pretty positive I have the right one.

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Wed Jun 13, 2012 12:09 am
by jason
Your word.docx converted to HTML fine for me; I looked at it in IE9 and Chrome 19.0.1084.52 m

As a starting point, please try the conversion on the file using the sample code I used: https://github.com/plutext/docx4j/blob/ ... tHtml.java

Re: HtmlExporterNG2, carriage returns, an the apostrophe cha

Posted: Wed Jun 13, 2012 1:31 am
by jallen
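OK, I figured it out. The issue was with the conversion from the OutputStream to a String: the no-argument toString() decodes the bytes using the platform default charset rather than UTF-8. I fixed it by specifying UTF-8 in the toString call. Thanks for the help.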

Code:
((ByteArrayOutputStream) os).toString("UTF-8");
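
For anyone hitting the same thing, the effect can be reproduced without docx4j. The sketch below writes a smart apostrophe (U+2019) as UTF-8 bytes and decodes it two ways. The class name is made up, and windows-1252 is only an assumption about what the platform default charset was on the Windows server.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of docx4j.
final class CharsetDecodingDemo {

    // Stands in for the exporter: fills a stream with UTF-8 encoded text.
    static ByteArrayOutputStream utf8Bytes(String text) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        os.write(bytes, 0, bytes.length);
        return os;
    }

    // Decodes the buffered bytes with an explicitly named charset.
    static String decode(ByteArrayOutputStream os, String charsetName) {
        try {
            return os.toString(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = utf8Bytes("You\u2019re the best!");
        // Correct: the three UTF-8 bytes become one apostrophe again.
        System.out.println(decode(os, "UTF-8"));
        // Wrong charset: each of the three bytes becomes its own character.
        System.out.println(decode(os, "windows-1252"));
    }
}
```

Decoding with the wrong charset turns the single apostrophe into three separate characters, which matches the kind of garbage seen with smart quotes and bullets, while ordinary ASCII text comes through unchanged.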