
Unicode chars and HTML entities are not processed OK.

Posted: Sat Oct 12, 2013 11:09 pm
by meletis
Hello!

I'm trying to import the following HTML using XHTMLImporter, but the Unicode characters appear as small squares.

Code:
<!DOCTYPE html>

<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
    <h4>Questions</h4>
    <table class="table table-bordered table-condensed">
      <thead>
        <tr>
          <th>
            <h6>Variable Name</h6>
          </th>
          <th>
            <h6>Question Text</h6>
          </th>
          <th>
            <h6>Saved Value</h6>
          </th>
        </tr>
      </thead>
      <tbody>
        <tr class="gray">
          <td>satisfy</td>
          <td>Do you agree?</td>
          <td>
            <table class="table table-bordered table-condensed">
              <tr>
                <td>1</td>
                <td>पूरी तरह से सहमत</td>
              </tr>
              <tr>
                <td>2</td>
                <td>पूरी तरह से असहमत</td>
              </tr>
            </table>
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>


Then I thought I should HTML-escape the Unicode text, so I replaced it with the corresponding HTML entities, as follows:

Code:
<!DOCTYPE html>

<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
    <h4>Questions</h4>
    <table class="table table-bordered table-condensed">
      <thead>
        <tr>
          <th>
            <h6>Variable Name</h6>
          </th>
          <th>
            <h6>Question Text</h6>
          </th>
          <th>
            <h6>Saved Value</h6>
          </th>
        </tr>
      </thead>
      <tbody>
        <tr class="gray">
          <td>satisfy</td>
          <td>Do you agree?</td>
          <td>
            <table class="table table-bordered table-condensed">
              <tr>
                <td>1</td>
                <td>&#2346;&#2370;&#2352;&#2368; &#2340;&#2352;&#2361; &#2360;&#2375; &#2360;&#2361;&#2350;&#2340;</td>
              </tr>
              <tr>
                <td>2</td>
                <td>&#2346;&#2370;&#2352;&#2368; &#2340;&#2352;&#2361; &#2360;&#2375; &#2309;&#2360;&#2361;&#2350;&#2340;</td>
              </tr>
            </table>
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>


But the result was almost the same (see attached). Am I missing something?

Meletis.
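
A note on why escaping alone would not be expected to change the output: a decimal reference such as &#2346; denotes the same code point (U+092A) as the raw character, so after parsing both listings carry identical text. The sketch below illustrates this with a small stand-alone decoder. The class name NumericEntityDecoder and its decode method are hypothetical helpers written for this illustration; they are not part of docx4j or XHTMLImporter, and the sketch handles only decimal references.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of docx4j.
final class NumericEntityDecoder {

    // Matches decimal numeric character references such as &#2346;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    // Replaces each decimal reference with the character it denotes.
    // Throws if the number is not a valid Unicode code point.
    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());                 // text before the reference
            out.appendCodePoint(Integer.parseInt(m.group(1)));  // the referenced character
            last = m.end();
        }
        out.append(input, last, input.length());                // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "&#2346;&#2370;&#2352;&#2368;";        // first word of the second listing
        String raw = "\u092A\u0942\u0930\u0940";                // the same word as raw characters
        System.out.println(decode(escaped).equals(raw));        // prints "true"
    }
}
```

Since both forms decode to the same string, the small squares more likely point to a missing glyph in the font applied to the text than to an encoding problem.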

Re: Unicode chars and HTML entities are not processed OK.

Posted: Sun Oct 13, 2013 9:11 am
by jason
When I paste that into Word, I see it uses:

<w:rFonts w:ascii="Mangal" w:hAnsi="Mangal" w:cs="Mangal"/>


With a nightly build of docx4j-XHTMLImport, there is:

public static void addFontMapping(String cssFontFamily, RFonts rFonts)
 


If you set that mapping, things should work. You can experiment with different fonts for the ascii, hAnsi and cs values.

By the way, current docx4j nightlies contain https://github.com/plutext/docx4j/blob/ ... ector.java

See the comment at the top for how w:rFonts works.