
Numbers not displayed in html

Posted: Thu Apr 02, 2015 7:04 am
by sbelt
I am using docx4j 3.2.1 to convert .docx files into .html files in real time: when the web browser requests the file, Tomcat reads the .docx file, uses docx4j to convert it into an HTML string, then streams the content back to the browser. This is running on Linux 3.0.0-26-server. My issue is that although the text looks fine, numbers (phone numbers, zip codes, addresses, etc.) are displayed as strange, Farsi-looking characters. For example, the .docx file uses Arial to display the zip code 48174, and docx4j converts this to:

<span class="" style="">٤٨١٧٤</span>

Previous posts related to this issue suggested that I might need the mscorefonts, though those posts seemed to describe trouble converting all text - not just numbers. But I installed the msfonts, and I now get:

<span class="" style="font-family: 'Times New Roman';">٤٨١٧٤</span>

So as you can see, a font has been added to the style, but the actual text is still gibberish.

Can anyone suggest what might be wrong, or how I might go about troubleshooting this?

Thanks!
Steve
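
For anyone hitting the same symptom, it may help to check what the "strange" characters actually are. The sketch below is not part of the original post; the class and method names are made up for illustration. It prints the code point and decimal value of each character and maps any Unicode decimal digit back to ASCII. Run against the span content above, it shows that the characters are Arabic-Indic digits that still encode 48174, which points at a digit substitution rather than a missing font.

```java
import java.util.stream.Collectors;

// Hypothetical helper for inspecting the converted text; not part of docx4j.
class DigitInspector {

    /** Lists each code point with its decimal digit value (-1 if it is not a digit). */
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X=%d", cp, Character.digit(cp, 10)))
                .collect(Collectors.joining(" "));
    }

    /** Replaces any Unicode decimal digit with its ASCII equivalent; other characters pass through. */
    static String toAscii(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            int d = Character.digit(cp, 10);
            if (d >= 0) {
                sb.append((char) ('0' + d));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The span content from the post, written with escapes to avoid source-encoding issues
        String fromHtml = "\u0664\u0668\u0661\u0667\u0664";
        System.out.println(describe(fromHtml)); // U+0664=4 U+0668=8 U+0661=1 U+0667=7 U+0664=4
        System.out.println(toAscii(fromHtml));  // 48174
    }
}
```

Applying something like toAscii to the generated HTML would only mask the symptom, so it is probably better kept as a diagnostic than as a fix.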

Re: Numbers not displayed in html

Posted: Thu Apr 02, 2015 8:59 pm
by jason
Save the HTML string to a file and have a look at its contents (ie instead of looking at it in the browser).

Does it still look weird?
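
A minimal sketch of that check, assuming the HTML is already available as a Java String (HtmlDump and dump are hypothetical names, not docx4j API). It writes the string to a temporary file as UTF-8 so the bytes can be opened in a text or hex editor, which takes the browser's font and encoding handling out of the picture.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; writes the converted HTML to disk for inspection.
class HtmlDump {

    /** Writes the HTML to a new temporary file as UTF-8 and returns its path. */
    static Path dump(String html) throws IOException {
        Path out = Files.createTempFile("docx4j-output-", ".html");
        Files.write(out, html.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by the conversion
        String html = "<span class=\"\" style=\"\">\u0664\u0668\u0661\u0667\u0664</span>";
        System.out.println("Wrote " + dump(html));
    }
}
```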

Re: Numbers not displayed in html

Posted: Fri Apr 03, 2015 1:48 am
by sbelt
Thanks, Jason, for responding.

I hope this achieves the same test you were suggesting: I have attached my Eclipse debugger to the running webapp, and I see that the HTML String created for me by the Docx4J.toHTML() method has the weird characters in it.

I confess I am new to this library. Could it be related to my HTMLSettings or to the Docx4J.FLAG_EXPORT_PREFER_XSL flag?

Steve

Re: Numbers not displayed in html

Posted: Fri Apr 03, 2015 3:19 am
by sbelt
I am attaching a simple sample .docx which presents this misbehavior.

fragment.docx
Sample .docx which obscures the numbers

Re: Numbers not displayed in html

Posted: Fri Apr 03, 2015 4:32 am
by sbelt
I now believe chasing fonts is a red herring: I can reproduce this error on my Windows desktop.

In addition to the source file I provided in my previous post, I am attaching the resulting .html being generated. Below is my code:

Code: Select all
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.net.URL;

import org.apache.commons.io.FileUtils;
import org.docx4j.Docx4J;
import org.docx4j.Docx4jProperties;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.fonts.BestMatchingMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class Docx4jTest {
  // Left over from the font experiments; not used in this repro
  private static Mapper fontMapper = new BestMatchingMapper();

  /** Converts the .docx at the given URL to an HTML string; returns "" on failure. */
  public String convertDocxToHTML(String path) {
    String html = "";
    // try-with-resources closes the input stream once the package is loaded
    try (InputStream in = new URL(path).openStream()) {
      WordprocessingMLPackage wordMLPackage = Docx4J.load(in);
      HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
      htmlSettings.setImageDirPath(path + "_files");
      htmlSettings.setImageTargetUri(path.substring(path.lastIndexOf("/") + 1) + "_files");
      htmlSettings.setWmlPackage(wordMLPackage);
      // Minimal CSS reset applied to the generated page
      String userCSS = "html, body, div, span, h1, h2, h3, h4, h5, h6, p, a, img, ol, ul, li, table, caption, tbody, tfoot, thead, tr, th, td " +
          "{ margin: 0; padding: 0; border: 0;}" +
          "body {line-height: 1;} ";
      htmlSettings.setUserCSS(userCSS);
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
      Docx4J.toHTML(htmlSettings, baos, Docx4J.FLAG_EXPORT_PREFER_XSL);
      // Decode as UTF-8 explicitly instead of relying on the platform default charset
      html = baos.toString("UTF-8");
    } catch (Exception e) {
      System.err.println(e.getMessage());
      html = ""; // reset html - empty string means failure
    }
    return html;
  }

  /**
   * @param args unused
   */
  public static void main(String[] args) {
    try {
      Docx4jTest docx4j = new Docx4jTest();
      String path = new File("c:\\temp\\fragment.docx").toURI().toString();
      String html = docx4j.convertDocxToHTML(path);

      File file1 = new File("c:\\temp\\test1.html");
      FileUtils.writeStringToFile(file1, html, "UTF-8");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Re: Numbers not displayed in html

Posted: Fri Apr 03, 2015 7:01 pm
by jason
docx4j's RunFontSelector class contains a method:

Code: Select all
    private String arabicNumbering(String text, BooleanDefaultTrue rtl, BooleanDefaultTrue cs, CTLanguage themeFontLang )


which, under certain conditions, will convert numerals to Arabic-Indic digits.

It seems it is being overly aggressive.

Code: Select all
@@ -396,7 +396,9 @@
         return nullRPr(document, text);
      }      
        
-      text = this.arabicNumbering(text, rPr.getRtl(), rPr.getCs(), themeFontLang);
+      if (pPr!=null && pPr.getBidi()!=null && pPr.getBidi().isVal() ) {
+         text = this.arabicNumbering(text, rPr.getRtl(), rPr.getCs(), themeFontLang);
+      }


This patch seems to fix it, but I'm not sure yet that this is the correct fix.
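
To illustrate the idea behind the patch, here is a standalone sketch. It is not docx4j's actual code: BidiDigitGuard, toArabicIndic and render are hypothetical names, and the real arabicNumbering method takes further inputs (rtl, cs, themeFontLang) that are ignored here. The point is only the guard: digits are rewritten when the paragraph is explicitly marked bidi, and left alone otherwise.

```java
// Hypothetical illustration of the guard added by the patch; not docx4j code.
class BidiDigitGuard {

    /** Maps ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669. */
    static String toArabicIndic(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(c >= '0' && c <= '9' ? (char) (c - '0' + 0x0660) : c);
        }
        return sb.toString();
    }

    /**
     * Converts digits only when the paragraph is explicitly bidi.
     * A null flag stands in for a paragraph with no bidi setting.
     */
    static String render(String text, Boolean paragraphBidi) {
        return (paragraphBidi != null && paragraphBidi) ? toArabicIndic(text) : text;
    }

    public static void main(String[] args) {
        System.out.println(render("48174", null));  // 48174
        System.out.println(render("48174", false)); // 48174
    }
}
```

Under this assumption, an English paragraph in a document whose settings merely mention a bidi theme language keeps its Western digits.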

Re: Numbers not displayed in html

Posted: Sat Apr 04, 2015 2:57 am
by sbelt
Thanks, Jason. Last night I did notice that my failing .docx was different from others in that the /word/settings.xml part of the .docx file used '<w:themeFontLang w:val="en-US" w:eastAsia="zh-TW" w:bidi="ar-SA"/>' instead of '<w:themeFontLang w:val="en-US"/>'. The fact that your patch involves the 'themeFontLang' variable confirms to me that we are on the same track.

So, I have downloaded the docx4j source, applied your patch, and rebuilt the .jar. I can confirm that this fixes the (mis)behavior I was describing. You say that you are, "not sure yet that this is the correct fix." Do you think I should deploy your current fix, or are you still checking for a better, more-correct solution?

Thanks for all your help!

Steve
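
For checking other documents for the same settings.xml difference, here is a rough sketch that uses only the JDK, with no docx4j involved. ThemeFontLangCheck and themeFontLangBidi are hypothetical names. It opens the .docx as a zip, parses word/settings.xml, and reports the w:bidi attribute of w:themeFontLang if one is present.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic; reads the package directly rather than through docx4j.
class ThemeFontLangCheck {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the w:bidi value of w:themeFontLang in word/settings.xml, or null if absent. */
    static String themeFontLangBidi(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/settings.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry so the XML parser cannot close the zip stream
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                dbf.setNamespaceAware(true);
                Document doc = dbf.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
                NodeList nodes = doc.getElementsByTagNameNS(W_NS, "themeFontLang");
                if (nodes.getLength() == 0) {
                    return null;
                }
                Element lang = (Element) nodes.item(0);
                return lang.hasAttributeNS(W_NS, "bidi") ? lang.getAttributeNS(W_NS, "bidi") : null;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ThemeFontLangCheck path/to/file.docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("themeFontLang w:bidi = " + themeFontLangBidi(in));
        }
    }
}
```

A non-null result such as ar-SA would suggest the document is a candidate for the digit conversion discussed in this thread.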