Numbers not displayed in html
Posted:
Thu Apr 02, 2015 7:04 am
by sbelt
I am using docx4j 3.2.1 to convert .docx files into .html files in real time; in other words, when the web browser requests the file, Tomcat reads the .docx file, uses docx4j to convert it into an HTML string, then streams the content back to the browser. This is running on Linux 3.0.0-26-server. My issue is that although the text looks fine, numbers (phone numbers, zip codes, addresses, etc.) are displayed as strange, Farsi-looking characters. For example, the .docx file uses Arial to display the zip code 48174, and docx4j converts this to:
<span class="" style="">٤٨١٧٤</span>
Previous posts related to this issue suggested that I might need the mscorefonts, though those posts seemed to describe trouble converting all text - not just numbers. But I installed the msfonts, and I now get:
<span class="" style="font-family: 'Times New Roman';">٤٨١٧٤</span>
So as you can see, a font has been added to the style, but the actual text is still gibberish.
Can anyone suggest what might be wrong, or how I might go about troubleshooting this?
Thanks!
Steve
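A note for readers trying to identify the "strange" characters: the five characters in the span above are the Arabic-Indic digits U+0664, U+0668, U+0661, U+0667, U+0664, which are the digits 4, 8, 1, 7, 4 written in a different script. They are valid text, not encoding damage. A minimal sketch for checking this, assuming Java 8 or later (the class and method names are made up for illustration):

```java
final class DigitInspector {

    /** Lists each code point with its hex value, Unicode block and, for digits, its decimal value. */
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            out.append(String.format("U+%04X %s", cp, Character.UnicodeBlock.of(cp)));
            if (Character.isDigit(cp)) {
                // Character.digit understands decimal digits from any script
                out.append(" digit=").append(Character.digit(cp, 10));
            }
            out.append('\n');
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The text from the generated span, written with escapes so the source stays ASCII
        System.out.print(describe("\u0664\u0668\u0661\u0667\u0664"));
    }
}
```

If every line reports the ARABIC block with a digit value, the converter has substituted digits rather than picked a wrong font.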
Re: Numbers not displayed in html
Posted:
Thu Apr 02, 2015 8:59 pm
by jason
Save the HTML string to a file and have a look at its contents (ie instead of looking at it in the browser).
Does it still look weird?
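One way to carry out this check without depending on how a browser or editor decodes the file is sketched below. It writes the string as UTF-8 and also prints a copy in which every non-ASCII character is shown as a `\uXXXX` escape. The class and method names are hypothetical, and the sample string stands in for the real docx4j output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlDump {

    /** Returns a copy of s in which each non-ASCII char is replaced by a backslash-u hex escape. */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    /** Writes the HTML as UTF-8 so the bytes on disk do not depend on the platform default charset. */
    static void dump(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String html = "<span>\u0664\u0668\u0661\u0667\u0664</span>"; // stand-in for the converter output
        Path target = Files.createTempFile("docx4j-out", ".html");
        dump(html, target);
        System.out.println("Wrote " + target);
        System.out.println(escapeNonAscii(html));
    }
}
```

If the escaped form shows code points in the U+0660 to U+0669 range, the digits were already changed before the string left the converter.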
Re: Numbers not displayed in html
Posted:
Fri Apr 03, 2015 1:48 am
by sbelt
Thanks, Jason, for responding.
I hope this achieves the same test you were suggesting: I have attached my Eclipse debugger to the running webapp, and I see that the HTML String created by the Docx4J.toHTML() method has the weird characters in it.
I confess I am new to this library. Could it be related to my HTMLSettings or the Docx4J.FLAG_EXPORT_PREFER_XSL flag?
Steve
Re: Numbers not displayed in html
Posted:
Fri Apr 03, 2015 3:19 am
by sbelt
I am attaching a simple sample .docx which presents this misbehavior.
Attachment: fragment.docx (13.13 KiB), a sample .docx which obscures the numbers
Re: Numbers not displayed in html
Posted:
Fri Apr 03, 2015 4:32 am
by sbelt
I now believe chasing fonts is a red herring: I can now reproduce this error on my Windows desktop.
In addition to the source file I provided in my previous post, I am attaching the resulting .html being generated. Below is my code:
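Code:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.net.URL;

import org.apache.commons.io.FileUtils;
import org.docx4j.Docx4J;
import org.docx4j.Docx4jProperties;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.fonts.BestMatchingMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class Docx4jTest {

    // Declared but not used by the conversion below.
    private static Mapper fontMapper = new BestMatchingMapper();

    public Docx4jTest() throws Exception {
    }

    /** Converts the .docx at the given URL to an HTML string; returns "" on failure. */
    public String convertDocxToHTML(String path) {
        String html = "";
        // try-with-resources closes the input stream once the package is loaded and converted
        try (InputStream in = new URL(path).openStream()) {
            WordprocessingMLPackage wordMLPackage = Docx4J.load(in);

            HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
            htmlSettings.setImageDirPath(path + "_files");
            htmlSettings.setImageTargetUri(path.substring(path.lastIndexOf("/") + 1) + "_files");
            htmlSettings.setWmlPackage(wordMLPackage);

            String userCSS = "html, body, div, span, h1, h2, h3, h4, h5, h6, p, a, img, ol, ul, li, table, caption, tbody, tfoot, thead, tr, th, td " +
                    "{ margin: 0; padding: 0; border: 0;}" +
                    "body {line-height: 1;} ";
            htmlSettings.setUserCSS(userCSS);

            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
            Docx4J.toHTML(htmlSettings, baos, Docx4J.FLAG_EXPORT_PREFER_XSL);

            // Decode explicitly as UTF-8 (assumed to be the encoding docx4j writes)
            // instead of relying on the platform default charset.
            html = baos.toString("UTF-8");
        } catch (Exception e) {
            System.err.println(e.getMessage());
            html = ""; // reset html - empty string means failure
        }
        return html;
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            Docx4jTest docx4j = new Docx4jTest();
            String path = new File("c:\\temp\\fragment.docx").toURI().toString();
            String html = docx4j.convertDocxToHTML(path);
            File file1 = new File("c:\\temp\\test1.html");
            // Write with the same encoding used to decode the converter output
            FileUtils.writeStringToFile(file1, html, "UTF-8");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}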
Re: Numbers not displayed in html
Posted:
Fri Apr 03, 2015 7:01 pm
by jason
docx4j's RunFontSelector class contains a method:
private String arabicNumbering(String text, BooleanDefaultTrue rtl, BooleanDefaultTrue cs, CTLanguage themeFontLang)
which under certain conditions will convert numerals to Arabic.
It seems it is being overly aggressive.
Code:
@@ -396,7 +396,9 @@
             return nullRPr(document, text);
         }
-        text = this.arabicNumbering(text, rPr.getRtl(), rPr.getCs(), themeFontLang);
+        if (pPr!=null && pPr.getBidi()!=null && pPr.getBidi().isVal() ) {
+            text = this.arabicNumbering(text, rPr.getRtl(), rPr.getCs(), themeFontLang);
+        }
This seems to fix it, but I'm not sure yet that it is the correct fix.
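To illustrate the idea behind the guard, here is a rough, self-contained sketch. It is not docx4j's actual arabicNumbering implementation; `toArabicIndic`, `render` and `paragraphIsBidi` are hypothetical names. It shows a digit substitution that is applied only when the paragraph is explicitly marked right-to-left, which is the condition the patch adds.

```java
final class ArabicDigitSketch {

    /** Maps ASCII digits 0-9 to the Arabic-Indic digits U+0660..U+0669; other chars pass through. */
    static String toArabicIndic(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(c >= '0' && c <= '9' ? (char) ('\u0660' + (c - '0')) : c);
        }
        return sb.toString();
    }

    /** Applies the substitution only when the paragraph is flagged as bidirectional (assumed guard). */
    static String render(String text, boolean paragraphIsBidi) {
        return paragraphIsBidi ? toArabicIndic(text) : text;
    }

    public static void main(String[] args) {
        System.out.println(render("48174", false)); // left-to-right paragraph: digits unchanged
        System.out.println(render("48174", true));  // bidi paragraph: digits substituted
    }
}
```

With a guard like this, a left-to-right English paragraph keeps its ASCII digits even when the document settings mention an Arabic bidi language.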
Re: Numbers not displayed in html
Posted:
Sat Apr 04, 2015 2:57 am
by sbelt
Thanks, Jason. Last night I did notice that my failing .docx differed from others in that its /word/settings.xml part used '<w:themeFontLang w:val="en-US" w:eastAsia="zh-TW" w:bidi="ar-SA"/>' instead of '<w:themeFontLang w:val="en-US"/>'. The fact that your patch includes the 'themeFontLang' variable confirms to me that we are on the same track.
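For anyone wanting to check their own documents for the same setting, the sketch below reads word/settings.xml straight out of the .docx (which is a zip archive) and reports the w:bidi attribute of w:themeFontLang. It uses only the JDK, not docx4j, and the class and method names are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ThemeFontLangCheck {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the w:bidi value of the first w:themeFontLang element, or null if there is none. */
    static String bidiLang(InputStream settingsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the *NS lookups below
        NodeList nodes = factory.newDocumentBuilder().parse(settingsXml)
                .getElementsByTagNameNS(W_NS, "themeFontLang");
        if (nodes.getLength() == 0) {
            return null;
        }
        Element e = (Element) nodes.item(0);
        return e.hasAttributeNS(W_NS, "bidi") ? e.getAttributeNS(W_NS, "bidi") : null;
    }

    /** Convenience wrapper for an in-memory XML string; wraps any failure in an unchecked exception. */
    static String bidiLangOfXml(String xml) {
        try {
            return bidiLang(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse settings XML", ex);
        }
    }

    /** Opens the .docx as a zip archive and inspects its word/settings.xml part. */
    static String bidiLangOfDocx(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/settings.xml");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return bidiLang(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.println("themeFontLang w:bidi = " + bidiLangOfDocx(args[0]));
        }
    }
}
```

A non-null result such as ar-SA marks a document that carries the same setting as the failing sample.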
So, I have downloaded the docx4j source, applied your patch, and rebuilt the .jar. I can confirm that this fixes the (mis)behavior I was describing. You say that you are, "not sure yet that this is the correct fix." Do you think I should deploy your current fix, or are you still checking for a better, more-correct solution?
Thanks for all your help!
Steve
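For readers who cannot rebuild the jar, one possible stopgap is to map Arabic-Indic digits back to ASCII after conversion. This is not something proposed in the thread, and it is only safe under the assumption that none of your documents are meant to show Arabic-Indic digits. The sketch assumes Java 8 or later; the class and method names are hypothetical.

```java
final class DigitNormalizer {

    /** Replaces Arabic-Indic digits (U+0660..U+0669) with ASCII 0-9; everything else is copied as-is. */
    static String toAsciiDigits(String html) {
        StringBuilder sb = new StringBuilder(html.length());
        html.codePoints().forEach(cp -> {
            if (cp >= 0x0660 && cp <= 0x0669) {
                sb.append((char) ('0' + (cp - 0x0660)));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the conversion
        String html = "<span>\u0664\u0668\u0661\u0667\u0664</span>";
        System.out.println(toAsciiDigits(html));
    }
}
```

Because the markup itself contains only ASCII, running this over the whole HTML string leaves tags and attributes untouched.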