Plutext

Posted: **Sat Mar 15, 2014 6:17 am**

Hi,

I can successfully convert from DOCX to PDF, but the numbers do not appear in Arabic format.

Thanks

Posted: **Sat Mar 15, 2014 8:52 am**

If you mean "0, 1, 2, 3, 4, 5, 6, 7, 8, 9", they should appear, so please post a short sample docx illustrating the problem.

If you mean some other numbering system, it may be that support needs to be added in the package https://github.com/plutext/docx4j/tree/ ... tnumbering

Posted: **Sat Mar 15, 2014 6:02 pm**

Thanks for your quick response; I really appreciate it. I have attached two pictures: one shows the PDF output, where the numbers appear in English format, and the other shows how the numbers should appear in the PDF output.

I hope this makes the problem clear.

Thanks

Posted: **Sun Mar 16, 2014 7:34 pm**

sample docx?

Posted: **Mon Mar 17, 2014 6:25 am**

Please find attached.

Posted: **Mon Mar 17, 2014 10:45 am**

I'd like to fix this this week.

Please note this is a somewhat complex area, and depending on how you created test.docx, there is no guarantee that the problems exhibited by it have the same underlying cause as in untitled.png. The most reliable way is to start with your real docx, deleting everything except the test case (as opposed to copy/paste into a new docx).

Created https://github.com/plutext/docx4j/issues/109

Posted: **Wed Mar 19, 2014 6:47 am**

Hi Jason,
Do you want me to upload the original document? In fact, test.docx is a real case. I have a document that contains Arabic text as well as numerals. Along with the text, I want the numerals to appear in Arabic format too. An example of the Arabic numerals was given in untitled.png for reference. I hope I have explained the issue clearly enough. Please let me know if you need anything.

Thanks

Posted: **Wed Mar 19, 2014 5:19 pm**

I think things should work as-is, provided you aren't doing something like:

Code:

    fontMapper.getFontMappings().put("Times New Roman", null);

That would happen if for example you:

Code:

    PhysicalFonts.setRegex(regex);

where your regex *excludes* "Arial Unicode MS", and then you do:

Code:

    PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Arial Unicode MS");
    // make sure this font is allowed by your regex (if any)!!!
    fontMapper.getFontMappings().put("Times New Roman", font ); // oops!
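To make the pitfall concrete, here is a minimal sketch in plain Java. It does not call docx4j: `FontFilterPitfall`, `discover` and `mapIfPresent` are hypothetical stand-ins, the registry is an ordinary map, and the assumption that the regex must match the whole font name is made up for the illustration. It only shows why looking up a font that the regex excluded yields null, and how a guard keeps that null out of the mappings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-filtered font registry; not docx4j code.
final class FontFilterPitfall {

    /** Keeps only the font names that fully match the regex (assumed semantics). */
    static Map<String, String> discover(List<String> installed, String regex) {
        Pattern allowed = Pattern.compile(regex);
        Map<String, String> registry = new HashMap<>();
        for (String name : installed) {
            if (allowed.matcher(name).matches()) {
                registry.put(name, "font:" + name); // placeholder for a font object
            }
        }
        return registry;
    }

    /** Adds the substitution only if the target font survived the filter. */
    static boolean mapIfPresent(Map<String, String> mappings,
                                Map<String, String> registry,
                                String from, String to) {
        String font = registry.get(to);
        if (font == null) {
            return false; // excluded by the regex: do not store a null mapping
        }
        mappings.put(from, font);
        return true;
    }
}
```

With a regex such as `Times.*`, the lookup for "Arial Unicode MS" returns null, so `mapIfPresent` leaves the mappings untouched instead of mapping "Times New Roman" to nothing.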

Posted: **Thu Mar 20, 2014 7:55 am**

I tried many options, e.g. fonts other than Arial Unicode MS, but in vain. I have pasted my code in case I am missing anything.

Code:

    String inputfilepath = "c:\\users\\malik\\desktop\\test2.docx"; // args[0];

    // Document loading (required)
    WordprocessingMLPackage wordMLPackage;
    wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

    Mapper fontMapper = new IdentityPlusMapper();
    PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Arial Unicode MS");
    fontMapper.getFontMappings().put("Times New Roman", font );
    wordMLPackage.setFontMapper(fontMapper);

    FOSettings foSettings = Docx4J.createFOSettings();
    if (saveFO) {
        foSettings.setFoDumpFile(new java.io.File(inputfilepath + ".fo"));
    }
    foSettings.setWmlPackage(wordMLPackage);

    // exporter writes to an OutputStream.
    String outputfilepath;
    outputfilepath = "test.pdf";
    OutputStream os = new java.io.FileOutputStream(outputfilepath);
    Docx4J.toFO(foSettings, os, Docx4J.FLAG_NONE);

I have also added the contents of the FO file for your reference.

Code:

    <?xml version="1.0" encoding="utf-8"?>
    <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
      <layout-master-set xmlns="http://www.w3.org/1999/XSL/Format">
        <simple-page-master margin-bottom="0.25in" margin-left="1in" margin-right="1in" margin-top="0.25in" master-name="s1-simple" page-height="11in" page-width="8.5in">
          <region-body margin-bottom="19mm" margin-left="0mm" margin-right="0mm" margin-top="19mm"/>
          <region-before extent="12mm" region-name="xsl-region-before-simple"/>
          <region-after extent="12mm" region-name="xsl-region-after-simple"/>
        </simple-page-master>
        <page-sequence-master master-name="s1">
          <repeatable-page-master-alternatives>
            <conditional-page-master-reference master-reference="s1-simple"/>
          </repeatable-page-master-alternatives>
        </page-sequence-master>
      </layout-master-set>
      <fo:page-sequence id="section_s1" format="" master-reference="s1">
        <fo:flow flow-name="xsl-region-body">
          <fo:block break-before="auto" font-size="16.0pt" line-height="100%" space-after="0in" space-before="0in">
            <inline xmlns="http://www.w3.org/1999/XSL/Format" font-size="16.0pt" writing-mode="rl-tb">
              <inline font-family="Arial">وثيقة تجربة : 123456789</inline>
            </inline>
          </fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>

Posted: **Thu Mar 20, 2014 9:53 am**

The docx uses Times New Roman for Arabic. Assuming that's on your computer (it ought to be), just delete the lines:

Code:

    PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Arial Unicode MS");
    fontMapper.getFontMappings().put("Times New Roman", font );

What version of docx4j are you using? Try http://www.docx4java.org/docx4j/docx4j- ... 140314.jar (which is more recent than 3.0.1)

Posted: **Fri Mar 21, 2014 8:43 am**

Sorry Jason, I tried commenting out the two lines you pointed out, and I also tried changing the digit substitution to Arabic digits. Even now the digits appear as Arabic digits in the FO file, but in the final output PDF they still appear as non-Arabic digits.

Posted: **Fri Mar 21, 2014 12:54 pm**

When I open your test.pdf in Adobe Reader, I see Arabic:

[Attachment: malik.png]

Do you see something different?

The font is embedded in the PDF (so reported by Adobe Reader), so that shouldn't be the issue. What PDF viewer are you using?

malik wrote: I also tried changing the digit substitution to Arabic digits.

Not sure what you mean?

Posted: **Fri Mar 21, 2014 5:22 pm**

Hi Jason, right, the text appears in Arabic, and it has from day one. But the digits do not appear in Arabic, and that has been my problem since the start of this thread. I may not have explained my problem properly, but I hope you now have an idea of the actual issue.

Posted: **Fri Mar 21, 2014 5:41 pm**

Nope, sorry, I'm completely confused!

I see the digits "123456789" in the Word docx, in your FO (posted Thu Mar 20, 2014 7:55 am), and in the PDF output. And that's what I'd expect.

Do you see something different when you open your test.docx in Word on your PC? Do you have the Windows Arabic language pack installed?

Posted: **Fri Mar 21, 2014 6:38 pm**

I have again attached three pictures:
1- word.png, which shows the document opened on my machine, with the digits in Arabic format.
2- systemlocale.png, which shows my machine's current language and locale settings.
3- ouput.png, which shows the output PDF, where the digits are not converted to Arabic digits (bidi).

Thanks
Malik

Posted: **Fri Mar 21, 2014 9:34 pm**

OK now we're getting somewhere.

You mean what are called "EASTERN Arabic" numerals in English: http://en.wikipedia.org/wiki/Eastern_Arabic_numerals
or in Unicode, Arabic-Indic: http://stackoverflow.com/questions/1676 ... bic-digits
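For reference, the Arabic-Indic digits occupy the contiguous Unicode range U+0660 to U+0669, so mapping ASCII digits onto them is a fixed offset. The sketch below shows only that character mapping; the class and method names are hypothetical, and it says nothing about how docx4j or FOP handles digits.

```java
// Illustrative only: maps ASCII digits to Arabic-Indic digits by code point offset.
final class ArabicIndicDigits {

    // U+0660 is ARABIC-INDIC DIGIT ZERO; the ten digits are contiguous.
    private static final char ARABIC_INDIC_ZERO = '\u0660';

    /** Replaces ASCII digits 0-9 with Arabic-Indic digits; other characters pass through. */
    static String toArabicIndic(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                out.append((char) (ARABIC_INDIC_ZERO + (c - '0')));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Whether this displays correctly depends on the console encoding and font.
        System.out.println(toArabicIndic("123456789"));
    }
}
```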

The puzzle is in 3 parts:
1. understanding when Word uses Eastern Arabic/Arabic-Indic
2. how to tell XSL FO / FOP to use it
3. deciding when docx4j should use Eastern Arabic/Arabic-Indic (a setting, computer locale, or what?)

In this post I deal with point 1.

What does your instance of Word do, exactly? Could you please open the attached docx in Word, then post a screen shot of what you see?

In your Word "Options", choose Hindi numerals:
- Click on "Options", then click on "Additional" (or "Miscellaneous"?) on the left, the one under "Language".
- Scroll down to "Show content" (headline no. 3). There, you will find "Numerals". You can choose between "Hindi", "Context", "Arabic", and "System".
What setting do you have there?
See further:
[1] http://answers.microsoft.com/en-us/offi ... b04bd0b4b9

Note the quote:

"Something you should bear in mind when using these options is that they are Word options. They aren't stored in the document at all. If you save and close your document, modify the options, and re-open, the new options will be in force. If you send the document to another user, their options will be in force. Since this isn't something I do, it's difficult to know whether this is what Word users writing in Arabic/English (say) expect (or perhaps even "have become resigned to") or whether they would be surprised that their numbers might appear differently to a recipient."

[2] http://superuser.com/questions/182039/h ... ithin-word

So maybe the answer to point 3 is that docx4j should have a property setting like Hindi_Numerals="Hindi"|"Context"|"Arabic"|"System"

Posted: **Sat Mar 22, 2014 2:47 am**

Actually, I did try the setting you pointed out, but I opened the Word options again to make sure. I have attached a picture of the Word options on my machine. I even tried with the docx you provided in your previous post. But all of them gave the same result, i.e. the numerals are not converted to the so-called Hindi numerals. I may be wrong, but why don't you double-check the library code? Perhaps the numerals are not handled, or are ignored, while the characters are converted to the desired Unicode format.

Posted: **Sat Mar 22, 2014 7:04 am**

The docx I uploaded is exactly the one you provided, but with 2 other permutations of rPr:

    <w:p>
      <w:pPr>
        <w:rPr>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="cs"/><!-- all as provided originally by Malik -->
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
          <w:rtl/>
        </w:rPr>
        <w:t>وثيقة تجربة : 123456789</w:t>
      </w:r>
    </w:p>

    <w:p>
      <w:pPr>
        <w:rPr>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="cs"/>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/><!-- omit rtl, see what happens -->
        </w:rPr>
        <w:t>وثيقة تجربة : 123456789</w:t>
      </w:r>
    </w:p>

    <w:p>
      <w:pPr>
        <w:rPr>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:sz w:val="32"/><!-- omit rfonts hint, see what happens -->
          <w:szCs w:val="32"/>
          <w:rtl/>
        </w:rPr>
        <w:t>وثيقة تجربة : 123456789</w:t>
      </w:r>
    </w:p>

So the first paragraph at least should have appeared for you as before, since it is exactly the same! You'll need to explore why not please.

The point is that docx4j does not and should not automatically convert to Eastern Arabic (Hindi). Notice Word stores the numbers in 123 format (ie not Eastern Arabic). (Maybe even this post is hard for you to interpret correctly, if your PC is automatically converting numbers to Eastern Arabic? You may need to change its locale temporarily).
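One way to check which digits a piece of text actually contains, independent of how a particular PC renders them, is to print the code points. This is a minimal sketch with hypothetical names (`DigitInspector`, `digitCodePoints`); ASCII digits show up as U+0030 to U+0039 and Arabic-Indic digits as U+0660 to U+0669.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: lists the code point of every digit character in a string.
final class DigitInspector {

    /** Returns one "U+XXXX" entry per digit character found in the text. */
    static List<String> digitCodePoints(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(Character::isDigit)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // First word of the sample text from this thread, written with escapes
        // so the result does not depend on the source file encoding.
        String sample = "\u0648\u062B\u064A\u0642\u0629 : 123456789";
        System.out.println(digitCodePoints(sample));
    }
}
```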

I can probably make it mimic Word's behaviour, once we know what that is.

If you can't make sense of my malik_arabic_numbering.docx, instead please make a screen copy of your test.docx, using each of the different possible settings for the Numeral option (ie Arabic, Hindi, Context, System). I guess you've done Hindi already.

I suspect "context" will be something like: Digits/numerals next to (before or after?) Latin text are represented in Arabic-Western digits, while those next to Arabic text are represented in Arabic-Indic representation. So you may need to make your example more complex.
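The guess above can be written down as a small sketch. The rule used here is an assumption made for illustration, not Word's documented behaviour: a digit takes the Arabic-Indic form when the nearest preceding letter is in the Arabic script, and stays as-is otherwise (including when no letter has been seen yet). `ContextDigits` and `shapeByContext` are hypothetical names.

```java
// Sketch of one possible "Context" rule; the rule itself is an assumption.
final class ContextDigits {

    /** Shapes ASCII digits according to the script of the nearest preceding letter. */
    static String shapeByContext(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean arabicContext = false; // no letter seen yet: keep ASCII digits
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (cp >= '0' && cp <= '9') {
                // U+0660 is ARABIC-INDIC DIGIT ZERO
                out.appendCodePoint(arabicContext ? 0x0660 + (cp - '0') : cp);
            } else {
                if (Character.isLetter(cp)) {
                    arabicContext =
                        Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC;
                }
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Under this rule, digits following Latin text such as "Page 12" are left alone, while digits following Arabic letters are replaced. Whether Word also looks at the text after the digits is exactly the open question raised above.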

Posted: **Sat Mar 22, 2014 8:51 am**

I do agree with your stance that

The point is that docx4j does not and should not automatically convert to Eastern Arabic (Hindi)

Docx4j should provide an option to explicitly set how numerals are converted in the output PDF, according to the user's choice.
By the way, I tried all possible settings for the Numeral option in Word (i.e. Arabic, Hindi, Context, System), but in all cases the output PDF stayed the same. I also tried changing the locale of my machine to something other than Arabic, but with no success.

Regards
Malik

Posted: **Sat Mar 22, 2014 10:56 am**

malik wrote: By the way, I tried all possible settings for the Numeral option in Word (i.e. Arabic, Hindi, Context, System), but in all cases the output PDF stayed the same.

Of course it would.

Right now, I'm interested in what difference that makes to what you see on the screen in WORD.

Once we know how WORD behaves, we can look to mimic that in docx4j's PDF output.

Posted: **Sat Mar 22, 2014 7:58 pm**

OK, I have uploaded two sets of pictures.

Set One
NumerialArabic1.png
NumerialSystem1.png
NumerialContext1.png
NumerialHindi1.png

The above set was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to Context.

Posted: **Sat Mar 22, 2014 8:01 pm**

I am attaching the second set in a new post, since the forum does not allow more than 5 attachments.
Set Two
NumerialArabic2.png
NumerialSystem2.png
NumerialContext2.png
NumerialHindi2.png

The above set was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to National.

I hope this gives you a clear idea of how to deal with these situations. Please let me know if you need anything else.

Posted: **Tue Mar 25, 2014 1:50 am**

Hi Jason,

Do you still need anything from me, or is what I posted in my last two replies not what you were looking for?

Thanks
Malik

Posted: **Tue Mar 25, 2014 11:09 pm**

Hi Malik. That's exactly what I was looking for, thanks. Give me some time to do something with it. cheers .. Jason

Posted: **Sat Mar 29, 2014 2:19 pm**

Hi Malik

I have a working implementation for you :-)

There's a couple more cases to explore though... would you mind doing your "Set 1" and "Set 2" for the attached docx, please?

Once I've taken those into account (should be quick), I'll upload a nightly for you to try.

thanks .. Jason

Posted: **Sun Mar 30, 2014 7:21 am**

Hi Jason, I have not been able to check the site for the last couple of days. I have just seen your reply and will do it first thing in the morning.

Apologies for the delay

Malik

Posted: **Tue Apr 01, 2014 2:03 am**

Hi Jason,

I believe this is what you asked for.

Set 1 was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to Context.
Set 2 was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to National.

Thanks
Malik

Posted: **Tue Apr 01, 2014 6:23 pm**

Hi Malik

Thanks for that.

You can now try http://www.docx4java.org/docx4j/docx4j- ... 140401.jar

There are 2 properties you can set in your docx4j.properties file; these are designed to mimic the settings you have been experimenting with:

# Value can be 'Context'|'National'
docx4j.MicrosoftWindows.Region.Format.Numbers.NativeDigits=National

# Value can be 'Hindi'|'Context'|'Arabic'|'System'; default is Arabic ie 1234
docx4j.MicrosoftWord.Numeral=Arabic
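As a rough illustration of how settings in this format can be read, the sketch below parses the two lines with `java.util.Properties` and falls back to `Arabic` when the Numeral key is absent, matching the default stated in the comment above. It is not docx4j's own loading code, which is not shown in this thread; `NumeralSettings` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative reader for the two settings; not docx4j's implementation.
final class NumeralSettings {

    static final String NATIVE_DIGITS_KEY =
        "docx4j.MicrosoftWindows.Region.Format.Numbers.NativeDigits";
    static final String NUMERAL_KEY = "docx4j.MicrosoftWord.Numeral";

    /** Parses properties text; lines starting with '#' are treated as comments. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the Numeral setting, defaulting to "Arabic" when it is not set. */
    static String numeral(Properties props) {
        return props.getProperty(NUMERAL_KEY, "Arabic");
    }

    public static void main(String[] args) {
        Properties props = parse(
            "# Value can be 'Context'|'National'\n"
            + NATIVE_DIGITS_KEY + "=National\n"
            + NUMERAL_KEY + "=Hindi\n");
        System.out.println(props.getProperty(NATIVE_DIGITS_KEY));
        System.out.println(numeral(props));
    }
}
```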

let us know how you go!

kind regards .. Jason

Posted: **Sat Apr 12, 2014 7:14 pm**

Hi Jason,

Due to some other activities I have not been able to test the fix you provided. Just one question before I start testing: where should the docx4j.properties file be located in the project? I have attached a screenshot of my simple project.

Thanks
Malik

Posted: **Sun Apr 13, 2014 11:31 am**

docx4j.properties just needs to be on your classpath. Maybe in NetBeans you can add a directory (containing docx4j.properties) to your classpath?
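A quick way to confirm that a file is visible on the classpath is to ask the class loader for it. This is a minimal sketch with hypothetical names (`ClasspathCheck`, `locate`); it only checks visibility from your own application and makes no claim about how docx4j itself looks the file up.

```java
import java.net.URL;

// Illustrative check: is a named resource visible on the application classpath?
final class ClasspathCheck {

    /** Returns the URL the resource would be loaded from, or null if it is not found. */
    static URL locate(String resourceName) {
        return ClasspathCheck.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        URL url = locate("docx4j.properties");
        System.out.println(url == null
            ? "docx4j.properties is NOT on the classpath"
            : "docx4j.properties found at " + url);
    }
}
```

Run it with the same classpath as the conversion program; if it reports the file as missing, the directory containing docx4j.properties still needs to be added.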

Instead of that nightly build, please try the 3.1.0 beta:-

docx-java-f6/docx4j-3-1-0-beta-please-try-it-t1860.html

cheers .. Jason

Arabic number/digits in PDF output
