
Arabic number/digits in PDF output

PostPosted: Sat Mar 15, 2014 6:17 am
by malik
Hi,

I can successfully convert from DOCX to PDF, but the numbers do not appear in Arabic format.

Thanks

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 15, 2014 8:52 am
by jason
If you mean "0, 1, 2, 3, 4, 5, 6, 7, 8, 9", they should appear, so please post a short sample docx illustrating the problem.

If you mean some other numbering system, it may be that support needs to be added in the package https://github.com/plutext/docx4j/tree/ ... tnumbering

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 15, 2014 6:02 pm
by malik
Thanks for your quick response; I really appreciate it. I have attached two pictures: one shows the PDF output, where the numbers appear in English format, and the other shows how the numbers should appear in the PDF output.

Hope this will be clear for your understanding.

Thanks

Re: Arabic number/digits in PDF output

PostPosted: Sun Mar 16, 2014 7:34 pm
by jason
sample docx?

Re: Arabic number/digits in PDF output

PostPosted: Mon Mar 17, 2014 6:25 am
by malik
Please find attached.

Re: Arabic number/digits in PDF output

PostPosted: Mon Mar 17, 2014 10:45 am
by jason
I'd like to fix this this week.

Please note this is a somewhat complex area, and depending how you created test.docx, there is no guarantee that the problems exhibited by it have the same underlying cause as in untitled.png. The most reliable way is to start with your real docx, deleting everything except the test case (as opposed to copy/paste into a new docx).

Created https://github.com/plutext/docx4j/issues/109

Re: Arabic number/digits in PDF output

PostPosted: Wed Mar 19, 2014 6:47 am
by malik
Hi Jason,
Do you want me to upload the original document? In fact, test.docx is a real case. I have a document which has Arabic text as well as numerals. Along with the text, I want the numerals to appear in Arabic format too. An example of the Arabic numerals was given in untitled.png for reference. I hope I have explained the issue clearly enough. Please let me know if you need anything.

Thanks

Re: Arabic number/digits in PDF output

PostPosted: Wed Mar 19, 2014 5:19 pm
by jason
I think things should work as-is, provided you aren't doing something like:

Code: Select all
         fontMapper.getFontMappings().put("Times New Roman", null);
      


That would happen if for example you:

Code: Select all
      PhysicalFonts.setRegex(regex);


where your regex *excludes* "Arial Unicode MS", and then you do:

Code: Select all
      PhysicalFont font
            = PhysicalFonts.getPhysicalFonts().get("Arial Unicode MS");
         // make sure this font is allowed by your regex (if any)!!!
         fontMapper.getFontMappings().put("Times New Roman", font );  // oops!
      

Re: Arabic number/digits in PDF output

PostPosted: Thu Mar 20, 2014 7:55 am
by malik
I tried many options, e.g. using fonts other than Arial Unicode MS, but in vain. I have pasted my code below in case I am missing anything.
Code: Select all
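        import java.io.OutputStream;
        import org.docx4j.Docx4J;
        import org.docx4j.convert.out.FOSettings;
        import org.docx4j.fonts.IdentityPlusMapper;
        import org.docx4j.fonts.Mapper;
        import org.docx4j.fonts.PhysicalFont;
        import org.docx4j.fonts.PhysicalFonts;
        import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

        // The statements below sit inside a method that declares "throws Exception".
        // saveFO is a boolean flag declared elsewhere in the class.
        String inputfilepath = "c:\\users\\malik\\desktop\\test2.docx"; // args[0];

        // Document loading (required)
        WordprocessingMLPackage wordMLPackage;
        wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

        // Map the docx font "Times New Roman" to the physical font "Arial Unicode MS"
        Mapper fontMapper = new IdentityPlusMapper();
        PhysicalFont font  = PhysicalFonts.getPhysicalFonts().get("Arial Unicode MS");
        fontMapper.getFontMappings().put("Times New Roman", font );

        wordMLPackage.setFontMapper(fontMapper);

        FOSettings foSettings = Docx4J.createFOSettings();

        // Optionally dump the intermediate XSL-FO next to the input file
        if (saveFO)
        {
            foSettings.setFoDumpFile(new java.io.File(inputfilepath + ".fo"));
        }
        foSettings.setWmlPackage(wordMLPackage);

        // exporter writes to an OutputStream.
        String outputfilepath;
        outputfilepath = "test.pdf";

        OutputStream os = new java.io.FileOutputStream(outputfilepath);

        Docx4J.toFO(foSettings, os, Docx4J.FLAG_NONE);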



I have also included the contents of the FO file for your reference.
Code: Select all
<?xml version="1.0" encoding="utf-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <layout-master-set xmlns="http://www.w3.org/1999/XSL/Format">
    <simple-page-master margin-bottom="0.25in" margin-left="1in" margin-right="1in" margin-top="0.25in" master-name="s1-simple" page-height="11in" page-width="8.5in">
      <region-body margin-bottom="19mm" margin-left="0mm" margin-right="0mm" margin-top="19mm"/>
      <region-before extent="12mm" region-name="xsl-region-before-simple"/>
      <region-after extent="12mm" region-name="xsl-region-after-simple"/>
    </simple-page-master>
    <page-sequence-master master-name="s1">
      <repeatable-page-master-alternatives>
        <conditional-page-master-reference master-reference="s1-simple"/>
      </repeatable-page-master-alternatives>
    </page-sequence-master>
  </layout-master-set>
  <fo:page-sequence id="section_s1" format="" master-reference="s1">
    <fo:flow flow-name="xsl-region-body">
      <fo:block break-before="auto" font-size="16.0pt" line-height="100%" space-after="0in" space-before="0in"><inline xmlns="http://www.w3.org/1999/XSL/Format" font-size="16.0pt" writing-mode="rl-tb"><inline font-family="Arial">وثيقة تجربة : 123456789</inline></inline></fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>

Re: Arabic number/digits in PDF output

PostPosted: Thu Mar 20, 2014 9:53 am
by jason
The docx uses Times New Roman for Arabic. Assuming that's on your computer (it ought to be), just delete the lines:

Code: Select all
        PhysicalFont font  = PhysicalFonts.getPhysicalFonts().get("Arial Unicode MS");
        fontMapper.getFontMappings().put("Times New Roman", font );


What version of docx4j are you using? Try http://www.docx4java.org/docx4j/docx4j- ... 140314.jar (which is more recent than 3.0.1)

Re: Arabic number/digits in PDF output

PostPosted: Fri Mar 21, 2014 8:43 am
by malik
Sorry Jason, I tried commenting out the two lines you pointed out, and also I tried by changing the digit substitution to be Arabic digits. Even now, in the FO file they appear as Arabic digits, but in the final PDF output they still appear as non-Arabic digits.

Re: Arabic number/digits in PDF output

PostPosted: Fri Mar 21, 2014 12:54 pm
by jason
When I open your test.pdf in Adobe Reader, I see Arabic:

malik.png


Do you see something different?

The font is embedded in the PDF (so reported by Adobe Reader), so that shouldn't be the issue. What PDF viewer are you using?

malik wrote: I tried by changing the digit substitution to be Arabic digits.


Not sure what you mean?

Re: Arabic number/digits in PDF output

PostPosted: Fri Mar 21, 2014 5:22 pm
by malik
Hi Jason, Right, the text appears in Arabic, and has done since day 1. But the digits do not appear in Arabic, and that has been my problem since the start of this thread. I might not have explained my problem properly, but I hope you now have an idea of what the actual issue is.

Re: Arabic number/digits in PDF output

PostPosted: Fri Mar 21, 2014 5:41 pm
by jason
Nope, sorry, I'm completely confused!

I see the digits "123456789" in the Word docx, in your FO (posted Thu Mar 20, 2014 7:55 am), and in the PDF output. And that's what I'd expect.

Do you see something different when you open your test.docx in Word on your PC? Do you have the Windows Arabic language pack installed?

Re: Arabic number/digits in PDF output

PostPosted: Fri Mar 21, 2014 6:38 pm
by malik
I have attached three more pictures:
1- word.png, which shows the digits in Arabic format when I open the document on my machine.
2- systemlocale.png, which shows my machine's current language and locale settings.
3- ouput.png, which shows the output PDF, where the digits are not converted to Arabic digits (bidi).

Thanks
Malik

Re: Arabic number/digits in PDF output

PostPosted: Fri Mar 21, 2014 9:34 pm
by jason
OK now we're getting somewhere.

You mean what are called "EASTERN Arabic" numerals in English: http://en.wikipedia.org/wiki/Eastern_Arabic_numerals
or in Unicode, Arabic-Indic: http://stackoverflow.com/questions/1676 ... bic-digits
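
For reference, the Arabic-Indic digits occupy U+0660 to U+0669 in Unicode, so mapping the Western digits 0-9 onto them is a fixed offset. The following is a minimal Java sketch of that mapping only; the class and method names are made up for illustration, and it is not docx4j code.

```java
// Illustrative only: not part of docx4j.
class ArabicIndicDigits {

    /**
     * Replaces the ASCII digits 0-9 with the Arabic-Indic digits
     * U+0660..U+0669. All other characters are copied unchanged.
     */
    static String toArabicIndic(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                // Same offset from the zero digit in both ranges
                out.append((char) ('\u0660' + (c - '0')));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Whether this prints legibly depends on the console encoding and font
        System.out.println(toArabicIndic("وثيقة تجربة : 123456789"));
    }
}
```

Whether and when such a substitution should be applied is a separate question, which is what the rest of this post is about.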

The puzzle is in 3 parts:
1. understanding when Word uses Eastern Arabic/Arabic-Indic
2. how to tell XSL-FO / FOP to use it
3. deciding when docx4j should use Eastern Arabic/Arabic-Indic (a setting, computer locale, or what?)

In this post I deal with point 1.

What does your instance of Word do, exactly? Could you please open the attached docx in Word, then post a screen shot of what you see?

In your Word "Options", choose Hindi numerals:
- Click on "Options", then click on "Additional" (or "Miscellaneous"?) on the left, the one under "Language".
- Scroll down to "Show content" (headline no. 3). There, you will find "Numerals". You can choose between "Hindi", "Context", "Arabic", and "System".
What setting do you have there?
See further:
[1] http://answers.microsoft.com/en-us/offi ... b04bd0b4b9

Note the quote:
Code: Select all
Something you should bear in mind when using these options is that they are Word options. They aren't stored in the document at all. If you save and close your document, modify the options, and re-open, the new options will be in force. If you send the document to another user, their options will be in force. Since this isn't something I do, it's difficult to know whether this is what Word users writing in Arabic/English (say) expect (or perhaps even "have become resigned to") or whether they would be surprised that their numbers might appear differently to a recipient.


[2] http://superuser.com/questions/182039/h ... ithin-word

So maybe the answer to point 3 is that docx4j should have a property setting like Hindi_Numerals="Hindi"|"Context"|"Arabic"|"System"

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 22, 2014 2:47 am
by malik
I had already tried the setting you pointed out, but I opened the Word options again to confirm it. I have attached a picture of the Word options on my machine. I also tried the Word document you provided in your previous post. All of them gave the same result, i.e. the numerals are not converted to the so-called Hindi numerals. I may be wrong, but could you double-check the library code? Perhaps the numerals are skipped or ignored when the characters are converted to the desired Unicode format.

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 22, 2014 7:04 am
by jason
The docx I uploaded is exactly the one you provided, but with 2 other permutations of rPr:

    <w:p >
      <w:pPr>
        <w:rPr>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="cs"/>                <!-- all as provided originally by Malik -->
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
          <w:rtl/>
        </w:rPr>
        <w:t>وثيقة تجربة : 123456789</w:t>
      </w:r>
    </w:p>


    <w:p >
      <w:pPr>
        <w:rPr>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="cs"/>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>           <!-- omit rtl, see what happens -->
        </w:rPr>
        <w:t>وثيقة تجربة : 123456789</w:t>
      </w:r>
    </w:p>

    <w:p >
      <w:pPr>
        <w:rPr>
          <w:sz w:val="32"/>
          <w:szCs w:val="32"/>
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:sz w:val="32"/>                <!-- omit rfonts hint, see what happens -->
          <w:szCs w:val="32"/>
          <w:rtl/>
        </w:rPr>
        <w:t>وثيقة تجربة : 123456789</w:t>
      </w:r>
    </w:p>
 


So the first paragraph at least should have appeared for you as before, since it is exactly the same! You'll need to explore why not please.

The point is that docx4j does not and should not automatically convert to Eastern Arabic (Hindi). Notice Word stores the numbers in 123 format (ie not Eastern Arabic). (Maybe even this post is hard for you to interpret correctly, if your PC is automatically converting numbers to Eastern Arabic? You may need to change its locale temporarily).
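
Because a PC configured for native digits may display stored "123" as Eastern Arabic, looking at the code points is a more reliable check than looking at the screen. A small Java sketch, assuming only the two digit ranges discussed in this thread (ASCII 0-9 and Arabic-Indic U+0660 to U+0669); the class and method names are hypothetical:

```java
// Illustrative only: reports which digit family a string actually contains.
class DigitInspector {

    /** Returns "western", "arabic-indic", "mixed" or "none". */
    static String describeDigits(String text) {
        boolean western = false;
        boolean arabicIndic = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                western = true;
            } else if (c >= '\u0660' && c <= '\u0669') {
                arabicIndic = true;
            }
        }
        if (western && arabicIndic) return "mixed";
        if (western) return "western";
        if (arabicIndic) return "arabic-indic";
        return "none";
    }

    public static void main(String[] args) {
        // The run text from the sample paragraphs above
        System.out.println(describeDigits("وثيقة تجربة : 123456789"));
    }
}
```

Running the same check on the text extracted from the docx, the FO file and the PDF would show at which stage, if any, the digits change.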

I can probably make it mimic Word's behaviour, once we know what that is.

If you can't make sense of my malik_arabic_numbering.docx, instead please make a screen copy of your test.docx, using each of the different possible settings for the Numeral option (ie Arabic, Hindi, Context, System). I guess you've done Hindi already.

I suspect "context" will be something like: Digits/numerals next to (before or after?) Latin text are represented in Arabic-Western digits, while those next to Arabic text are represented in Arabic-Indic representation. So you may need to make your example more complex.
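
To make that guess concrete, here is a Java sketch of one possible reading of "context": each digit follows the script of the nearest preceding letter, and stays Western if no letter has been seen yet. This is an assumption for illustration only. It is not a description of what Word actually does, nor of anything in docx4j, and the class and method names are made up.

```java
// Illustrative only: one hypothetical interpretation of a "Context" numeral setting.
class ContextDigitsSketch {

    static String applyContext(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean arabicContext = false; // no letter seen yet: keep Western digits
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // The most recent letter decides the form of the digits that follow
                arabicContext =
                    Character.UnicodeScript.of(c) == Character.UnicodeScript.ARABIC;
                out.append(c);
            } else if (arabicContext && c >= '0' && c <= '9') {
                // Map to the Arabic-Indic range U+0660..U+0669
                out.append((char) ('\u0660' + (c - '0')));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(applyContext("وثيقة تجربة : 123456789"));
        System.out.println(applyContext("Test document : 123456789"));
    }
}
```

Screenshots from Word with mixed Latin and Arabic text around the digits would show whether the real rule looks backwards, forwards, or at something else entirely.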

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 22, 2014 8:51 am
by malik
I agree with your point that

The point is that docx4j does not and should not automatically convert to Eastern Arabic (Hindi)


Docx4j should provide an option to explicitly set how numerals are converted in the output PDF, according to the user's choice.
By the way I tried all possible settings for the Numeral option in word (i.e. Arabic, Hindi, Context, System) but in all cases the output PDF stayed same. I also tried changing my machine's locale to something other than Arabic, but with no success.

Regards
Malik

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 22, 2014 10:56 am
by jason
malik wrote:By the way I tried all possible settings for the Numeral option in word (i.e. Arabic, Hindi, Context, System) but in all cases the output PDF stayed same.


Of course it would.

Right now, I'm interested in what difference that makes to what you see on the screen in WORD.

Once we know how WORD behaves, we can look to mimic that in docx4j's PDF output.

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 22, 2014 7:58 pm
by malik
Ok, I have uploaded two sets of pictures.

Set One
NumerialArabic1.png
NumerialSystem1.png
NumerialContext1.png
NumerialHindi1.png

The above set was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to Context.

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 22, 2014 8:01 pm
by malik
I am attaching the second set in a new post, since the forum does not allow more than 5 attachments.
Set Two
NumerialArabic2.png
NumerialSystem2.png
NumerialContext2.png
NumerialHindi2.png

The above set was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to National.

I hope this gives you a clear idea of how to deal with these situations. Please let me know if you need anything else.

Re: Arabic number/digits in PDF output

PostPosted: Tue Mar 25, 2014 1:50 am
by malik
Hi Jason,

Do you still need anything from me, or is what I posted in my last two replies not what you were looking for?

Thanks
Malik

Re: Arabic number/digits in PDF output

PostPosted: Tue Mar 25, 2014 11:09 pm
by jason
Hi Malik. That's exactly what i was looking for, thanks. Give me some time to do something with it. cheers .. Jason

Re: Arabic number/digits in PDF output

PostPosted: Sat Mar 29, 2014 2:19 pm
by jason
Hi Malik

I have a working implementation for you :-)

There's a couple more cases to explore though... would you mind doing your "Set 1" and "Set 2" for the attached docx, please?

Once I've taken those into account (should be quick), I'll upload a nightly for you to try.

thanks .. Jason

Re: Arabic number/digits in PDF output

PostPosted: Sun Mar 30, 2014 7:21 am
by malik
Hi Jason, I was not able to check the site for the last couple of days. I have just seen your reply and will do it first thing in the morning.

Apologies for the delay

Malik

Re: Arabic number/digits in PDF output

PostPosted: Tue Apr 01, 2014 2:03 am
by malik
Hi Jason,

I believe this is what you asked for.

Set 1 was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to Context
Set 2 was created with digit substitution ( Control Panel -> Region and Language -> Additional Settings -> Use Native Digits ) set to National

Thanks
Malik

Re: Arabic number/digits in PDF output

PostPosted: Tue Apr 01, 2014 6:23 pm
by jason
Hi Malik

Thanks for that.

You can now try http://www.docx4java.org/docx4j/docx4j- ... 140401.jar

There are 2 properties you can set in your docx4j.properties file; these are designed to mimic the settings you have been experimenting with:

# Value can be 'Context'|'National'
docx4j.MicrosoftWindows.Region.Format.Numbers.NativeDigits=National

# Value can be 'Hindi'|'Context'|'Arabic'|'System'; default is Arabic ie 1234
docx4j.MicrosoftWord.Numeral=Arabic
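
The two entries use ordinary Java properties syntax, so lines starting with # are comments. The sketch below parses such text with java.util.Properties purely to illustrate the file format and the stated default of Arabic for the Numeral setting. How docx4j reads these values internally is not shown in this thread, and the class and method names here are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: shows the properties-file syntax, not docx4j's own loading code.
class NumeralSettingsSketch {

    static final String NATIVE_DIGITS_KEY =
        "docx4j.MicrosoftWindows.Region.Format.Numbers.NativeDigits";
    static final String NUMERAL_KEY = "docx4j.MicrosoftWord.Numeral";

    /** Parses properties-file text; '#' lines are treated as comments. */
    static Properties parse(String fileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Numeral setting, falling back to "Arabic" (i.e. 1234) when absent. */
    static String numeral(Properties props) {
        return props.getProperty(NUMERAL_KEY, "Arabic");
    }

    /** NativeDigits setting, or null when absent (no default is stated above). */
    static String nativeDigits(Properties props) {
        return props.getProperty(NATIVE_DIGITS_KEY);
    }

    public static void main(String[] args) {
        String contents =
            "# Value can be 'Context'|'National'\n"
          + NATIVE_DIGITS_KEY + "=National\n"
          + "# Value can be 'Hindi'|'Context'|'Arabic'|'System'\n"
          + NUMERAL_KEY + "=Hindi\n";
        Properties props = parse(contents);
        System.out.println(nativeDigits(props) + " / " + numeral(props));
    }
}
```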

let us know how you go!

kind regards .. Jason

Re: Arabic number/digits in PDF output

PostPosted: Sat Apr 12, 2014 7:14 pm
by malik
Hi Jason,

Due to some other activities I have not been able to test the fix you provided. Just one question before I start testing: where should the docx4j.properties file be located in the project? I have attached a screenshot of my simple project.

Thanks
Malik

Re: Arabic number/digits in PDF output

PostPosted: Sun Apr 13, 2014 11:31 am
by jason
docx4j.properties just needs to be on your classpath. Maybe in Netbeans you can add a directory (containing docx4j.properties) to your classpath?
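
One quick way to confirm the file is visible is to ask the class loader for it from inside the same project. A minimal Java sketch using only the standard library; the class name is made up, and it assumes the check runs with the same classpath as the conversion code:

```java
import java.net.URL;

// Illustrative only: checks whether a resource can be found on the classpath.
class ClasspathCheck {

    /** Returns the resource's URL, or null if it is not on the classpath. */
    static URL locate(String resourceName) {
        return ClasspathCheck.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        URL url = locate("docx4j.properties");
        if (url == null) {
            System.out.println("docx4j.properties was NOT found on the classpath");
        } else {
            System.out.println("Found: " + url);
        }
    }
}
```

If it prints a URL, the file is in a place the class loader can see; if not, the directory containing it still needs to be added to the project's classpath.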

Instead of that nightly build, please try the 3.1.0 beta:-

docx-java-f6/docx4j-3-1-0-beta-please-try-it-t1860.html

cheers .. Jason