Convert html to docx with RTL for hebrew/arabic language

Posted: Mon Jul 23, 2018 9:48 pm
by kfir91
Hello all,
I am trying to convert an HTML string to a docx file.
It works fine, but when I enter Hebrew words and RTL text, the result is not correct.

My code:

Code: Select all
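import java.util.List;

import org.docx4j.XmlUtils;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart;
import org.docx4j.wml.BooleanDefaultTrue;
import org.docx4j.wml.Tbl;

public class HtmlToDocx {

   public static void main(String[] args) throws Exception {
      String html = "<html><head><title>Import me</title></head><body style=\"text-align: right;\"><table border=\"1\"><tbody><tr><td width=\"100\">מס'</td><td>נסיון:</td><td width=\"200\">לורם איפסום דולור סיט אמט, קונסקטורר אדיפיסינג אלית ושבעגט ליבם סולגק. בראיט</td></tr></tbody></table></body></html>";
      docx4j(html);
   }

   private static void docx4j(String html) throws Exception {
      WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

      // Add a numbering part so that any lists in the XHTML can be imported
      NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
      wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
      ndp.unmarshalDefaultNumbering();

      XHTMLImporterImpl xhtmlImporter = new XHTMLImporterImpl(wordMLPackage);
      List<Object> content = xhtmlImporter.convert(html, null);

      // Mark the imported table as visually right to left (w:bidiVisual)
      Object first = XmlUtils.unwrap(content.get(0));
      if (first instanceof Tbl && ((Tbl) first).getTblPr() != null) {
         ((Tbl) first).getTblPr().setBidiVisual(new BooleanDefaultTrue());
      }

      wordMLPackage.getMainDocumentPart().getContent().addAll(content);

      // Dump the generated document XML for inspection
      System.out.println(XmlUtils.marshaltoString(
            wordMLPackage.getMainDocumentPart().getJaxbElement(), true, true));

      wordMLPackage.save(new java.io.File("C:\\test\\sample.docx"));
      System.out.println("done");
   }
}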


It looks like this:
Image

but it needs to look like this:

Image

The ' and : need to be on the left side of the text, not the right side.

How can I set the whole document to RTL?

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Tue Jul 24, 2018 9:55 am
by jason
Hi there, interesting question :-)

Typically, a web page would indicate RTL like so:

Code: Select all
<p dir="rtl">ליצור מהרשת רשת כלל עולמית באמת!</p>


see https://www.w3.org/International/articl ... unctuation

Or use CSS: direction: rtl

docx4j-ImportXHTML uses Flying Saucer under the covers to process the XHTML (we read its css), but unfortunately, it doesn't pass the direction property. See https://groups.google.com/forum/#!topic ... lSAbpvP-zY and https://groups.google.com/forum/#!topic ... 0CfuYfpQ6I

So, 2 options.

Option 1: patch Flying Saucer to keep the RTL info. (It doesn't need to do anything with it for our purposes.) See further our fork at https://github.com/plutext/flyingsaucer and https://github.com/flyingsaucerproject/ ... r/pull/138

(Interestingly, see also https://github.com/danfickle/openhtmltopdf/issues/9 which is based on FS...)

Option 2: if the current span contains Arabic or Hebrew Unicode characters, assume it should be RTL. This might be quick and easy, but would that general rule produce the expected results?

Consider the case where all characters (except punctuation?) are Arabic or Hebrew. Is it OK to assume this should always be RTL?

Consider the case where there is a mix of Arabic or Hebrew and English. What to do here? Should we segment and treat the English as LTR and the Arabic/Hebrew as RTL?
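
To make the two cases above concrete, here is a minimal sketch of what the Option 2 check could look like using only the JDK. The class and method names (RtlHeuristic, containsRtl, isPureRtl) are made up for illustration and are not part of docx4j-ImportXHTML. It treats punctuation and digits as non-strong characters, which is an assumption about how the rule should behave, not a settled answer.

```java
public class RtlHeuristic {

    // True for the two strong right-to-left classes (Hebrew-style R and Arabic-style AL).
    private static boolean isStrongRtl(byte directionality) {
        return directionality == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || directionality == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Case "span contains Arabic or Hebrew": at least one strong RTL character.
    static boolean containsRtl(String text) {
        return text.codePoints()
                   .anyMatch(cp -> isStrongRtl(Character.getDirectionality(cp)));
    }

    // Case "all characters except punctuation are Arabic or Hebrew":
    // at least one strong RTL character and no strong LTR character.
    // Digits, spaces and punctuation are not strong, so they are ignored here.
    static boolean isPureRtl(String text) {
        boolean sawRtl = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            byte d = Character.getDirectionality(cp);
            if (d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return false; // mixed text: needs segmenting, not a blanket RTL flag
            }
            if (isStrongRtl(d)) {
                sawRtl = true;
            }
            i += Character.charCount(cp);
        }
        return sawRtl;
    }

    public static void main(String[] args) {
        // Hebrew written with Unicode escapes so the source file encoding does not matter.
        String hebrewWithColon = "\u05E0\u05E1\u05D9\u05D5\u05DF:";
        String mixed = "\u05E9\u05DC\u05D5\u05DD abcd";
        System.out.println(isPureRtl(hebrewWithColon));
        System.out.println(containsRtl(mixed) + " / " + isPureRtl(mixed));
    }
}
```

A span for which isPureRtl returns true could simply be flagged RTL; a span where only containsRtl is true is the mixed case that raises the segmentation question.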

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Wed Jul 25, 2018 3:19 am
by kfir91
Hello Jason,
Thanks for your reply.
Actually, I didn't understand what I need to do.

When I take this piece of HTML:
<p dir="rtl">
שלום מה נשמע? abcd אבג
</p>
with Hebrew and English, it works great in a simple HTML page.
But when I convert it to a docx with docx4j, there is a problem with special characters, as I demonstrated above.

How can I fix this problem with docx4j and my code?

Thanks

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Wed Jul 25, 2018 6:42 am
by jason
Could you answer these questions for me, please? :-)

Consider the case where a string of characters (except punctuation?) are Arabic or Hebrew. Is it OK to assume this should always be RTL?

Consider the case where there is a mix of Arabic or Hebrew and English. What to do here? Should we segment and treat the English as LTR and the Arabic/Hebrew as RTL?

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Thu Jul 26, 2018 6:18 am
by kfir91
Hello,
Yes, it's always RTL.

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Thu Jul 26, 2018 7:25 am
by jason
OK, I'll write something over coming days.

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Fri Jul 27, 2018 10:33 pm
by kfir91
Thanks Jason, I'm waiting for any update.

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Sat Jul 28, 2018 11:07 am
by jason
Consider:

Code: Select all
<span>ליצור מהרשת 1234  רשת כלל עולמית באמת!</span>


Is the number to be handled like Hebrew/Arabic (i.e. RTL), or like English (LTR)?

Consider a number between some hebrew/arabic and some English, for example [hebrew/arabic]1234[English]

Is the number to be considered RTL or LTR? Is there a convention, or is it ambiguous?
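
For reference, one way to see how the Unicode Bidirectional Algorithm itself resolves such a string is the JDK's java.text.Bidi class. The sketch below is only an illustration: the class name BidiRunsSketch and the runs helper are hypothetical, and it assumes the paragraph base direction is RTL. It splits the text into directional runs in logical order and labels each one.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

public class BidiRunsSketch {

    // Splits text into directional runs (logical order), assuming an RTL paragraph.
    // Each entry is prefixed with "RTL:" or "LTR:" according to the run's embedding level
    // (odd levels are right to left, even levels are left to right).
    static List<String> runs(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_RIGHT_TO_LEFT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            boolean rtl = (bidi.getRunLevel(i) % 2) == 1;
            String run = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            result.add((rtl ? "RTL:" : "LTR:") + run);
        }
        return result;
    }

    public static void main(String[] args) {
        // [Hebrew] 1234 [English]; Hebrew written with Unicode escapes.
        String text = "\u05E9\u05DC\u05D5\u05DD 1234 abc";
        for (String run : runs(text)) {
            System.out.println(run);
        }
    }
}
```

Under that algorithm the digits themselves are laid out left to right even inside an RTL paragraph, while their position relative to the neighbouring text follows the surrounding direction.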

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Sat Jul 28, 2018 7:17 pm
by kfir91
Hello,
This is the correct layout of a Hebrew sentence with numbers:

Image

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Wed Aug 01, 2018 1:31 pm
by jason
You can try https://docx4java.org/docx4j/docx4j-Imp ... 180801.jar

It contains https://github.com/plutext/docx4j-Impor ... f378022303

Could you please take a look at https://github.com/plutext/docx4j-Impor ... iTest.java and add additional tests for mixed Hebrew/Arabic and left-to-right text, especially for any cases where you feel the implementation is not correct?

Re: Convert html to docx with RTL for hebrew/arabic language

Posted: Mon Aug 06, 2018 8:43 pm
by kfir91
Thanks a lot, Jason.