Doubt in applying p tag styles while converting docx to html

Posted: Thu Nov 12, 2009 9:48 am
by siva19185
Hi, I am converting some text content from docx to HTML. I have 3 doubts about handling it.

1. How do I handle multiple spaces between words? I just see <w:t xml:space="preserve"> in document.xml, but I am unable to get any extra space through your conversion.

2. How do I get the line-height to set on the p tag, i.e. how is line spacing handled in the docx? Where are the line spacing details in document.xml?

3. Should any bottom padding be set on the <p> tag by default to show space between lines?

Thanks in advance for your help.

Re: Doubt in applying p tag styles while convert docx to html.

Posted: Thu Nov 12, 2009 11:04 am
by jason
Hi

Are you using HtmlExporterNG2? This is now our preferred approach, and I'm happy to work through the issues you mention using that.

For doubts 2 & 3, it would be handy if you identified the WordML property you are interested in, and a CSS property which would implement it.

Also, please attach a short sample docx which exhibits all 3 problems; that will save me time and ensure we are talking about the same problems.

thanks .. Jason

Re: Doubt in applying p tag styles while convert docx to html.

Posted: Thu Nov 12, 2009 11:21 am
by siva19185
Hi Jason,
Thanks for your quick reply.
Sorry, currently I am using only HtmlExporterNG. But how do I handle the empty space between words, other than tabs? A tab we can identify, but how do we measure the number of spaces between words when there is more than one? In document.xml I only find <w:t xml:space="preserve"> for sentences that have more than one space between their words!


Relating to questions 2 and 3: I made a docx and converted it to HTML using Word itself. A line spacing of 3 in Microsoft Word is converted to line-height:300% in the HTML, a line spacing of 2 is converted to line-height:200%, and so on. This is applied in the <p> tag's style.
Similarly, I found a default margin: 0 0 10pt applied to the <P> tag in the HTML that Word generates for every document.

Both were missing in your HTML conversion.

Thanks
Siva.

Re: Doubt in applying p tag styles while converting docx to html

Posted: Thu Nov 12, 2009 2:54 pm
by jason
siva19185 wrote:Sorry currently i am using only HtmlExporterNG.


ok, but any fixes I work on will be on NG2; I'll port them to NG only if it is easy.

siva19185 wrote:But how do I handle the empty space between words, other than tabs? A tab we can identify, but how do we measure the number of spaces between words when there is more than one? In document.xml I only find <w:t xml:space="preserve"> for sentences that have more than one space between their words!


Word will put @xml:space="preserve" in certain circumstances (where, under the XML spec, the whitespace is significant).

One is where you have more than one w:r in a w:p, and one of the w:r ends in a space.

Your example also makes sense.

I had a look at Word 2007's HTML output. The HTML output doesn't contain anything special unless there are adjacent spaces. In this case, Word outputs <span style='mso-spacerun:yes'> </span>

You or I need to do some experiments to see whether various browsers collapse multiple spaces to 1, whether @style='mso-spacerun:yes' makes any difference, and whether the browser's parsing mode (strict or tag soup) makes a difference. If you could look into this, I can reflect your findings in the XSLT. I guess we can try &nbsp;

I think tabs are more of a challenge!

siva19185 wrote:Relating to questions 2 and 3: I made a docx and converted it to HTML using Word itself. A line spacing of 3 in Microsoft Word is converted to line-height:300% in the HTML, a line spacing of 2 is converted to line-height:200%, and so on. This is applied in the <p> tag's style.
Similarly, I found a default margin: 0 0 10pt applied to the <P> tag in the HTML that Word generates for every document.
I couldn't find which part of document.xml represents the above styles.


line-height:115% comes from w:spacing/@w:line="276"/240. See the spec. The w:spacing/@w:line="276" is from w:pPrDefault in the styles part.

I don't know where the magic number 240 comes from. Maybe it's that we're using 11pt font + 1pt x 20? Experimenting with other font sizes might explain this.

Interestingly, Microsoft's xslt uses a hardcoded /20 (resulting unit is pt):

Code:
  <xsl:template match="w:spacing[@w:lineRule or @w:line]" mode="ppr">
    <xsl:choose>
      <xsl:when test="not(@w:lineRule) or @w:lineRule = 'exact'">
        line-height:<xsl:value-of select="@w:line div 20"/>pt;
      </xsl:when>
    </xsl:choose>
  </xsl:template>
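
For reference, the two readings of w:line discussed here can be compared in a small sketch. As I read the w:spacing description in the ECMA-376 spec, when w:lineRule is auto (or absent) the value is in 240ths of a line, which would account for the divisor 240; with exact or atLeast it is in twentieths of a point, which matches the /20 in the XSLT above. The function name line_height_css is hypothetical and Python is used only for brevity; this is not docx4j's code.

```python
def line_height_css(line, line_rule="auto"):
    """Map w:spacing/@w:line to a CSS line-height value (illustrative only)."""
    if line_rule == "auto":
        # 'auto': w:line counts 240ths of a single line, so 240 -> 100%
        return f"{line * 100 / 240:g}%"
    # 'exact' (and, as an approximation, 'atLeast'): w:line is in twips,
    # i.e. twentieths of a point. CSS line-height cannot express a minimum,
    # so 'atLeast' is treated like 'exact' here.
    return f"{line / 20:g}pt"


print(line_height_css(276))           # Word 2007 default from w:pPrDefault
print(line_height_css(480))           # double spacing
print(line_height_css(240, "exact"))  # exactly 12pt
```

With these inputs, 276 maps to 115%, 480 to 200%, and an exact 240 to 12pt.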


siva19185 wrote:Similarly, I found a default margin: 0 0 10pt applied to the <P> tag in the HTML that Word generates for every document.
I couldn't find which part of document.xml represents the above styles.


The margin-bottom comes from w:spacing/@w:after="200", again, in w:pPrDefault. 200 means 200 twips.
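
The arithmetic from 200 to 10pt is a single division, since a twip is a twentieth of a point. A minimal sketch, with hypothetical function names, that reproduces the margin Word writes:

```python
def twips_to_pt(twips):
    """Convert twips (twentieths of a point) to points."""
    return twips / 20


def p_margin_css(after_twips):
    """Build the 'margin' shorthand Word emits: no top or side margin, bottom from w:after."""
    return f"margin: 0 0 {twips_to_pt(after_twips):g}pt;"


print(p_margin_css(200))  # the w:pPrDefault value discussed above
```

So w:after="200" corresponds to the margin: 0 0 10pt seen in Word's HTML output.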

NG2 is not using w:pPrDefault; I will fix that. (iirc, NG does, but it doesn't handle the @w:line)

cheers

Jason

Re: Doubt in applying p tag styles while converting docx to html

Posted: Thu Nov 12, 2009 3:33 pm
by siva19185
Hi Jason,
Thanks for your reply. I will move to your NG2 Java class.

Regarding space between words: even if we add &nbsp; to solve it, how will we know how many &nbsp; to add if the words in a sentence have different numbers of spaces between them?
By the way, <span style='mso-spacerun:yes'> </span> didn't work for me. Which browser did you try it in? I think they are using an image inside the span, because browsers ignore any amount of extra space between words, yet the MS Word HTML o/p reproduces the spacing exactly as in MS Word. Any idea how exactly they get the space into the browser's o/p?

Thanks,
Siva.

Re: Doubt in applying p tag styles while converting docx to html

Posted: Fri Nov 13, 2009 3:35 am
by jason
siva19185 wrote:Regarding space between words: even if we add &nbsp; to solve it, how will we know how many &nbsp; to add if the words in a sentence have different numbers of spaces between them?


We'd add one per space; either via XSLT or an extension function (though some browsers might still collapse multiple nbsp? can you do some testing to see what modern browsers do?)
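
A minimal sketch of the "one per space" idea, in Python rather than XSLT for brevity. It assumes only runs of two or more spaces need replacing, since browsers render a single space anyway; the function name spaces_to_nbsp is hypothetical.

```python
import re
from html import escape


def spaces_to_nbsp(text):
    """Escape text for HTML, then turn every space in a run of 2+ spaces into &nbsp;."""
    escaped = escape(text, quote=False)
    # Single spaces are left alone so the browser can still wrap lines there.
    return re.sub(r" {2,}", lambda m: "&nbsp;" * len(m.group()), escaped)


sample = "a" + " " * 3 + "b c"
print(spaces_to_nbsp(sample))
```

One trade-off: a run made entirely of &nbsp; offers no line-break opportunity, so a variant could keep the last space of each run as a normal space.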

siva19185 wrote:By the way, <span style='mso-spacerun:yes'> </span> didn't work for me. Which browser did you try it in? I think they are using an image inside the span, because browsers ignore any amount of extra space between words, yet the MS Word HTML o/p reproduces the spacing exactly as in MS Word. Any idea how exactly they get the space into the browser's o/p?


I saw <span style='mso-spacerun:yes'> in view-source in Google Chrome; I assume it was in the HTML generated by Word 2007 when I had multiple spaces together, but maybe not!

I haven't seen an image used as a space.

What is "o/p"? Output?

siva19185 wrote:
line-height:115% comes from w:spacing/@w:line="276"/240. See the spec. The w:spacing/@w:line="276" is from w:pPrDefault in the styles part.

Regarding this, I don't think they are hardcoding anything. For example, if I have 3 separate lines in MS Word and apply different line spacings of 1.5, 2 and 3, they appear exactly as 115%, 200%, 300% on the <P> tags of their HTML o/p. But style.xml contains the same value all the time, and document.xml doesn't change either to show that each sentence has a different line spacing.


Then the divisor 240 seems to be hardcoded. (w:spacing/@w:line can be in w:p/w:pPr, or somewhere in the style hierarchy, or in the document defaults.)

NG2 now supports this.
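
The three places a w:spacing attribute can come from suggest a lookup order: direct formatting on the paragraph, then the paragraph style and whatever it is based on (w:basedOn), then the document defaults. The sketch below illustrates that order only; the dictionary layout and the name effective_spacing are hypothetical and do not describe how NG2 stores styles.

```python
def effective_spacing(attr, direct, style_id, styles, doc_defaults):
    """Resolve one w:spacing attribute: direct formatting, style chain, then doc defaults."""
    if attr in direct:
        return direct[attr]
    seen = set()  # guards against a cyclic basedOn chain
    while style_id is not None and style_id not in seen:
        seen.add(style_id)
        style = styles.get(style_id, {})
        if attr in style.get("spacing", {}):
            return style["spacing"][attr]
        style_id = style.get("basedOn")
    return doc_defaults.get(attr)


# Hypothetical data: 'Quote' is based on 'Normal' and overrides only w:after.
styles = {
    "Normal": {"basedOn": None, "spacing": {}},
    "Quote": {"basedOn": "Normal", "spacing": {"after": 120}},
}
doc_defaults = {"line": 276, "after": 200}

print(effective_spacing("line", {}, "Normal", styles, doc_defaults))
print(effective_spacing("line", {"line": 480}, "Normal", styles, doc_defaults))
print(effective_spacing("after", {}, "Quote", styles, doc_defaults))
```

A paragraph with no direct w:line and a style that says nothing about it falls through to the 276 in the defaults, which is why the same value shows up for every unformatted paragraph.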

siva19185 wrote:
The margin-bottom comes from w:spacing/@w:after="200", again, in w:pPrDefault. 200 means 200 twips.

Have you applied it to the HTML o/p currently? Or can I hard code it, as I don't see any margin of 0 0 10pt on the p tag? How does Word get from a fixed value of 200 to 0 0 10pt for all documents?


See explanation above as to where the 200 comes from.

NG2 now supports this; see http://dev.plutext.org/trac/docx4j/changeset/972

So all that remains is the multiple spaces issue; I'll leave it to you to do some more exploring of what Word outputs, and how browsers behave. If you can summarise that, I'll implement it.

Re: Doubt in applying p tag styles while converting docx to html

Posted: Fri Nov 13, 2009 3:00 pm
by siva19185
Hi jason,
I found your implementation for line spacing in a new Java file. Thanks for your time and reply. Yup, o/p means output!!! :)
Sure, I will try to work out how to get the space between words to show in the browser. But my problem is how to identify, from document.xml, the number of spaces between words that has to be represented in HTML.

I think we need the value of some other node which changes as changes are made in MS Word, but I am unable to find it in document.xml or style.xml.
Please correct me if I am wrong. The issue persists.

Re: Doubt in applying p tag styles while converting docx to html

Posted: Mon Nov 16, 2009 2:41 am
by jason
siva19185 wrote:Sure, I will try to work out how to get the space between words to show in the browser. But my problem is how to identify, from document.xml, the number of spaces between words that has to be represented in HTML.


Don't worry about that. I can do that easily via an XSLT extension function or plain XSLT. Assume that if there are, say, 5 spaces in the docx in a w:t with xml:space="preserve", we can detect them and generate whatever we want for them.

All you need to do is experiment with handcoded HTML in the various web browsers, to determine what we should be generating.
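
To show that detection is not the hard part, here is a sketch that counts runs of spaces in every w:t marked xml:space="preserve". Python's xml.etree.ElementTree stands in for the XSLT extension function mentioned above; the function name space_runs and the sample paragraph are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# The xml: prefix is always bound to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def space_runs(document_xml):
    """Return the length of each run of 2+ spaces found in preserved w:t elements."""
    runs = []
    root = ET.fromstring(document_xml)
    for t in root.iter(f"{{{W}}}t"):
        if t.get(XML_SPACE) == "preserve" and t.text:
            runs.extend(len(m.group()) for m in re.finditer(r" {2,}", t.text))
    return runs


text = "one" + " " * 5 + "two" + " " * 2 + "three"
sample = (
    f'<w:p xmlns:w="{W}">'
    f'<w:r><w:t xml:space="preserve">{text}</w:t></w:r>'
    "</w:p>"
)
print(space_runs(sample))
```

The spaces are ordinary characters inside the w:t text, so no other node needs to be consulted to find out how many there are.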

siva19185 wrote:My problem is that even if I change the bottom padding in MS Word, I don't see a difference in the w:spacing/@w:after="200" value in style.xml.


I think you may be misunderstanding how Word 2007 and the Open XML schemas work:
  • If you are directly applying formatting in Word 2007, you should see the formatting appear on the paragraph in document.xml.
  • If you are modifying the style definitions themselves, you will see the changes reflected in styles.xml.

My code creates a CSS rule to represent the doc defaults in styles.xml:

Code:
.DocDefaults {display:block;space-after: 4mm;line-height: 115%;font-size: 11.0pt;}


Similarly, if w:spacing is set in any style, you should see it in a corresponding CSS rule for that style.

Direct formatting in the docx translates to direct formatting in the CSS.

Example: line-height

Code:
<w:p><w:pPr><w:spacing w:line="480" w:lineRule="auto"/></w:pPr>...


becomes

Code:
<p class="Normal DocDefaults " style="line-height: 200%;"> ....


Example: space-after

Code:
<w:p><w:pPr><w:spacing w:after="600"/></w:pPr>...


becomes

Code:
<p class="Normal DocDefaults " style="space-after: 11mm;">
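
The mapping in the two examples above can be sketched as a function that reads w:pPr/w:spacing from a paragraph and builds the value of a style attribute. This is an illustration, not the exporter's code: the name inline_style is hypothetical, only w:lineRule="auto" is handled, and it emits the standard CSS property margin-bottom in points where the output above shows space-after in millimetres.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def inline_style(p_xml):
    """Build a style attribute value from direct w:spacing formatting on a w:p."""
    p = ET.fromstring(p_xml)
    spacing = p.find(f"{{{W}}}pPr/{{{W}}}spacing")
    if spacing is None:
        return ""  # no direct formatting; styles and defaults apply instead
    rules = []
    line = spacing.get(f"{{{W}}}line")
    if line is not None and spacing.get(f"{{{W}}}lineRule", "auto") == "auto":
        rules.append(f"line-height: {int(line) * 100 / 240:g}%;")
    after = spacing.get(f"{{{W}}}after")
    if after is not None:
        rules.append(f"margin-bottom: {int(after) / 20:g}pt;")
    return " ".join(rules)


ns = f'xmlns:w="{W}"'
print(inline_style(f'<w:p {ns}><w:pPr><w:spacing w:line="480" w:lineRule="auto"/></w:pPr></w:p>'))
print(inline_style(f'<w:p {ns}><w:pPr><w:spacing w:after="600"/></w:pPr></w:p>'))
```

For the two sample paragraphs this gives line-height: 200%; and margin-bottom: 30pt; respectively.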

Re: Doubt in applying p tag styles while converting docx to html

Posted: Tue Nov 17, 2009 5:15 pm
by siva19185
Hi Jason,
Thanks for your reply and time. I think there was some problem with my docx. I created a new docx and the changes appeared correctly in document.xml, as you said. Sorry for troubling you over the problem in my docx. I will continue to work on the spaces issue.

If possible please update me how we can detect the spaces because based on number of spaces got i will try to put the result in different forms, so that the browser display things correctly. So far in my testing   put between span tag gets converted to &nbsp by mozilla browser 3.5 & IE7 and create a empty space in HTML output!!!!

Also, if possible, please provide a nightly build with the work you have done on the <p> tag issue with line spacing and bottom margin.

Thanks,
Siva.

Re: Doubt in applying p tag styles while converting docx to html

Posted: Wed Nov 18, 2009 1:08 am
by jason
siva19185 wrote:If possible please update me how we can detect the spaces because based on number of spaces got i will try to put the result in different forms, so that the browser display things correctly. So far in my testing   put between span tag gets converted to &nbsp by mozilla browser 3.5 & IE7 and create a empty space in HTML output!!!!


See http://vishalmanohar.wordpress.com/2008 ... s-in-html/ which references http://www.w3.org/TR/CSS21/text.html#white-space-prop

Maybe all that is required is to put that css setting on anything which had @xml:space='preserve', which we'd do in the XSLT.

If that doesn't work as expected, in the XSLT you could have a template which matches text and invokes an extension function which returns something suitable. But you can't do that until you know what you want to return, hence the need for testing in the web browsers first. When you test, as I said before, you need to note whether quirks mode or standards mode makes a difference; see http://www.alistapart.com/articles/doctype/

The HTML output should include a doctype; that's a TODO.

siva19185 wrote:Also, if possible, please provide a nightly build with the work you have done on the <p> tag issue with line spacing and bottom margin.


Done, at http://dev.plutext.org/docx4j/docx4j-ni ... 091118.jar

Re: Doubt in applying p tag styles while converting docx to html

Posted: Thu Nov 19, 2009 2:54 pm
by jason
jason wrote:Maybe all that is required is to put that css setting on anything which had @xml:space='preserve', which we'd do in the XSLT.


I've done this in http://dev.plutext.org/trac/docx4j/changeset/974

As per the comment, IE7 does not honour it; but recent FF does, as do WebKit based browsers.
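
For anyone who wants to reproduce the browser test by hand, the idea of putting the CSS white-space setting on preserved text can be sketched as follows. The thread does not say which white-space value the changeset uses; pre-wrap is an assumption here, chosen because it keeps runs of spaces while still allowing line wrapping. The function name render_text is hypothetical.

```python
from html import escape


def render_text(text, preserve):
    """Render the content of a w:t as HTML, keeping spaces when xml:space='preserve'."""
    body = escape(text, quote=False)
    if preserve:
        # Assumed value: 'pre-wrap' (see the CSS 2.1 white-space property).
        return f'<span style="white-space: pre-wrap;">{body}</span>'
    return body


spaced = "a" + " " * 3 + "b"
print(render_text(spaced, True))
print(render_text("x < y", False))
```

Loading the first line of output in each browser of interest, in both standards and quirks mode, shows whether the three spaces survive.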