
Math equations and docx to html conversion not working

PostPosted: Sat Jan 21, 2012 5:06 am
by keithphw
Hi,

I'm completely new to docx4j, so please excuse my naive question.

I am trying to convert an MS Word 2007 docx file to html.

The docx files I want to convert contain mathematical equations that I created in MS Word using the equation editor, so the equations are not images. When I use the sample code to convert the docx file to html, it works very well, except that the equations are ignored: there is no html output where the equations should be. The logs suggest that certain math tags are not supported:
Code: Select all
7003 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMath;
7205 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMathPara;


I see that docx4j has a package 'org.docx4j.math' with lots of classes so I assume that there is a way to make equations work using docx4j but it is not obvious to me.

I attached my docx file that I'm trying to convert.

This is my code, copied from the samples:

Code: Select all
package docxconverter;

import java.io.File;
import java.io.OutputStream;

import org.docx4j.convert.out.Containerization;
import org.docx4j.convert.out.html.AbstractHtmlExporter;
import org.docx4j.convert.out.html.HtmlExporterNG2;
import org.docx4j.convert.out.html.SdtWriter;
import org.docx4j.convert.out.html.TagSingleBox;
import org.docx4j.convert.out.html.AbstractHtmlExporter.HtmlSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

/**
* If the source docx contained a WMF, that
* will get converted to inline SVG.  In order
* to see the SVG in your browser, you'll need
* to rename the file to .xml or serve
* it with MIME type application/xhtml+xml
*
*/
public class DocxToHtmlConverter {

   public static void main(String[] args)
         throws Exception {
      File inputfilepath = null;
      try {
         inputfilepath = new File("C:/Users/Keith/Desktop", "IndividualQuestion.docx");
      } catch (IllegalArgumentException e) {
         e.printStackTrace();
      }
      System.out.println(inputfilepath);

      boolean save = true;

      // Load .docx or Flat OPC .xml
      WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(inputfilepath);
      AbstractHtmlExporter exporter = new HtmlExporterNG2();
      HtmlSettings htmlSettings = new HtmlSettings();
      htmlSettings.setImageDirPath(inputfilepath + "_files");
      htmlSettings.setUserBodyTop("<H1>TOP!</H1>");
      htmlSettings.setUserBodyTail("<H1>TAIL!</H1>");

      // Sample sdt tag handler (tag handlers insert specific
      // html depending on the contents of an sdt's tag). 
      // This will only have an effect if the sdt tag contains
      // the string @class=XXX
//         SdtWriter.registerTagHandler("@class", new TagClass() );

      SdtWriter.registerTagHandler(Containerization.TAG_BORDERS, new TagSingleBox());
      SdtWriter.registerTagHandler(Containerization.TAG_SHADING, new TagSingleBox());
      
      OutputStream os;
      if (save) {
         os = new java.io.FileOutputStream(inputfilepath + ".html");
      } else {
         os = System.out;
      }

      javax.xml.transform.stream.StreamResult result = new javax.xml.transform.stream.StreamResult(os);
      try {
         exporter.html(wordMLPackage, result, htmlSettings);
      } finally {
         // Flush the output, and close the file stream (but never System.out)
         os.flush();
         if (save) {
            os.close();
         }
      }
      if (save) {
         System.out.println("Saved: " + inputfilepath + ".html using " + exporter.getClass().getName());
      }

   }
}


Here is the complete log of the output from using the above code together with the attached file:

Code: Select all
C:\Users\Keith\Desktop\IndividualQuestion.docx
log4j:WARN No appenders could be found for logger (org.docx4j.utils.ResourceUtils).
log4j:WARN Please initialize the log4j system properly.
11 [main] INFO org.docx4j.utils.Log4jConfigurator  - Since your log4j configuration (if any) was not found, docx4j has configured log4j automatically.
291 [main] INFO org.docx4j.jaxb.Context  - JAXB: RI not present.  Trying Java 6 implementation.
291 [main] INFO org.docx4j.jaxb.Context  - JAXB: Using Java 6 implementation.
291 [main] INFO org.docx4j.jaxb.Context  - loading Context jc
5380 [main] INFO org.docx4j.jaxb.Context  - loaded com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl .. loading others ..
5575 [main] INFO org.docx4j.jaxb.Context  - .. others loaded ..
5584 [main] INFO org.docx4j.openpackaging.contenttype.ContentTypeManager  - Detected WordProcessingML package
5594 [main] INFO org.docx4j.openpackaging.parts.Part  - /_rels/.rels
5595 [main] INFO org.docx4j.openpackaging.parts.relationships.RelationshipsPart  - unmarshalling org.docx4j.openpackaging.parts.relationships.RelationshipsPart
5603 [main] INFO org.docx4j.openpackaging.parts.Part  - /docProps/app.xml
5604 [main] INFO org.docx4j.openpackaging.parts.DocPropsExtendedPart  - unmarshalling org.docx4j.openpackaging.parts.DocPropsExtendedPart
5606 [main] INFO org.docx4j.openpackaging.parts.Part  - /docProps/core.xml
5607 [main] INFO org.docx4j.openpackaging.parts.DocPropsCorePart  - unmarshalling org.docx4j.openpackaging.parts.DocPropsCorePart
5610 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/document.xml
5610 [main] INFO org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart  - For MDP, unmarshall via binder
5765 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/_rels/document.xml.rels
5765 [main] INFO org.docx4j.openpackaging.parts.relationships.RelationshipsPart  - unmarshalling org.docx4j.openpackaging.parts.relationships.RelationshipsPart
5767 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/webSettings.xml
5769 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/settings.xml
5774 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/styles.xml
5792 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/theme/theme1.xml
5793 [main] INFO org.docx4j.openpackaging.parts.ThemePart  - unmarshalling org.docx4j.openpackaging.parts.ThemePart
5806 [main] INFO org.docx4j.openpackaging.parts.Part  - /word/fontTable.xml
6488 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/35191___.TTF is not embeddable; ignoring this font.
6489 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/40240___.TTF is not embeddable; ignoring this font.
6491 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/50416___.TTF is not embeddable; ignoring this font.
6492 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/51253___.TTF is not embeddable; ignoring this font.
6495 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/57930___.TTF is not embeddable; ignoring this font.
6495 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/57961___.TTF is not embeddable; ignoring this font.
6497 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/63193___.TTF is not embeddable; ignoring this font.
6498 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/65659___.TTF is not embeddable; ignoring this font.
6499 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/70214___.TTF is not embeddable; ignoring this font.
6500 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/70729___.TTF is not embeddable; ignoring this font.
6501 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/75678___.TTF is not embeddable; ignoring this font.
6503 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/78640___.TTF is not embeddable; ignoring this font.
6503 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/78936___.TTF is not embeddable; ignoring this font.
6504 [main] WARN org.docx4j.fonts.PhysicalFonts  - file:/C:/Windows/FONTS/89198___.TTF is not embeddable; ignoring this font.
6507 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/ALGER.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6539 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/BAUHS93.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6542 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/BERNHC.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6561 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/BROADW.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6580 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/CHILLER.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6620 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/ELEPHNTI.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6635 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/Gabriola.ttf (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6645 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/GIGI.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6656 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/HARLOWSI.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6657 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/HARNGTON.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6658 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/HATTEN.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6659 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/HP%20PSG.ttf (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6661 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/impact.ttf (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6664 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/ITCBLKAD.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6666 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/JOKERMAN.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6666 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/JUICE___.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6732 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/PLAYBILL.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6755 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/SNAP____.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6755 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/STENCIL.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6763 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/TEMPSITC.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6770 [main] WARN org.docx4j.fonts.PhysicalFonts  - Aborting: file:/C:/Windows/FONTS/TT0131M_.TTF (can't get EmbedFontInfo[] .. try deleting fop-fonts.cache?)
6818 [main] INFO org.docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart  - Style with name Normal, id 'Normal' is default paragraph style
6818 [main] INFO org.docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart  - Style with name Default Paragraph Font, id 'DefaultParagraphFont' is default character style
6911 [main] INFO org.docx4j.convert.out.html.HtmlExporterNG2  - /pkg:package
6932 [main] INFO org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart  - Preparing StyleTree
6940 [main] WARN org.docx4j.model.properties.Property  - Font 'Arial' is not mapped to a physical font.
6940 [main] WARN org.docx4j.model.properties.Property  - No mapping from null
6940 [main] WARN org.docx4j.convert.out.html.AbstractHtmlExporter  - ! null rPr for character style DefaultParagraphFont
7003 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMath;
7205 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMathPara;
7207 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMathPara;
7210 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMathPara;
7212 [main] WARN org.docx4j.convert.out.html.HtmlExporterNG2  - NOT IMPLEMENTED: support for m:oMathPara;
7220 [main] INFO org.docx4j.convert.out.html.HtmlExporterNG2  - wordDocument transformed to xhtml ..
Saved: C:\Users\Keith\Desktop\IndividualQuestion.docx.html using org.docx4j.convert.out.html.HtmlExporterNG2


If anyone has any pointers I would be very grateful.
Keith

Re: Math equations and docx to html conversion not working

PostPosted: Sat Jan 21, 2012 11:23 am
by jason
Hi Keith

As you've discovered, docx4j's HTML output currently can't handle equations.

docx-java-f6/need-to-handle-latex-equation-t293.html sketches an approach for doing this. That post is nearly 2 years old now, so some of the tools may have matured or changed since I wrote it. If you look further into this and find anything of interest, please share.

cheers .. Jason

Re: Math equations and docx to html conversion not working

PostPosted: Sat Jan 21, 2012 2:43 pm
by keithphw
Thank you for your swift answer!

I see that the poster (AnbuChezhian) in that forum link wanted to render the equations, which is the ideal goal. But even if the equations were just output as MathML (mml), that would suffice, since some browsers (such as Firefox) render MathML natively, and there are other ways to render it using JavaScript, such as MathJax: http://www.mathjax.org/.

I will try to edit the docx4j source code so that whatever math markup is being used at the moment (omml?) gets converted to mml. I'll get back to you with my progress. Sounds like I need to learn JAXB and how the docx4j source works first.

Thanks for your pointers.

Cheers,
Keith

Re: Math equations and docx to html conversion not working

PostPosted: Sat Jan 21, 2012 2:49 pm
by keithphw
Also, thanks for making this amazing library Jason and the other devs. It's incredibly useful.

Re: Math equations and docx to html conversion not working

PostPosted: Mon Jan 23, 2012 11:45 am
by jason
My pleasure. Look forward to hearing how you go.

Re: Math equations and docx to html conversion not working

PostPosted: Mon Jan 23, 2012 7:36 pm
by keithphw
Hi Jason,
I'm confused about how to do the omml (Office Math Markup Language, MS Word's math format) to mml (MathML, the W3C's math format) translation.

I've managed to build the docx4j project using Maven. I've researched how XSL Transformers work, and identified that the transformation of docx to html is done in the method org.docx4j.convert.out.html.HtmlExporterNG2.html(WordprocessingMLPackage wmlPackage, javax.xml.transform.Result result, HtmlSettings htmlSettings).

This method uses the org/docx4j/convert/out/html/docx2xhtmlNG2.xslt file to do the transformation.

I also have the forum poster AnbuChezhian's microsoft file OMML2MML.XSL.

But I don't know how to use it to process the omml branches of the docx file. I am just trying to convert the docx file's omml branches to mml.

In this document (http://docs.oracle.com/javaee/1.4/tutor ... XSLT8.html) it says that I just need to chain the xslt Transformers together. I see that you use just one transform in the method org.docx4j.XmlUtils.transform(javax.xml.transform.Source source, javax.xml.transform.Templates template, Map<String, Object> transformParameters, javax.xml.transform.Result result).

I am unsure what to do next. I have a feeling that I need to add a second transform in HtmlExporterNG2.html(...), one that does the omml to mml step using OMML2MML.XSL, after the line:
org.docx4j.XmlUtils.transform(doc, xslt, htmlSettings.getSettings(), result);
I also think that I'll need to modify the HtmlExporterNG2.notImplemented(NodeIterator nodes, String message) method so that it doesn't suppress the output of the omml in the first step.

I will try to implement this. But if I am on the wrong track, could you please give me some pointers? :)

Thanks,
Keith
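For readers following along, the "chain the Transformers together" idea from that tutorial can be sketched with plain JAXP. This is only an illustration of feeding the result of one transform into a second one through a DOMResult; it does not use docx4j, and the two tiny stylesheets (and the element names formula, math and div) are made-up stand-ins rather than docx2xhtmlNG2.xslt or OMML2MML.XSL.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ChainedTransformSketch {

    // Hypothetical first pass: wraps the input's root element in a <div>.
    static final String FIRST_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><div><xsl:copy-of select='*'/></div></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical second pass: identity transform, except <formula> becomes <math>.
    static final String SECOND_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='formula'>"
      + "<math><xsl:apply-templates select='@*|node()'/></math>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transformTwice(String xml) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer first = factory.newTransformer(new StreamSource(new StringReader(FIRST_XSLT)));
        Transformer second = factory.newTransformer(new StreamSource(new StringReader(SECOND_XSLT)));

        // Keep the output of the first transform in memory as a DOM tree ...
        DOMResult intermediate = new DOMResult();
        first.transform(new StreamSource(new StringReader(xml)), intermediate);

        // ... and use that tree as the input of the second transform.
        StringWriter out = new StringWriter();
        second.transform(new DOMSource(intermediate.getNode()), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transformTwice("<formula>x</formula>"));
    }
}
```

Chaining only helps if the first pass leaves the math markup in its output, which is the concern raised above about notImplemented; importing or including the second stylesheet into the first is the alternative discussed later in the thread.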

Re: Math equations and docx to html conversion not working

PostPosted: Tue Jan 24, 2012 12:27 am
by jason
Hi Keith

Do you have a simple docx which contains an example of OMML which you could post?

Based on a look at that, I'll give you some pointers.

cheers .. Jason

Re: Math equations and docx to html conversion not working

PostPosted: Tue Jan 24, 2012 1:39 pm
by keithphw
Here's a short one that just shows the Greek beta character:
Code: Select all
<m:oMath><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>β</m:t></m:r></m:oMath>


Here's a long one which shows some fractions:
Code: Select all
<m:oMathPara><m:oMathParaPr><m:jc m:val="left"/></m:oMathParaPr><m:oMath><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t xml:space="preserve">      </m:t></m:r><m:r><w:rPr><w:rFonts w:ascii="Cambria Math"/></w:rPr><m:t>=</m:t></m:r><m:f><m:fPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/><w:i/></w:rPr></m:ctrlPr></m:fPr><m:num><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>40</m:t></m:r></m:num><m:den><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>40+80</m:t></m:r></m:den></m:f><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>×0.5</m:t></m:r><m:r><w:rPr><w:rFonts w:ascii="Cambria Math"/></w:rPr><m:t>+</m:t></m:r><m:f><m:fPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/><w:i/></w:rPr></m:ctrlPr></m:fPr><m:num><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>80</m:t></m:r></m:num><m:den><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>40+80</m:t></m:r></m:den></m:f><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>×1.5</m:t></m:r></m:oMath></m:oMathPara>


These xml snippets are from the 'IndividualQuestion.docx' file that I attached in the previous post. If you unzip that file and look in: IndividualQuestion/word/document.xml you'll see where it fits in.

Thanks for your help.
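As a rough sketch of the "unzip and look in word/document.xml" step, the following uses only the JDK (no docx4j) to open a docx as a zip archive and count the m:oMath elements in its main document part. The class and method names are made up for illustration, and the sample is a tiny hand-built archive rather than the attached IndividualQuestion.docx.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class OmmlCounterSketch {

    // Namespace bound to the "m" prefix in WordprocessingML documents.
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Minimal stand-in for a real word/document.xml: one inline equation, one display equation.
    static final String SAMPLE_DOCUMENT_XML =
        "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " xmlns:m='" + M_NS + "'><w:body>"
      + "<w:p><m:oMath><m:r><m:t>x</m:t></m:r></m:oMath></w:p>"
      + "<w:p><m:oMathPara><m:oMath><m:r><m:t>y</m:t></m:r></m:oMath></m:oMathPara></w:p>"
      + "</w:body></w:document>";

    /** Counts m:oMath elements in word/document.xml; returns -1 if that part is missing. */
    static int countEquations(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry first so the XML parser cannot close the zip stream.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                dbf.setNamespaceAware(true);
                Document doc = dbf.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buffer.toByteArray()));
                return doc.getElementsByTagNameNS(M_NS, "oMath").getLength();
            }
        }
        return -1;
    }

    /** Builds a one-entry zip in memory, standing in for a docx file on disk. */
    static byte[] tinyZip(String entryName, String content) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = tinyZip("word/document.xml", SAMPLE_DOCUMENT_XML);
        System.out.println("m:oMath elements: " + countEquations(new ByteArrayInputStream(docx)));
    }
}
```

To try it on a real file, pass a java.io.FileInputStream for the docx to countEquations instead of the in-memory archive.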

Re: Math equations and docx to html conversion not working

PostPosted: Wed Jan 25, 2012 1:17 am
by jason
If that stuff is indeed OMML, you ought to be able to import or include your OMML2MML.XSL in docx2xhtmlNG2.xslt.

See http://www.xml.com/pub/a/2000/11/01/xslt/index.html

If you are having difficulty, I'd start by cutting and pasting your OMML sample into a separate XML file, and then try running OMML2MML.XSL on it. There are various tools around which will help you do this without needing to write a program, including Eclipse, and Xalan (command line).

Once you know the OMML2MML.XSL bit is working properly, you can try importing/including it.
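If you would rather test from Java than from Eclipse or the Xalan command line, a standalone check along those lines might look like the sketch below. It assumes nothing about docx4j. Because OMML2MML.XSL is not reproduced in this thread, the built-in stylesheet is a deliberately tiny, made-up stand-in that only maps m:oMath and m:r/m:t to mml:math and mml:mi; pass the path of the real stylesheet and of your sample XML file as arguments to try the real thing.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class OmmlToMathMlSketch {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Stand-in stylesheet (NOT Microsoft's OMML2MML.XSL): handles plain runs only.
    static final String STAND_IN_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:m='" + M_NS + "'"
      + " xmlns:mml='http://www.w3.org/1998/Math/MathML'"
      + " exclude-result-prefixes='m'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='m:oMath'>"
      + "<mml:math><xsl:apply-templates select='m:r'/></mml:math>"
      + "</xsl:template>"
      + "<xsl:template match='m:r'>"
      + "<mml:mi><xsl:value-of select='m:t'/></mml:mi>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // The short beta sample from this thread, with its namespace declarations added
    // so that it parses as a standalone XML document.
    static final String SAMPLE_OMML =
        "<m:oMath xmlns:m='" + M_NS + "' xmlns:w='" + W_NS + "'>"
      + "<m:r><w:rPr><w:rFonts w:ascii='Cambria Math' w:hAnsi='Cambria Math'/></w:rPr>"
      + "<m:t>\u03B2</m:t></m:r></m:oMath>";

    static String transform(Source stylesheet, Source xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(xml, new StreamResult(out));
        return out.toString();
    }

    static String convertWithStandIn(String omml) throws Exception {
        return transform(new StreamSource(new StringReader(STAND_IN_XSLT)),
                         new StreamSource(new StringReader(omml)));
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Usage: java OmmlToMathMlSketch <stylesheet file> <xml file>
            System.out.println(transform(new StreamSource(new File(args[0])),
                                         new StreamSource(new File(args[1]))));
        } else {
            System.out.println(convertWithStandIn(SAMPLE_OMML));
        }
    }
}
```

Note that a fragment pasted out of word/document.xml needs the m: and w: namespace declarations on its root element before any XSLT processor will parse it.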

Re: Math equations and docx to html conversion not working

PostPosted: Wed Jan 25, 2012 1:47 pm
by keithphw
Cheers Jason, I'll try that.

By the way, it's interesting that most of the problems devs post on these forums, mine included, come down to a misunderstanding of how xml and xslt work.

To be honest, when I read about the existence of xslt and that it is really a way of programming in xml, I was shocked. I always thought of xml as a way of storing data, not as a programming language. Its syntax for programming is entirely foreign.

Out of interest, did you write the xslt that does the docx to html conversion?

Re: Math equations and docx to html conversion not working

PostPosted: Wed Jan 25, 2012 7:27 pm
by jason
keithphw wrote: Out of interest, did you write the xslt that does the docx to html conversion?


Yes :-)

Re: Math equations and docx to html conversion not working

PostPosted: Mon Jan 07, 2013 7:31 pm
by placeintime
Hi Jason and Keith,
I'm a beginner with docx4j and I have a similar problem. I was so glad when I found this topic, so I tried to build a demo to transform a docx with math equations.
I have put OMML2MML.XSL into the same folder as docx2xhtmlNG2.xslt, and have added an include to docx2xhtmlNG2.xslt:
<xsl:include href="OMML2MML.xsl"/>
but it still warns: NOT IMPLEMENTED: support for m:oMath;
Keith said "I also think that I'll need to modify the HtmlExporterNG2.notImplemented(NodeIterator nodes, String message) method so that it doesn't suppress the output of the omml in the first step."
But the function notImplemented is a Xalan XSLT extension function. I think it works like a callback into the jar or something like that, and the support check runs before this function is called.
So, how do I make it work?

Re: Math equations and docx to html conversion not working

PostPosted: Mon Jan 07, 2013 8:25 pm
by placeintime
Hi Jason and Keith,
Sorry, I found out that I made a mistake: I put OMML2MML.XSL in the wrong folder.
I solved this, and now everything looks OK :)
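The folder matters because an XSLT processor resolves the href of xsl:include relative to the location of the stylesheet that contains it. The sketch below illustrates that with plain JAXP and two temporary files placed side by side. It is an assumption-laden illustration, not docx4j's own loading code (how docx4j locates docx2xhtmlNG2.xslt is not shown in this thread), and the file names main.xsl and math-stand-in.xsl, like the stylesheets themselves, are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IncludeResolutionSketch {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Hypothetical main stylesheet: pulls in a sibling file by relative href.
    static final String MAIN_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:include href='math-stand-in.xsl'/>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/doc'><div><xsl:apply-templates/></div></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical included stylesheet: a tiny stand-in for OMML2MML.XSL.
    static final String INCLUDED_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:m='" + M_NS + "'"
      + " xmlns:mml='http://www.w3.org/1998/Math/MathML'"
      + " exclude-result-prefixes='m'>"
      + "<xsl:template match='m:oMath'>"
      + "<mml:math><xsl:apply-templates select='m:r'/></mml:math>"
      + "</xsl:template>"
      + "<xsl:template match='m:r'>"
      + "<mml:mi><xsl:value-of select='m:t'/></mml:mi>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static final String SAMPLE =
        "<doc xmlns:m='" + M_NS + "'><m:oMath><m:r><m:t>x</m:t></m:r></m:oMath></doc>";

    static String run(String xml) throws Exception {
        Path dir = Files.createTempDirectory("xslt-include");
        Path main = dir.resolve("main.xsl");
        Path included = dir.resolve("math-stand-in.xsl");
        try {
            Files.write(main, MAIN_XSLT.getBytes(StandardCharsets.UTF_8));
            Files.write(included, INCLUDED_XSLT.getBytes(StandardCharsets.UTF_8));

            // StreamSource(File) records the file's location as the system ID,
            // which is the base against which the relative href is resolved.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(main.toFile()));

            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } finally {
            Files.deleteIfExists(included);
            Files.deleteIfExists(main);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(SAMPLE));
    }
}
```

If the included file sits in a different folder from the including stylesheet, the relative href no longer points at it, which matches the symptom described above.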