Creating XML from docx

Posted: Wed Feb 20, 2013 11:17 am
by rhaley
Hello,
I am very new to docx4j. It has great potential for allowing our users to author documents/articles in Word without having to touch an XML editor.

One problem that I need to solve is that I think I need a package: org.docx4j.convert.out.xml (similar to .....html) that will allow custom conversion from docx to our proprietary XML format.

I have been trying to use org.docx4j.convert.out.html.HtmlExporterNG2 with the sample class org.docx4j.samples.ConvertOutHtml and I have had some success but I need to save various calls in the stylesheet (docx2xhtml.xslt) to variables to later restyle.

Here is an example that I would need to rework:
Code:
<!-- original -->
<xsl:copy-of select="java:org.docx4j.model.images.WordXmlPictureE20.createHtmlImgE20(
           $conversionContext,
           $wpinline)" />

<!-- changed to: capture the generated img element in a variable -->
<xsl:variable name="image" select="java:org.docx4j.model.images.WordXmlPictureE20.createHtmlImgE20(
           $conversionContext,
           $wpinline)" />
<!-- pull the src attribute out of the captured result -->
<xsl:variable name="path" select="$image//@src"/>
<!-- build custom tag here -->


I guess I am looking for the best approach. If I am already using it, great! But I don't think I am. Ideally, I would love to be able to map Word XML elements to our custom XML elements.
For example:
Code:
w:p = Para
w:tbl = Table
w:drawing = StillImageExhibit
w:p[w:pPr/w:pStyle/@w:val = "ListParagraph"] = ListItem


or allow the user to pass in the tag they want to generate when converting w:p to Para.

I started to create org.docx4j.convert.out.xml locally, but I didn't want to go too far before I posted something.

Thanks in advance.

Re: Creating XML from docx

Posted: Wed Feb 20, 2013 11:21 am
by rhaley
I would also like to disable font processing, as our XML doesn't care about style.

Re: Creating XML from docx

Posted: Wed Feb 20, 2013 12:19 pm
by jason
An alternative would be to work from https://github.com/plutext/docx4j/blob/ ... nXSLT.java

Depending exactly what state you need, you ought to be able to maintain that easily enough...

Re: Creating XML from docx

Posted: Wed Feb 20, 2013 6:52 pm
by rhaley
I am trying to minimize the dependency on Java Developers and allow more people to be involved in the transformation steps.

We have quite a few people that are proficient in XSLT.

Do you think mapping w:p to Para, and other such tags, is an option?

Re: Creating XML from docx

Posted: Wed Feb 20, 2013 7:34 pm
by jason
rhaley wrote:do you think the mapping w:p to Para and other such tags is an option?


Mapping w:p to Para is straightforward.
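
As a rough illustration of that mapping (this is not docx4j's own API), the sketch below runs a small XSLT 1.0 stylesheet through the JDK's built-in javax.xml.transform. Para and ListItem come from the mapping list earlier in the thread; the Article wrapper, the class name ParaMappingSketch and the sample input are assumptions made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ParaMappingSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical stylesheet: only Para and ListItem are taken from the thread.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "' exclude-result-prefixes='w'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<Article><xsl:apply-templates select='w:document/w:body/w:p'/></Article>"
        + "</xsl:template>"
        // Plain paragraph: emit Para with the text of its runs.
        + "<xsl:template match='w:p'>"
        + "<Para><xsl:apply-templates select='w:r/w:t/text()'/></Para>"
        + "</xsl:template>"
        // The predicate gives this template a higher default priority than 'w:p'.
        + "<xsl:template match=\"w:p[w:pPr/w:pStyle/@w:val='ListParagraph']\">"
        + "<ListItem><xsl:apply-templates select='w:r/w:t/text()'/></ListItem>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Made-up input standing in for a real word/document.xml.
    static final String SAMPLE =
          "<w:document xmlns:w='" + W_NS + "'><w:body>"
        + "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        + "<w:p><w:pPr><w:pStyle w:val='ListParagraph'/></w:pPr>"
        + "<w:r><w:t>One</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    // Applies the stylesheet to a WordprocessingML string and returns the result.
    static String transform(String documentXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(documentXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE));
    }
}
```

Required attributes on Para could be added as literal attributes or xsl:attribute instructions inside the same templates, so the XSLT people can own the mapping without touching the Java wrapper.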

Without more details on your other mapping requirements, it's difficult to assess what, if any, complications may arise.

Another option entirely: save your docx as "Flat OPC XML" (in Word, save as XML), then you can potentially use a pure XSLT approach if that is what you prefer (ie no need for docx4j).

If your transformation doesn't need to read styles.xml, numbering.xml, headers/footers, footnotes/endnotes, comments, or images, you could just unzip and grab document.xml, and transform that.
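
A minimal sketch of the "unzip and grab document.xml" step, using only java.util.zip. It assumes the main part sits at word/document.xml (which is where Word normally writes it) and is UTF-8 encoded; the class name DocumentXmlExtractor and the writeDemoZip helper are made up for the example, and the helper produces a bare zip, not a valid docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class DocumentXmlExtractor {

    // Reads word/document.xml out of a docx (which is a zip archive).
    // Assumption: the part lives at the conventional path and is UTF-8.
    static String readDocumentXml(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("no word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper only: writes a temp zip holding just word/document.xml.
    static Path writeDemoZip(String xml) {
        try {
            Path p = Files.createTempFile("demo", ".docx");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(p))) {
                out.putNextEntry(new ZipEntry("word/document.xml"));
                out.write(xml.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path demo = writeDemoZip("<w:document/>");
        System.out.println(readDocumentXml(demo));
    }
}
```

The returned string can then be handed to whatever XSLT you like. Endnotes and images live in other parts of the package, so this shortcut only covers the cases listed above.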

Re: Creating XML from docx

Posted: Thu Feb 21, 2013 5:37 am
by rhaley
jason wrote:Another option entirely: save your docx as "Flat OPC XML" (in Word, save as XML), then you can potentially use a pure XSLT approach if that is what you prefer (ie no need for docx4j).


This won't work, as I need to keep the users as far away from XML as possible.

jason wrote:If your transformation doesn't need to read styles.xml, numbering.xml, headers/footers, footnotes/endnotes, comments, or images, you could just unzip and grab document.xml, and transform that.


Unfortunately, I need to read endnotes.xml and images.

jason wrote:Mapping w:p to Para is straightforward.


I would love to know how?

jason wrote:Without more details on your other mapping requirements, it's difficult to assess what, if any, complications may arise.


Well... w:p mapping to <Para> with some required attributes.

Re: Creating XML from docx

Posted: Mon Feb 25, 2013 6:49 pm
by jason
So you want to stay away from Java, and from XML/XSLT ;-)

Maybe the best docx4j could do for you is to provide a simplified representation of the docx (paragraphs, tables, images) as nicely hierarchical XML, i.e. hierarchical containers for H1, H2, etc., with lists detected and nested.

You'd then transform this nice XML to your particular dialect.
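
To make the "hierarchical containers" idea concrete, here is a rough sketch of how a flat sequence of headings and paragraphs could be nested. Nothing here is existing docx4j functionality: HeadingNester, Block, and the Doc/Section/Para element names are all made up for illustration, and list detection is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class HeadingNester {

    // headingLevel 0 = ordinary paragraph; 1, 2, ... = Heading 1, Heading 2, ...
    static final class Block {
        final int headingLevel;
        final String text;
        Block(int headingLevel, String text) {
            this.headingLevel = headingLevel;
            this.text = text;
        }
    }

    // Turns a flat block list into nested Section containers.
    static String nest(List<Block> blocks) {
        StringBuilder xml = new StringBuilder("<Doc>");
        Deque<Integer> open = new ArrayDeque<>(); // levels of the open sections
        for (Block b : blocks) {
            if (b.headingLevel > 0) {
                // A heading closes every open section at the same or a deeper level.
                while (!open.isEmpty() && open.peek() >= b.headingLevel) {
                    xml.append("</Section>");
                    open.pop();
                }
                xml.append("<Section level=\"").append(b.headingLevel)
                   .append("\" title=\"").append(escape(b.text)).append("\">");
                open.push(b.headingLevel);
            } else {
                xml.append("<Para>").append(escape(b.text)).append("</Para>");
            }
        }
        // Close whatever is still open at the end of the document.
        while (!open.isEmpty()) {
            xml.append("</Section>");
            open.pop();
        }
        return xml.append("</Doc>").toString();
    }

    // Minimal escaping for text and double-quoted attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(nest(List.of(
            new Block(1, "Intro"), new Block(0, "Some text"),
            new Block(2, "Detail"), new Block(0, "More text"))));
    }
}
```

XML shaped like this is easy to target from XSLT, since each section's content is already a child of its heading container.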