Archive for the ‘docx4j’ Category

docx4j license change

January 11th, 2008 by Jason

A note for the record: we’ve changed the docx4j license from the GPL v3 to the Affero General Public License v3. All users of whom we are aware are happy with this change.

The logic for the change is the same as the logic for licensing plutext-server under the Affero GPL: to ensure that people who use docx4j in a SaaS environment are treated the same as people who distribute docx4j to end users.

Licensing docx4j under an Apache-style license also has its attractions – let us know if this would make a difference to you.

OOXML, boolean values and binding

January 6th, 2008 by Jason

ST_OnOff is used extensively in the XML Schema. Here is the openiso.org link (nice resource!).

Basically, it is used for things which should use the built in boolean schema data type:

This simple type specifies a set of values for any binary (on or off) property defined in a WordprocessingML document.

For example, the b (bold) element has an attribute @val of type ST_OnOff.

There are several problems with how this is done.

The first is that the values it accepts for true are “on”, “1”, or “true” (and, correspondingly, “off”, “0”, or “false”). OOXML should just use the XSD boolean data type, which doesn’t allow “on” (or “off”). For related comments, see here, here, and here. Denmark and France seem to be the strongest advocates of the use of xsd:boolean, and I hope they get their way.

The second is that it is left to the specification text to say that if the attribute is omitted, its value is implied to be true. That should be expressed as part of the schema.

For CT_OnOff, it would be:

<xsd:complexType name="BooleanDefaultTrue">
  <xsd:attribute name="val" type="xsd:boolean" default="true" />
</xsd:complexType>

I don’t think Denmark or anyone else made this second point.
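To make the first two problems concrete, here is a minimal Java sketch of the lexical rules described above: ST_OnOff accepts “on”, “1” or “true” (and “off”, “0” or “false”), xsd:boolean accepts only “true”, “false”, “1” and “0”, and an omitted attribute is taken to mean true. The class and method names (OnOffValues, parseOnOff, isXsdBoolean) are made up for illustration and are not part of docx4j.

```java
// Hypothetical helper, not docx4j code: contrasts ST_OnOff with xsd:boolean.
class OnOffValues {

    /** Interprets an ST_OnOff attribute value; null stands for an omitted attribute. */
    static boolean parseOnOff(String val) {
        if (val == null) {
            return true; // the specification text says an omitted value implies true
        }
        switch (val) {
            case "on":
            case "1":
            case "true":
                return true;
            case "off":
            case "0":
            case "false":
                return false;
            default:
                throw new IllegalArgumentException("Not an ST_OnOff value: " + val);
        }
    }

    /** True if the value is in the lexical space of xsd:boolean. */
    static boolean isXsdBoolean(String val) {
        return "true".equals(val) || "false".equals(val)
                || "1".equals(val) || "0".equals(val);
    }

    public static void main(String[] args) {
        System.out.println(parseOnOff("on"));   // true
        System.out.println(isXsdBoolean("on")); // false: "on" is not an xsd:boolean
        System.out.println(parseOnOff(null));   // true: omitted attribute
    }
}
```

With xsd:boolean and default="true" in the schema, a binding framework could handle both the parsing and the default itself, and a helper like this would not be needed.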

The schema we are using in docx4j to generate classes uses these sorts of definitions instead of ST_OnOff or CT_OnOff.  For CT_OnOff, this results in a BooleanDefaultTrue type, which is used in fields like (for bold):

protected BooleanDefaultTrue b;

Which brings me to the third problem with ST_OnOff (and the schema in general), which is that it generates ugly code in JAXB and other binding frameworks (presumably .NET included). The built-in schema data types produce much nicer code.
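As a rough illustration of the “nicer code”, the sketch below shows the shape a BooleanDefaultTrue class and the bold field might take. It is an approximation, not the actual generated source: the JAXB annotations are left out so that it compiles without JAXB on the classpath, and RunProperties and isBold are hypothetical names added for the example.

```java
// Sketch only: real XJC output would carry JAXB annotations such as
// @XmlAttribute(name = "val"); they are omitted so this compiles on its own.
class BooleanDefaultTrue {
    protected Boolean val; // null when the val attribute is omitted

    public boolean isVal() {
        return val == null ? true : val; // mirrors default="true" in the schema
    }

    public void setVal(Boolean value) {
        this.val = value;
    }
}

// Hypothetical stand-in for the class that holds the bold field.
class RunProperties {
    protected BooleanDefaultTrue b; // null when there is no b element at all

    public BooleanDefaultTrue getB() {
        return b;
    }

    public void setB(BooleanDefaultTrue value) {
        this.b = value;
    }

    /** Bold if b is present and its val is true or omitted (style inheritance is ignored here). */
    public boolean isBold() {
        return b != null && b.isVal();
    }
}
```

The point of the design is that callers only ever see a plain boolean from isVal(); the “omitted means true” rule lives in one place instead of being repeated wherever the attribute is read.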

As a general remark, running the schema through JAXB is a good way to find places where the schema can be improved. Schema design goals should include:

  1. that it can be processed out of the box by binding frameworks (since that makes it easier for people to pick up a schema and start using it). [This is not currently the case]
  2. that the schema be expressed in such a way as to generate the simplest code.

docx4j trunk now uses JAXB

December 22nd, 2007 by Jason

10 days ago, we created a proof of concept for using JAXB on a subset of wml.xsd (one of the OpenXML schema files).

We’ve declared that a success, and moved it from a branch into the trunk of docx4j. Here be the generated classes.

plutext-server has now been migrated to use it.

And Jo is working with it as he codes docx4all.

So we’re pretty committed at this point!

We’re tidying up bits of the object model as we go (i.e. editing our xsd to generate Java that we like). So far, paragraphs (p, pPr, r, rPr, t) and structured document tags (sdt, sdtPr, sdtContent) have had our attention.

We’re also making a few changes to the generated classes, so we need to think about how best to prevent those changes from getting lost when the classes are re-generated. There’s a bit of support in XJC for this, and diff may come in handy, but I’d love to hear best practices.

What we have now is an object model for key pieces of the Main Document part (document.xml), in package name org.docx4.jaxb.document. Next cab off the rank is the Styles part, which we’ll put in org.docx4.jaxb.styles.

Docx4j branch: Using JAXB to unmarshall OOXML to Java

December 12th, 2007 by Jason

docx4j contains classes which represent key parts of a WordprocessingML docx.

For example, we have a paragraph class to represent the p element; another to represent a run, etc.

Each class knows how to unmarshall its docx XML representation, and marshall it again.

It will create specialised objects for things it knows how to handle (for example the paragraph content collection contains run objects). For XML we don’t have strongly typed objects for, the class will simply store that XML node, so that it can be round tripped.
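The round-tripping idea can be sketched with nothing but the JDK’s DOM classes. This is a simplified illustration, not the docx4j source: element names are used without the WordprocessingML namespace, a run is reduced to its text, and the names Paragraph, Run and RoundTripDemo are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// A strongly typed object for content we know how to handle.
class Run {
    String text;
    Run(String text) { this.text = text; }
}

class Paragraph {
    // Holds Run objects for known content and raw DOM nodes for everything else.
    final List<Object> content = new ArrayList<>();

    static Paragraph unmarshal(Element p) {
        Paragraph para = new Paragraph();
        for (Node n = p.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "r".equals(n.getNodeName())) {
                para.content.add(new Run(n.getTextContent()));
            } else {
                para.content.add(n); // unknown XML: store the node untouched
            }
        }
        return para;
    }

    Element marshal(Document doc) {
        Element p = doc.createElement("p");
        for (Object o : content) {
            if (o instanceof Run) {
                Element r = doc.createElement("r");
                r.setTextContent(((Run) o).text);
                p.appendChild(r);
            } else {
                p.appendChild(doc.importNode((Node) o, true)); // write stored node back out
            }
        }
        return p;
    }
}

class RoundTripDemo {
    /** Parses a paragraph, upper-cases its runs, and serialises it again. */
    static String roundTrip(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document in = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            Paragraph para = Paragraph.unmarshal(in.getDocumentElement());
            for (Object o : para.content) {
                if (o instanceof Run) {
                    ((Run) o).text = ((Run) o).text.toUpperCase();
                }
            }
            Document out = factory.newDocumentBuilder().newDocument();
            out.appendChild(para.marshal(out));
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(out), new StreamResult(w));
            return w.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling RoundTripDemo.roundTrip on a paragraph that contains a run plus an element the code knows nothing about should return the edited run text with the unknown element still in place.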

Instead of coding these classes one by one by hand, we wanted to see whether one of the Java-XML data binding frameworks could make our lives easier.

Given there is a standard for doing this (JSR 222 – The Java Architecture for XML Binding), we tried the JAXB reference implementation (JAXB 2.1.5).

The JAXB web presence leaves a lot to be desired. I’ll write a post on that shortly.

Having said that, I’m quite impressed with the spec and the reference implementation.

You feed your schema into xjc, and it generates Java classes.

The @XmlAnyElement annotation allows unknown elements to be round tripped, mimicking our existing code.

Why would there be any unknown elements you ask?

The answer is that we are using a subset of the wml.xsd schema from TC45. So there can be a lot of stuff in a docx document which falls outside the subset.

There are a number of reasons we are using a subset:

  1. running the entire schema through XJC produces lots of errors, both in the parsing phase, and once you overcome those, in the compiling stage
  2. more importantly, we’re unlikely to ever implement the entire WordML spec. So it makes sense to work with the subset of key features which are on our roadmap.
  3. you have to add annotations to the schema to ensure the resulting Java classes use names which make sense (this is called customizing the binding).

Anyway, this approach seems to work well. That is:

  • the JAXB version can read a Word document, edit it, and save it again, and Word 2007 can consume the result. See sample.java
  • the resulting classes can be made quite intuitive (though there is more tweaking to do)
  • unknown elements can be round-tripped

The JAXB version of docx4j is in subversion at the following branch:

http://dev.plutext.org/trac/docx4j/browser/branches/jaxb

You can’t just check out the branch and use it right now, since classes need to be generated. There are maybe 50 generated, but I have only committed 3 of them.

Where to from here?

If this approach continues to look promising, we are likely to move the JAXB code into the trunk, and upgrade plutext-server and docx4all to use it.

dev.plutext.org is live!

November 15th, 2007 by Jason

Today our server went live!

It’s our own server, co-located, since co-location still seems to provide much better bang for buck than renting space on someone else’s box.

The wiki contains some basic content, and docx4j has been loaded into subversion.

We’ve also posted a job on elance to have the font and colour schemes made consistent across the WordPress, trac and phpBB components. So the site should look more like a coherent whole in a week or two. The next job will be the logo…

Now is a good time to thank the people at Edgewall and Thoughtworks (for Buildix), WordPress, and phpBB for writing the software which makes this site possible – not to mention the rest of the stack: Apache, Linux, etc.

Cheers!