Archive for the ‘OOXML’ Category

Docx4jHelper Word AddIn

December 4th, 2014 by Jason

The dream:

  • View Open XML right from within Word, and see what happens when you edit it.
  • Or generate corresponding docx4j Java code, with deep links into the corresponding docx4j source code and Open XML spec.

Regular users of docx4j will be aware of our webapp, which, amongst other things, generates docx4j Java code for the specified Open XML in your sample docx/pptx/xlsx.

The webapp is useful, but it has a few drawbacks:

  • you have to upload your docx/pptx/xlsx, which takes time
  • if your docx/pptx/xlsx contains sensitive data, you probably want to remove that first
  • the webapp might be down

To address these issues, we’re now offering the code gen functionality as a Word AddIn.

With the Word AddIn installed, you can generate code without your docx leaving your computer.

This is all feasible because docx4j can run as a DLL in a .NET project, thanks to IKVM!

Where to get it

You can download the installer.  After you complete the landing form (using your corporate email address, not gmail etc), you’ll be sent a download link.

Getting Started

After a successful installation and a restart of Word, you should see a “Docx4j” menu, containing:

To generate code, first press the “Load Helper” button.

You’ll see the following form:

It’s inviting you to start a local web server which will run the same code as the existing webapp.  Just choose a port you aren’t already using.  If for some reason you want to browse using Internet Explorer (as opposed to your default browser), check the box.

It’ll take a little while to start the server; you’ll see a dialog when it’s started.

Now you can generate code.  To do so, select something in your docx, then click the “Generate Code” button.

After a while, a window will open in your web browser, and you’ll see:

That’s the view of the docx package, which will be familiar if you’ve used the webapp.   For how to generate code from here, see our earlier post.

Code generating is done on your computer.  (But note, the links on that page to docx4j source code and the OpenXML spec are external links)

What about the “Edit OpenXML” button?

If you select something in your docx, then click that button, after a while (maybe 30 secs the first time!), you’ll see the corresponding XML in an editor window:

You can go ahead and edit it, then click the “Apply” button.

If Word likes your XML, you’ll see your changes on the document surface.  Ctrl Z should work for undo.

So there are 2 ways to see the underlying XML

The first way we described uses your web browser; the second is a Windows Form.

These two views have different features; maybe a later release will unify them?

What about pptx, xlsx?

There’s no reason in principle we couldn’t make a similar AddIn for PowerPoint and Excel.  In fact, we plan to make these, once any teething issues have been ironed out in the Word AddIn.

In the meantime, for pptx and xlsx, you can continue to use the webapp.

Help, Suggestions and other Discussion

If you are a Plutext customer experiencing an issue, please email

Otherwise, please check the Docx4jHelper AddIn forum.

We’ve got some ideas for where the AddIn goes from here, but we’d love to hear yours.

SQL Server Reporting Services (SSRS) emits dodgy Word docx documents

May 12th, 2014 by Jason

By now we’re used to products which emit docx files which are umm, not .. quite .. right.

But it’s more noteworthy when the product in question is from Microsoft.  After all, it’s their file format (ECMA etc standardisation notwithstanding).

The product in question here is SQL Server Reporting Services 2012 and its Word export.

It seems they didn’t bother to validate their documents (eg using Open XML SDK 2.0 Productivity Tool):

Apparently there’s a reason for this:

“Word and SSRS treat page headers and footers differently. Word actually positions them inside the page margins, whereas SSRS positions them inside the area that the margins surround. As a result, in Word, the page margins do not control the distance between the top edge of the page and that of the page header (or similarly for the page footer). Instead, Word has separate “Header from Top” and “Footer from Bottom” properties to control those distances. Since RDL does not have equivalent properties, the Word renderer sets these properties to zero.”
But the problem is that it is actually setting them to blank (as opposed to zero), which is not valid.

Another problem:

JAXB doesn’t like invalid documents, so docx4j has to fix these sorts of things before it can construct a content model.  (Maybe that’s why SSRS calls it Word export, not docx export:- they just check Word can open the document, then call it job done)
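The kind of pre-processing described above can be sketched with the JDK’s DOM API. This is not docx4j’s actual repair code; it is a minimal illustration which assumes the blank values sit in the w:header and w:footer attributes of w:pgMar. The class and method names (BlankMarginFixer, fixBlankMargins) are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of docx4j): repairs blank page margin attributes. */
final class BlankMarginFixer {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private BlankMarginFixer() {}

    /** Parses XML with namespace awareness; wraps checked exceptions for brevity. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Sets blank w:header / w:footer attributes on every w:pgMar to "0".
     * Assumption: these are the attributes left blank by the exporter.
     *
     * @return the number of attributes that were repaired
     */
    static int fixBlankMargins(Document doc) {
        int fixed = 0;
        NodeList pgMars = doc.getElementsByTagNameNS(W_NS, "pgMar");
        for (int i = 0; i < pgMars.getLength(); i++) {
            Element pgMar = (Element) pgMars.item(i);
            for (String name : new String[] {"header", "footer"}) {
                Attr attr = pgMar.getAttributeNodeNS(W_NS, name);
                if (attr != null && attr.getValue().trim().isEmpty()) {
                    attr.setValue("0"); // zero is what the renderer is said to intend
                    fixed++;
                }
            }
        }
        return fixed;
    }

    public static void main(String[] args) {
        String xml = "<w:sectPr xmlns:w=\"" + W_NS + "\">"
                + "<w:pgMar w:top=\"1440\" w:header=\"\" w:footer=\"\"/></w:sectPr>";
        System.out.println("attributes repaired: " + fixBlankMargins(parse(xml)));
    }
}
```

Running a fix like this over the raw XML before unmarshalling means the binding layer only ever sees values it can convert to numbers.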

There are other problems with SSRS docx which the Productivity Tool doesn’t report.

Take a look at the styles part:

Notice anything wrong?  It’d be better if the EmptyCellLayoutStyle had @w:styleId and @w:type, like so:

It’d also be nice if it defined the “Normal” style it is basedOn!

docx4j and other consumers could/should detect such problems and degrade gracefully in the face of them, but Microsoft (of all companies!) should exercise better quality control.
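As a rough idea of what detecting such problems might look like, here is a minimal sketch using only the JDK’s DOM API. It is not how docx4j does it; StyleSanityCheck and findProblems are hypothetical names, and the check covers just the three issues mentioned above (missing w:styleId, missing w:type, and a basedOn that points at an undefined style).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical checker (not part of docx4j) for a styles part. */
final class StyleSanityCheck {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private StyleSanityCheck() {}

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns one human-readable message per problem found. */
    static List<String> findProblems(Document stylesPart) {
        NodeList styles = stylesPart.getElementsByTagNameNS(W_NS, "style");

        // First pass: collect every style id that is actually defined.
        Set<String> definedIds = new HashSet<>();
        for (int i = 0; i < styles.getLength(); i++) {
            String id = attr((Element) styles.item(i), "styleId");
            if (!id.isEmpty()) {
                definedIds.add(id);
            }
        }

        // Second pass: report missing attributes and dangling basedOn references.
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < styles.getLength(); i++) {
            Element style = (Element) styles.item(i);
            String id = attr(style, "styleId");
            String name = childVal(style, "name");
            String label = !id.isEmpty() ? id : (name != null ? name : "#" + (i + 1));
            if (id.isEmpty()) {
                problems.add("style " + label + " has no w:styleId");
            }
            if (attr(style, "type").isEmpty()) {
                problems.add("style " + label + " has no w:type");
            }
            String parent = childVal(style, "basedOn");
            if (parent != null && !definedIds.contains(parent)) {
                problems.add("style " + label + " is basedOn undefined style " + parent);
            }
        }
        return problems;
    }

    private static String childVal(Element style, String localName) {
        NodeList children = style.getElementsByTagNameNS(W_NS, localName);
        return children.getLength() == 0 ? null : attr((Element) children.item(0), "val");
    }

    private static String attr(Element e, String localName) {
        String v = e.getAttributeNS(W_NS, localName);
        return v == null ? "" : v;
    }

    public static void main(String[] args) {
        String xml = "<w:styles xmlns:w=\"" + W_NS + "\"><w:style>"
                + "<w:name w:val=\"EmptyCellLayoutStyle\"/><w:basedOn w:val=\"Normal\"/>"
                + "</w:style></w:styles>";
        findProblems(parse(xml)).forEach(System.out::println);
    }
}
```

A consumer could log these messages and carry on (for example by treating an undefined parent as the document default) rather than failing outright.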

docx4j 2.7.0 released

July 8th, 2011 by Jason

I’m pleased to announce the release today of docx4j 2.7.0.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx.  It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET.   It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

  • Improvements to Maven build
  • ContentAccessor interface
  • AlteredParts: identify parts in this pkg which are new or altered; Patcher
    which adds new or altered parts.
  • Support for .glox SmartArt package (/src/glox/)
  • JAXB RI 2.2.3 compatibility
  • OpenDoPE support improvements

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Checkout the source from SVN (use the pom.xml file to satisfy the dependencies eg with m2eclipse, or download them from one of the links above)

Maven: Please see forum for details (since XML doesn’t paste nicely here right now).

Dependency changes

Antlr is now required for OpenDoPE processing; this gives us better XPath processing.  The required jars are:

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

A request to docx4j users

If you are happily using docx4j, it would be great if you could reply to this post with some words of recommendation for others who might be wondering whether docx4j is a good choice. I know there are thousands of you out there :-)

Some users have been kind enough to make such statements already; these may be found on the trac homepage.

Of course, there are a number of other ways you can contribute back.  Please consider doing so, especially if you think you might find yourself looking for support from volunteers in the docx4j forums.

modified Office Open XML schema now in Subversion

April 30th, 2008 by Jason

We’ve been tweaking the schemas – especially wml.xsd – to make the Java classes generated by JAXB’s xjc more user-friendly.

I’m satisfied that this is permitted by ECMA, so I’ve put the modified schemas into Subversion.

For anyone interested in the reasoning, the Ecma website says:

“Ecma Standards and Technical Reports are made available to all interested persons or organizations, free of charge and copyright, in printed form and, as files in Acrobat (R) PDF format.”

For this to apply, it needs to be an “Ecma Standards or Technical Report”.

That page says “A Standard or a Technical Report is a formal document prepared by an Ecma Technical Committee and approved by the Ecma General Assembly.”

Office Open XML was so approved.

So the only possible glitch would be words to the effect that the schemas aren’t part of the official standard.

I’ve checked the language in parts 2 and 4 (of the Ecma TC45 Final Draft) which says “This Office Open XML specification includes a family of schemas … The normative definition of these schemas reside in an accompanying file named … which is distributed in electronic form only.”

Which makes it clear the schemas are part of the Standard :)

So the ECMA standard’s XSD are “free of copyright” – an explicit waiver of copyright. So no problemo in creating derivative works.

docx4j now released under Apache License

April 10th, 2008 by Jason

We’re pleased to announce that docx4j is now available under the Apache License (v2).

This is a response to feedback on an earlier post.  This is also the last license change we’ll be making to docx4j. Word documents are mostly manipulated in corporate environments.  This change removes barriers to adoption of docx4j by business and institutions.

docx4j uses to efficiently turn streams inside out. That package had been available under the GPL.  Its author, Merlin Hughes, today kindly released it under v2 of the Apache License, so we now use it under that license.

There’s a new nightly build of docx4j available from the downloads page if you want to grab it.  This build can load/save to/from a WebDAV server – more on that in another post.

Microsoft Office Online .. soon?

March 3rd, 2008 by Jason

Nick Carr has sparked speculation that Microsoft will soon unveil its strategy for bringing its Office suite online – which to me means a way of working with Office documents on any computer which has an internet connection.  If you are connected, I’d expect you to be able to collaborate with others in real time; if you are not connected, I’d expect the software to work in offline mode.

When I say “any computer”, I don’t mean to restrict that to any particular operating system (and indeed, Silverlight runs on the Mac, and Microsoft has announced it is working with Novell on a Linux implementation).  What good is collaboration software if some of the people you need to collaborate with can’t play?

I thought I’d make some predictions about the business model.

There seem to be 2 key questions:

  •  does each end user pay, or does a collaboration originator pay for the right to invite a certain number of collaborators?
  • what support for Mac and Linux users, and when?

Whether each individual user is required to pay, or the originator pays, will reveal much about how Microsoft regards its online offering.  The latter model, that the person who originates a collaboration session pays for a certain number of people to be able to collaborate (ie whatever their platform), would show that their focus is firmly on collaboration.  This is the model we would use for any plutext SAAS offering (available to people who don’t want to install plutext server internally, for free or a fee). 

Here are my predictions:

  1. Enterprise version (ie behind the firewall).  There will be a version an enterprise can install on its Sharepoint server, for those businesses which are not comfortable with their documents being hosted externally.  I’m sure Microsoft can work out how to let people give access to people outside the firewall as necessary.  An enterprise licensee will be able to invite people outside the enterprise without charge.
  2. Cloud version. I expect there will be a cloud version for SMBs.  I think you will be able to use this for free, provided you have a license for the traditional Office product.  You will definitely need this (2007 version) to originate collaboration around a document (ie invite other users) – unless you are prepared to pay a full price for the online offering.  Maybe anyone will be able to accept a collaboration invitation (ie whether or not they are licensed to use Office), making the “who pays” question moot.  To create a new document (or print it?), I expect you will need to have a licence for the traditional Office product, or pay for the SAAS offering.
  3. Mac and Linux support.  I think Microsoft will offer Mac support sooner or later, but delay any hint of support for Linux for as long as possible.  This is because Linux is much more of a threat than OSX (two reasons: (1) Linux is free, and (2) it is very easy to install it on your existing Windows PC).  That said, they might have it “only on Windows” to try to keep people there – until some critical tipping point is reached.  I would say that even now, the only thing stopping Microsoft from seeking revenues from Linux users are the inevitable press headlines along the lines of “Microsoft admits defeat” that would come with this.  The cost of this in terms of perception would surely outweigh any incremental revenues in the short term.  Mac users may be able to use it for free – provided they had an Office license they were able to associate with their online user ID.  
  4. docx only. The documents which come out of this online service will be docx documents, not binary or RTF.  This will help to make the new format ubiquitous.

I wonder whether the collaboration protocols will be published under the recent interoperability initiative?  If they are, the way would be open for a rich world, in which docx4all could potentially play…  I’d be pleasantly surprised if they were, and there was nothing stopping someone from making a client or server of their own.  If anyone else could create a server, then why not get rid of it altogether and go peer-to-peer?  Maybe, just maybe, the thinking is that it would take forever for someone other than Microsoft to create a fully featured server, so third party implementations are to be encouraged (as is presently the case for OpenXML), since Microsoft’s offering will always be the RollsRoyce implementation which attracts the most usage, with the other implementations adding value to the ecosystem.

The announcement, if/when it comes, will be fascinating!

Styles and numbering

January 11th, 2008 by Jason

This week, thanks to JAXB, we added strongly typed content models for the Styles part, and the Numbering definitions part of a docx.

Have a look at and, used by their respective parts.

Tutorial: opening an existing document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically open and edit an existing document.

OOXML, boolean values and binding

January 6th, 2008 by Jason

ST_OnOff is used extensively in the XML Schema. Here is the link (nice resource!).

Basically, it is used for things which should use the built in boolean schema data type:

This simple type specifies a set of values for any binary (on or off) property defined in a WordprocessingML document.

For example, the b (bold) element has an attribute @val of type ST_OnOff.

There are several problems with how this is done.

The first is that its possible values are “on, 1, or true”. OOXML should just use the XSD boolean data type, which doesn’t allow “on” (or “off”). For related comments, see here, here, and here. Denmark and France seem to be the strongest advocates of the use of xsd:boolean, and I hope they get their way.

The second is that it is left to the specification text to say that if the attribute is omitted, its value is implied to be true. That should be expressed as part of the schema.

For CT_OnOff, it would be:

<xsd:complexType name="BooleanDefaultTrue">
  <xsd:attribute name="val" type="xsd:boolean" default="true" />
</xsd:complexType>

I don’t think Denmark or anyone else made this second point.
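To make the cost of the first two problems concrete: anyone consuming the unmodified schema ends up hand-writing something like the following. This is only a sketch, and OnOffValues is a hypothetical helper rather than anything in docx4j. It accepts the “on”, “1” and “true” spellings plus their “off”, “0” and “false” counterparts, and treats a missing attribute as true, as the specification text requires.

```java
/** Hypothetical helper (not part of docx4j): interprets an ST_OnOff attribute. */
final class OnOffValues {

    private OnOffValues() {}

    /**
     * @param val the raw @val attribute value, or null if the attribute is absent
     * @return the boolean the property resolves to
     * @throws IllegalArgumentException if val is not a recognised ST_OnOff value
     */
    static boolean parse(String val) {
        if (val == null) {
            // The default lives in the spec text, not the schema,
            // so every consumer has to encode it by hand.
            return true;
        }
        switch (val.trim()) {
            case "on":
            case "1":
            case "true":
                return true;
            case "off":
            case "0":
            case "false":
                return false;
            default:
                throw new IllegalArgumentException("Not an ST_OnOff value: " + val);
        }
    }

    public static void main(String[] args) {
        // <w:b/> (no @val) and <w:b w:val="on"/> both mean bold.
        System.out.println(parse(null));
        System.out.println(parse("on"));
        System.out.println(parse("off"));
    }
}
```

With xsd:boolean and a schema-level default, as in the definition above, a binding framework can supply this logic itself.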

The schema we are using in docx4j to generate classes uses these sorts of definitions instead of ST_OnOff or CT_OnOff.  For CT_OnOff, this results in a BooleanDefaultTrue type, which is used in fields like (for bold):

protected BooleanDefaultTrue b

Which brings me to the third problem with ST_OnOff (and the schema in general), which is that it generates ugly code in JAXB and other binding frameworks (presumably .NET included). The built in schema data types produce much nicer code.

As a general remark, running the schema through JAXB is a good way to find places where the schema can be improved. Schema design goals should include:

  1. that it can be processed out of the box by binding frameworks (since that makes it easier for people to pick up a schema and start using it). [This is not currently the case]
  2. that the schema be expressed in such a way as to generate the simplest code.