Archive for November, 2010

Microsoft’s data binding patent

November 20th, 2010 by Jason

I just stumbled across
United States Patent 7730394, Data binding in a word-processing application

It’s Microsoft’s patent on data-bound content controls.

It’s a useful description of how they work.

I’m not sure it’s worthy of a patent though. They reference a lot of prior art, but not my March 2004 paper “XForms for Contract Semantics”, which contains the following binding example:

In consideration of the payment of <xforms:output ref="lineitems/item/price"/>, <xforms:output ref="supplier"/> agrees to deliver
a <xforms:output ref="lineitems/item/name"/> to <xforms:output ref="customer"/> on or before <xforms:output ref="deliverydate"/>.
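To make the binding concrete, here is a minimal sketch of how those ref paths could be resolved against instance data, using only the JDK's DOM and XPath classes. The instance document, its values, the wrapping clause element and the class name OutputBindingSketch are all made up for illustration; a real XForms processor does a good deal more (for example, live updates when the instance changes).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class OutputBindingSketch {

    static final String XFORMS_NS = "http://www.w3.org/2002/xforms";

    // Hypothetical instance data; the element names simply mirror the ref paths in the clause.
    static final String INSTANCE =
        "<order>"
      + "<supplier>Acme Pty Ltd</supplier>"
      + "<customer>Jane Citizen</customer>"
      + "<deliverydate>2010-12-01</deliverydate>"
      + "<lineitems><item><name>widget</name><price>$100</price></item></lineitems>"
      + "</order>";

    // The clause from the paper, wrapped in an assumed root element so it parses as XML.
    static final String CLAUSE =
        "<clause xmlns:xforms=\"" + XFORMS_NS + "\">"
      + "In consideration of the payment of <xforms:output ref=\"lineitems/item/price\"/>, "
      + "<xforms:output ref=\"supplier\"/> agrees to deliver a "
      + "<xforms:output ref=\"lineitems/item/name\"/> to <xforms:output ref=\"customer\"/> "
      + "on or before <xforms:output ref=\"deliverydate\"/>."
      + "</clause>";

    private static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Replaces each xforms:output with the string value its ref selects in the instance. */
    static String render(String clauseXml, String instanceXml) {
        try {
            Document clause = parse(clauseXml);
            Element instanceRoot = parse(instanceXml).getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Copy the matches first, since we modify the tree while processing them.
            NodeList matches = clause.getElementsByTagNameNS(XFORMS_NS, "output");
            List<Element> outputs = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                outputs.add((Element) matches.item(i));
            }
            for (Element output : outputs) {
                // Each ref is evaluated relative to the instance's root element.
                String value = xpath.evaluate(output.getAttribute("ref"), instanceRoot);
                output.getParentNode().replaceChild(clause.createTextNode(value), output);
            }
            return clause.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not render clause", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(CLAUSE, INSTANCE));
    }
}
```

A data-bound content control works along similar lines, except that the XPath is stored on the control and points into a custom XML part of the docx.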

Interestingly to me, Wolters Kluwer referenced my paper in their “Document creation system” patent, but that’s a side note.

I’m a big fan of data-bound content controls.

So much so, in fact, that I’d like to see the same stuff included in ODF and implemented in OpenOffice .. umm .. maybe I mean LibreOffice these days!

That would obviously be more likely if Microsoft didn’t lodge patents for stuff like this. Who can blame them, you might say, with things like i4i happening to them? Well, my response is that they should be using their considerable corporate muscle to lobby for patent reform. In the absence of such efforts, you can only conclude that the innovation-inhibiting patent system suits Microsoft, even though they take the odd hundred million dollar hit from it.

docx4j v2.6.0 released

November 19th, 2010 by Jason

I published docx4j 2.6.0 yesterday.

For details, see the forum. This post introduces TraversalUtil, which makes it easier for you to find and change the bits of a docx you want to manipulate.

If you are working with an existing docx, you often need to get a particular bit of the document, and change it somehow.

If you know you want to change the 6th paragraph, say, that’s easy.

But what if you want to find all occurrences of some item, which could occur at various levels of the hierarchy? For example, paragraphs can appear not just in the document body, but also within table cells and in content controls.

docx4j offers a couple of different tools to make this easy.


XPath is a succinct way to select the things you need to change.

Happily, from docx4j 2.5.0, you can use XPath to select JAXB nodes:

MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

String xpath = "//w:p";

List<Object> list = documentPart.getJAXBNodesViaXPath(xpath, false);

These JAXB nodes are live, in the sense that if you change them, your document changes.
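If you want to see why //w:p is so handy without setting up docx4j, the following sketch runs the same kind of expression over a hand-written WordprocessingML fragment using only the JDK's javax.xml.xpath classes. It returns DOM text rather than JAXB objects, so it illustrates the XPath itself rather than getJAXBNodesViaXPath; the fragment and the class name ParagraphXPathSketch are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphXPathSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // One paragraph in the body, one in a table cell, one in a content control (w:sdt).
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
      + "<w:p><w:r><w:t>body</w:t></w:r></w:p>"
      + "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
      + "<w:sdt><w:sdtContent><w:p><w:r><w:t>control</w:t></w:r></w:p></w:sdtContent></w:sdt>"
      + "</w:body></w:document>";

    /** Returns the text content of every node the expression selects, in document order. */
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Map the w: prefix used in the expression to the WordprocessingML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return W_NS.equals(uri) ? "w" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return W_NS.equals(uri)
                        ? Collections.singletonList("w").iterator()
                        : Collections.<String>emptyList().iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//w:p"));                  // all three paragraphs
        System.out.println(select(SAMPLE, "/w:document/w:body/w:p")); // top-level paragraph only
    }
}
```

The descendant axis (//) is what picks up the paragraphs nested in the table cell and the content control; the explicit path only reaches direct children of the body.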

There is a limitation, however: the XPath expressions are evaluated against the XML document as it was when first opened in docx4j. You can update the associated XML document once only, by passing true into getJAXBNodesViaXPath. Updating it again (with current JAXB 2.1.x or 2.2.x) will cause an error.

To work around this bug in JAXB, you can marshal the document, and then unmarshal the result using either:

    public org.docx4j.wml.Document unmarshal(java.io.InputStream is)

    public org.docx4j.wml.Document unmarshal(org.w3c.dom.Element el) 

Both of those will re-create the binder.

That’s not the most efficient approach, so consider voting for JAXB bug 459.

But now we have an alternative…


New to docx4j 2.6.0 is a class TraversalUtil, which is a general approach for traversing the JAXB object tree in the main document part (though it can also be applied to headers, footers etc).

For example, to get a list of hyperlinks, you can do something like:

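PHyperlinkFinder finder = new PHyperlinkFinder();
new TraversalUtil(paragraphs, finder);
// finder.links now holds the hyperlinks found under paragraphs

static class PHyperlinkFinder extends CallbackImpl {
    List<P.Hyperlink> links = new ArrayList<P.Hyperlink>();
    @Override
    public List<Object> apply(Object o) {
        // Collect each hyperlink encountered during the traversal
        if (o instanceof P.Hyperlink) {
            links.add((P.Hyperlink) o);
        }
        return null;
    }
}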

This approach is used extensively in the MergeDocx extension I discussed in my previous post. It is now also the basis of the OpenMainDocumentAndTraverse sample, so see that for another example of how to use it.

The example above simply finds relevant bits of the docx; you could also modify the objects encountered if you want.
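If the callback idea is new to you, here is a stripped-down sketch of the pattern in plain Java, with no docx4j on the classpath. Node, Callback and walk are hypothetical stand-ins invented for this illustration; they are not docx4j classes, and the real TraversalUtil has more to it (it works on JAXB objects and lets the callback control which children get visited).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TraversalSketch {

    /** Hypothetical stand-in for a JAXB content object; not a docx4j class. */
    static final class Node {
        final String kind;
        final List<Node> children;

        Node(String kind, Node... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    /** Invoked once for every object encountered during the walk. */
    interface Callback {
        void apply(Node node);
    }

    /** Depth-first walk in document order: visit a node, then its children. */
    static void walk(List<Node> nodes, Callback callback) {
        for (Node node : nodes) {
            callback.apply(node);
            walk(node.children, callback);
        }
    }

    /** A finder in the spirit of PHyperlinkFinder: collect every node of the given kind. */
    static List<Node> find(Node root, String kind) {
        List<Node> found = new ArrayList<>();
        walk(Arrays.asList(root), node -> {
            if (node.kind.equals(kind)) {
                found.add(node);
            }
        });
        return found;
    }

    /** A body with paragraphs at three different depths, two of them containing hyperlinks. */
    static Node sample() {
        return new Node("body",
            new Node("p", new Node("hyperlink")),
            new Node("tbl", new Node("tr", new Node("tc", new Node("p", new Node("hyperlink"))))),
            new Node("sdt", new Node("p")));
    }

    public static void main(String[] args) {
        System.out.println(find(sample(), "hyperlink").size()); // prints 2
        System.out.println(find(sample(), "p").size());         // prints 3
    }
}
```

Because the callback sees the actual objects rather than copies, it could just as easily change them as collect them.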

Merging Word documents

November 14th, 2010 by Jason

I’ve written a utility to merge docx documents in Java.  “Merge” as in concatenate/join/append, as opposed to diff/merge (although docx4j does include code to do a diff, if you are looking for that instead).

With the utility, you can take 2 or more Word documents, and join them into one.

Edit Feb 2014. MergeDocx is now part of Plutext’s Docx4j Enterprise Edition.

As Eric White’s blog explained:

This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs. For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid. You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.
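To give a feel for that bookkeeping, here is a toy sketch in plain Java of just one of those references: comment ids. Para, Doc and append are hypothetical, heavily simplified stand-ins made up for this illustration; this is not how MergeDocx is implemented, and a real docx has many more kinds of cross-reference (styles, numbering, images, footnotes and so on).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CommentMergeSketch {

    /** Hypothetical paragraph: its text plus the id of the comment it references (or null). */
    static final class Para {
        final String text;
        final Integer commentId;

        Para(String text, Integer commentId) {
            this.text = text;
            this.commentId = commentId;
        }
    }

    /** Hypothetical document: paragraphs plus a "comments part" keyed by comment id. */
    static final class Doc {
        final List<Para> paragraphs = new ArrayList<>();
        final Map<Integer, String> comments = new LinkedHashMap<>();
    }

    /** Appends second to first, renumbering second's comments so ids cannot collide. */
    static Doc append(Doc first, Doc second) {
        Doc merged = new Doc();
        merged.paragraphs.addAll(first.paragraphs);
        merged.comments.putAll(first.comments);

        // Start numbering after the highest id already used by the first document.
        int nextId = 0;
        for (int id : first.comments.keySet()) {
            nextId = Math.max(nextId, id + 1);
        }

        // Copy the second document's comments under fresh ids, remembering old -> new.
        Map<Integer, Integer> renumbered = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> entry : second.comments.entrySet()) {
            renumbered.put(entry.getKey(), nextId);
            merged.comments.put(nextId, entry.getValue());
            nextId++;
        }

        // Rewrite each paragraph's reference to point at the renumbered comment.
        // A reference that was already dangling in the source ends up as null here.
        for (Para para : second.paragraphs) {
            Integer newId = para.commentId == null ? null : renumbered.get(para.commentId);
            merged.paragraphs.add(new Para(para.text, newId));
        }
        return merged;
    }

    /** True if every comment reference in the document points at an existing comment. */
    static boolean referencesResolve(Doc doc) {
        for (Para para : doc.paragraphs) {
            if (para.commentId != null && !doc.comments.containsKey(para.commentId)) {
                return false;
            }
        }
        return true;
    }

    /** Builds a one-paragraph document whose paragraph references comment 0. */
    static Doc sample(String label) {
        Doc doc = new Doc();
        doc.comments.put(0, "comment from " + label);
        doc.paragraphs.add(new Para("paragraph from " + label, 0));
        return doc;
    }
}
```

Both sample documents use comment id 0, so naively concatenating the paragraphs would leave the second paragraph pointing at the first document's comment. After append, it points at id 1 instead.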

With this utility, merging/concatenating documents is as easy as invoking the method:

public WordprocessingMLPackage merge(List<WordprocessingMLPackage> wmlPkgs)

In other words, you pass a list of docx, and get a single new docx back.

Edit March 2014. You can try the MergeDocx and/or MergePptx functionality via the demo webapp.

This utility takes care of the niggly edge cases for you:

You can also use my MergeDocx utility to process a docx which is embedded as an altChunk.

Without this utility, you had to rely on Word to convert the altChunk to normal content.

That meant you had to round trip your docx through Word, before docx4j could create a PDF or HTML out of it.

Now you don’t.

To process the w:altChunk elements in a docx, you invoke:

public WordprocessingMLPackage process(WordprocessingMLPackage srcPackage)

You pass in a docx containing altChunks, and get a new docx back which doesn’t.

But wait a minute .. if you can merge Word documents using this tool, why would you ever put an altChunk (containing a docx, as opposed to HTML) into the docx in the first place?

Ordinarily you wouldn’t, you’d just merge with this tool instead.  But there are at least 2 possibilities:

  • some upstream process put the altChunk there, and now you want to process it in docx4j
  • OpenDoPE.  The Open Document Processing Ecosystem convention is being extended in a v2.3 to allow other documents to be injected, and a natural thing is to convert an injection instruction to an altChunk.  Edit Feb 2014: docx4j 3.0.1 can also bind an XML element containing a base64 encoded docx, inserting it into the docx as an AltChunk.  MergeDocx can then convert that content into “real” docx content, suitable for including in a table of contents, or generating HTML or PDF.  The binding is two-way, so user edits in Word can be injected back into the XML (eg for persisting to a database).

There is one place my code differs significantly from how Word processes an altChunk, and that is in section handling.  When Word processes an altChunk, it seems to largely remove sectPr.  So for example, columns will disappear.  But it also might merge headers, so the resulting header contains stuff from the headers of both documents!  My code doesn’t do that: by default, it includes each section, and headers go with sections.