Archive for the ‘Microsoft Word’ Category

SQL Server Reporting Services (SSRS) emits dodgy Word docx documents

May 12th, 2014 by Jason

By now we’re used to products which emit docx files which are umm, not .. quite .. right.

But it’s more noteworthy when the product in question is from Microsoft.  After all, it’s their file format (ECMA etc standardisation notwithstanding).

The product in question here is SQL Server Reporting Services 2012 and its Word export.

It seems they didn’t bother to validate their documents (eg using Open XML SDK 2.0 Productivity Tool):

Apparently there’s a reason for this:

“Word and SSRS treat page headers and footers differently. Word actually positions them inside the page margins, whereas SSRS positions them inside the area that the margins surround. As a result, in Word, the page margins do not control the distance between the top edge of the page and that of the page header (or similarly for the page footer). Instead, Word has separate “Header from Top” and “Footer from Bottom” properties to control those distances. Since RDL does not have equivalent properties, the Word renderer sets these properties to zero.”
But the problem is that it is actually setting them to blank (as opposed to zero), which is not valid.

Another problem:

JAXB doesn’t like invalid documents, so docx4j has to fix these sorts of things before it can construct a content model.  (Maybe that’s why SSRS calls it Word export, not docx export:- they just check Word can open the document, then call it job done)
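To make the kind of fix-up concrete, here is a minimal sketch that replaces blank header/footer distances with zero before the XML reaches a strict parser. It is not docx4j's actual code. It assumes the blank values sit in the w:header and w:footer attributes of w:pgMar (the WordprocessingML home of "Header from Top" and "Footer from Bottom"), that the attributes use double quotes, and the class name is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j
final class BlankPgMarFixer {

    // Matches w:header="" or w:footer="" (an attribute with an empty value)
    private static final Pattern BLANK_DISTANCE =
            Pattern.compile("(w:(?:header|footer))=\"\"");

    /** Returns the XML with blank header/footer distances replaced by "0". */
    static String fix(String documentXml) {
        Matcher m = BLANK_DISTANCE.matcher(documentXml);
        return m.replaceAll("$1=\"0\"");
    }

    public static void main(String[] args) {
        // A made-up fragment showing the blank values
        String in = "<w:pgMar w:top=\"1440\" w:header=\"\" w:footer=\"\"/>";
        System.out.println(fix(in));
    }
}
```

A regex is the bluntest possible tool here; a real consumer would more likely repair the attributes while streaming or transforming the part.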

There are other problems with SSRS docx which the Productivity Tool doesn’t report.

Take a look at the styles part:

Notice anything wrong?  It’d be better if the EmptyCellLayoutStyle had @w:styleId and @w:type, like so:

It’d also be nice if it defined the “Normal” style it is basedOn!

docx4j and other consumers could/should detect such problems and degrade gracefully in the face of them, but Microsoft (of all companies!) should exercise better quality control.
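As a rough idea of what detecting such problems could look like, the sketch below scans a styles part for w:style elements lacking @w:styleId or @w:type, and for w:basedOn values that point at a style which isn't defined. It uses only the JDK's DOM parser; the class name and the wording of the messages are my own, and it says nothing about how docx4j itself handles this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of docx4j
final class StylesPartChecker {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns human-readable problems found in the given styles part XML. */
    static List<String> check(String stylesXml) {
        Document doc;
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            doc = dbf.newDocumentBuilder().parse(
                    new ByteArrayInputStream(stylesXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse styles XML", e);
        }

        List<String> problems = new ArrayList<>();
        Set<String> definedIds = new HashSet<>();

        // First pass: flag missing attributes and collect the ids that are defined
        NodeList styles = doc.getElementsByTagNameNS(W_NS, "style");
        for (int i = 0; i < styles.getLength(); i++) {
            Element style = (Element) styles.item(i);
            String id = style.getAttributeNS(W_NS, "styleId");
            if (id.isEmpty()) {
                problems.add("style #" + i + " has no w:styleId");
            } else {
                definedIds.add(id);
            }
            if (style.getAttributeNS(W_NS, "type").isEmpty()) {
                problems.add("style #" + i + " has no w:type");
            }
        }

        // Second pass: every w:basedOn should point at a defined style
        NodeList basedOns = doc.getElementsByTagNameNS(W_NS, "basedOn");
        for (int i = 0; i < basedOns.getLength(); i++) {
            String target = ((Element) basedOns.item(i)).getAttributeNS(W_NS, "val");
            if (!definedIds.contains(target)) {
                problems.add("basedOn refers to undefined style '" + target + "'");
            }
        }
        return problems;
    }
}
```

A consumer that wants to degrade gracefully could log these and carry on, for example by treating a dangling basedOn as if it were absent.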

Hello Maven Central

October 29th, 2011 by Jason

With version 2.7.1, docx4j – a library for manipulating Word docx, PowerPoint pptx, and Excel xlsx XML files in Java – and all its dependencies are available from Maven Central.

This makes it really easy to get going with docx4j.  With Eclipse and m2eclipse installed, you just add docx4j, and you’re done.  No need to mess around with manually installing jars, setting class paths etc.

This post demonstrates that, starting with a fresh OS (Win 7 is used, but these steps would work equally well on OSX or Linux).

Step 1 – Install the JDK

For the purposes of this article, I used JDK 7, but docx4j works with Java 6 and 1.5.

Step 2 – Install Eclipse Indigo (3.7.1)

I normally download the version for J2EE developers. Unzip it and run eclipse.

Step 3 – Install m2eclipse.

In Eclipse, click Help > Install New Software.

Type “” in the “Work with” field as shown:

then follow the prompts.

Step 4 – Create your Maven project

In Eclipse, File > New > Project.., then choose Maven project

You should see:

Check “Create a simple project (skip archetype selection)” then press next.

Allocate group and artifact id (what you choose as your artifact id will become the name of your new project in Eclipse):

Press finish

This will create a project with directories using Maven conventions:

(Note: If your starting point is a new or existing Java project in Eclipse, you can right click on the project, then choose Configure > Convert to Maven project)

Step 5 – Add docx4j to your POM

Double Click on pom.xml

Next click on the dependencies tab, then click the “add dependency” button, and enter the docx4j coordinates as shown in the image below:

The result is this pom:

<project xmlns="" xmlns:xsi="" xsi:schemaLocation="">

Ctrl-S to save it.

m2eclipse may take some time to download the dependencies.

When it has finished, you should be able to see them:

Step 6 – Create

If you made a Maven project as per step 4 above, you should already have src/main/java on your build path.

If not, create the folder and add it.

Now add a new class:

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class HelloMavenCentral {

	public static void main(String[] args) throws Exception {

		// Create a new, empty docx package
		WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

		// Add a paragraph using the "Title" style
		wordMLPackage.getMainDocumentPart()
			.addStyledParagraphOfText("Title", "Hello Maven Central");

		wordMLPackage.getMainDocumentPart().addParagraphOfText("from docx4j!");

		// Now save it
		wordMLPackage.save(new java.io.File(System.getProperty("user.dir") + "/helloMavenCentral.docx") );

	}
}

Step 7 – Click Run

When you click run, all being well, a new docx called helloMavenCentral.docx will be saved.

You can open it in Word (or anything else which can read docx), or unzip it to inspect its contents.
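If you'd rather inspect the contents from Java than unzip by hand, a docx is just a zip file, so the JDK can list its parts. This is a small sketch using java.util.zip only; the class name is made up, and the path in main assumes you ran the example above from the same working directory.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for peeking inside a docx
final class DocxPartLister {

    /** Returns the names of the zip entries (parts) inside the given docx file. */
    static List<String> listParts(String path) {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + path, e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumes the file saved by HelloMavenCentral is in the working directory
        String path = System.getProperty("user.dir") + "/helloMavenCentral.docx";
        for (String name : listParts(path)) {
            System.out.println(name);
        }
    }
}
```

Among the entries you should see [Content_Types].xml and word/document.xml, the latter being the main document part.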

Step 8 – Adding

One final thing. If you plan on creating documents from scratch using docx4j, it is useful to set paper size etc. via a properties file. Put something like the following on your path:

# Page size: use a value from org.docx4j.model.structure.PageSizePaper enum
# eg A4, LETTER
# Page margins: use a value from org.docx4j.model.structure.MarginsWellKnown enum

# Slide size: use a value from org.pptx4j.model.SlideSizesWellKnown enum
# eg A4, LETTER

# These will be injected into docProps/app.xml
# if App.Write=true
# of the form XX.YYYY where X and Y represent numerical values

# These will be injected into docProps/core.xml


# If you haven't configured log4j yourself
# docx4j will autoconfigure it.  Set this to true to disable that

And that’s it. For more information on docx4j, see our Getting Started document.


docx4j 2.7.0 released

July 8th, 2011 by Jason

I’m pleased to announce the release today of docx4j 2.7.0.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx.  It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET.   It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

  • Improvements to Maven build
  • ContentAccessor interface
  • AlteredParts: identify parts in this pkg which are new or altered; Patcher
    which adds new or altered parts.
  • Support for .glox SmartArt package (/src/glox/)
  • JAXB RI 2.2.3 compatibility
  • OpenDoPE support improvements

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Checkout the source from SVN (use the pom.xml file to satisfy the dependencies eg with m2eclipse, or download them from one of the links above)

Maven: Please see forum for details (since XML doesn’t paste nicely here right now).

Dependency changes

Antlr is now required for OpenDoPE processing; this gives us better XPath processing.  The required jars are:

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

A request to docx4j users

If you are happily using docx4j, it would be great if you could reply to this post with some words of recommendation for others who might be wondering whether docx4j is a good choice. I know there are thousands of you out there :-)

Some users have been kind enough to make such statements already; these may be found on the trac homepage.

Of course, there are a number of other ways you can contribute back.  Please consider doing so, especially if you think you might find yourself looking for support from volunteers in the docx4j forums.

Microsoft’s data binding patent

November 20th, 2010 by Jason

I just stumbled across
United States Patent 7730394, Data binding in a word-processing application

It’s Microsoft’s patent on data bound content controls.

It’s a useful description of how it works.

I’m not sure it’s worthy of a patent though.  They reference a lot of prior art, but not my March 2004 paper “XForms for Contract Semantics”, which contains the following binding example:

In consideration of the payment of <xforms:output ref="lineitems/item/price"/>, <xforms:output ref="supplier"/> agrees to deliver
a <xforms:output ref="lineitems/item/name"/> to <xforms:output ref="customer"/> on or before <xforms:output ref="deliverydate"/>.
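To show what such a binding amounts to, here is a toy sketch that resolves each ref against an XML instance and substitutes the result into the text. It is neither an XForms processor nor how Word's content controls work: it simply treats each ref as an XPath relative to the instance's root element. The class name and the instance data in the usage below are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;

// Hypothetical illustration of one-way data binding
final class OutputBinder {

    // Matches <xforms:output ref="..."/> and captures the ref expression
    private static final Pattern OUTPUT =
            Pattern.compile("<xforms:output ref=\"([^\"]+)\"/>");

    /** Replaces each xforms:output in the template with the value its ref selects. */
    static String render(String template, String instanceXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(instanceXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();

            Matcher m = OUTPUT.matcher(template);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                // Evaluate the ref relative to the instance's root element
                String value = xpath.evaluate(m.group(1), root);
                m.appendReplacement(out, Matcher.quoteReplacement(value));
            }
            m.appendTail(out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not bind template", e);
        }
    }

    public static void main(String[] args) {
        // Made-up instance data
        String instance = "<contract><supplier>Acme</supplier>"
                + "<lineitems><item><name>widget</name><price>$100</price></item></lineitems>"
                + "</contract>";
        String template = "In consideration of the payment of "
                + "<xforms:output ref=\"lineitems/item/price\"/>, "
                + "<xforms:output ref=\"supplier\"/> agrees to deliver";
        System.out.println(render(template, instance));
    }
}
```

The point is only that the document text holds pointers into the data rather than copies of it, so changing the XML changes the rendered contract.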

Interestingly to me, Wolters Kluwer referenced my paper in their “Document creation system” patent, but that’s a side note.

I’m a big fan of data-bound content controls.

So much so, in fact, that I’d like to see the same stuff included in ODF and implemented in OpenOffice .. umm .. maybe I mean LibreOffice these days!

That would obviously be more likely if Microsoft didn’t lodge patents for stuff like this.  Who can blame them, you might say, with things like i4i happening to them?  Well, my response is that they should be using their considerable corporate muscle to lobby for patent reform.  In the absence of such efforts, you can only conclude that the innovation-inhibiting patent system suits Microsoft, even though they take the odd hundred million dollar hit from it.

Merging Word documents

November 14th, 2010 by Jason

I’ve written a utility to merge docx documents in Java.  “Merge” as in concatenate/join/append, as opposed to diff/merge (although docx4j does include code to do a diff, if you are looking for that instead).

With the utility, you can take 2 or more Word documents, and join them into one.

Edit Feb 2014. MergeDocx is now part of Plutext’s Docx4j Enterprise Edition.

As Eric White’s blog explained:

This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs. For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid. You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.

With this utility, merging/concatenating documents is as easy as invoking the method:

public WordprocessingMLPackage merge(List<WordprocessingMLPackage> wmlPkgs)

In other words, you pass a list of docx, and get a single new docx back.
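To give a feel for the referential integrity bookkeeping Eric White describes, here is a toy sketch of one small piece of it: when a second document is appended to a first, its comment ids may collide with ids already in use, so they need renumbering (and the same mapping must then be applied both to the references in the paragraphs and to the comments part). This is not MergeDocx code; the class and method are hypothetical, and real merging has many more cases than comments.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, unrelated to the MergeDocx implementation
final class CommentIdRemapper {

    /**
     * Maps each comment id used in the second document to a fresh id
     * that is higher than any id used in the first document.
     */
    static Map<Integer, Integer> remap(Collection<Integer> firstIds,
                                       Collection<Integer> secondIds) {
        int next = firstIds.isEmpty() ? 0 : Collections.max(firstIds) + 1;
        Map<Integer, Integer> mapping = new LinkedHashMap<>();
        for (Integer id : secondIds) {
            // Assign a new id the first time we see each old id
            if (!mapping.containsKey(id)) {
                mapping.put(id, next++);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Both documents happen to use comment ids 0 and 1
        System.out.println(remap(java.util.Arrays.asList(0, 1),
                                 java.util.Arrays.asList(0, 1, 2)));
    }
}
```

Renumbering everything, rather than only the ids that collide, keeps the sketch short; either approach works so long as references and targets are updated together.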

Edit March 2014. You can try the MergeDocx and/or MergePptx functionality via the demo webapp.

This utility takes care of the niggly edge cases for you:

You can also use my MergeDocx utility to process a docx which is embedded as an altChunk.

Without this utility, you had to rely on Word to convert the altChunk to normal content.

That meant you had to round trip your docx through Word, before docx4j could create a PDF or HTML out of it.

Now you don’t.

To process the w:altChunk elements in a docx, you invoke:

public WordprocessingMLPackage process(WordprocessingMLPackage srcPackage)

You pass in a docx containing altChunks, and get a new docx back which doesn’t.
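If you just want to know whether a docx contains any altChunks in the first place, you can look for w:altChunk elements in the main document part. The sketch below does that with the JDK's DOM parser and returns each element's r:id, which is the relationship pointing at the embedded content. It is a standalone illustration with a made-up class name, not part of MergeDocx or docx4j.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for spotting altChunks
final class AltChunkFinder {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String R_NS =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns the r:id of every w:altChunk in the given main document part XML. */
    static List<String> altChunkRelIds(String documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));

            NodeList chunks = doc.getElementsByTagNameNS(W_NS, "altChunk");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < chunks.getLength(); i++) {
                ids.add(((Element) chunks.item(i)).getAttributeNS(R_NS, "id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }
}
```

An empty list means there is nothing for the altChunk processing step to do.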

But wait a minute .. if you can merge Word documents using this tool, why would you ever put an altChunk (containing a docx, as opposed to HTML) into the docx in the first place?

Ordinarily you wouldn’t, you’d just merge with this tool instead.  But there are at least 2 possibilities:

  • some upstream process put the altChunk there, and now you want to process it in docx4j
  • OpenDoPE.  The Open Document Processing Ecosystem convention is being extended in a v2.3 to allow other documents to be injected, and a natural thing is to convert an injection instruction to an altChunk.  Edit Feb 2014: docx4j 3.0.1 can also bind an XML element containing a base64 encoded docx, inserting it into the docx as an AltChunk.  MergeDocx can then convert that content into “real” docx content, suitable for including in a table of contents, or generating HTML or PDF.  The binding is two-way, so user edits in Word can be injected back into the XML (eg for persisting to a database).

There is one place my code differs significantly from how Word processes an altChunk, and that is in section handling.  When Word processes an altChunk, it seems to largely remove sectPr.  So for example, columns will disappear.  But it also might merge headers, so the resulting header contains stuff from the headers of both documents!  My code doesn’t do that: by default, it includes each section, and headers go with sections.

docx4j v2.3.0 released

February 23rd, 2010 by Jason

I’m pleased to announce the release of docx4j v2.3.0

docx4j is an open source (Apache license) project which facilitates the manipulation of Microsoft OpenXML docx (and now pptx) documents in Java, using JAXB.

The main features of this release are support for pptx files, and improvements to HTML export (via NG2), and PDF export (via XSL FO).

For further details, please see the release announcement.

How to try Plutext for yourself

March 3rd, 2009 by Jason

Here is a screencast which walks you through sharing your own document, and trying our collaboration features:


Of course, you can just play with one of the pre-existing shared documents.

The video width is 1280 pixels, so if you are browsing in a narrow window, you’ll need to expand your browser window to see it properly.  (Everybody has screens that wide these days don’t they, unless they are mobile?)

For completeness: