Apr 13 2016

PDF/A-2b compliant Word to PDF

Plutext’s commercial PDF Word/docx Converter now produces fully PDF/A-2b compliant PDF output.

We say this having tested its output using http://verapdf.org/ “a purpose-built, open source, file-format validator covering all PDF/A parts and conformance levels.”

You can try our PDF Converter now, at http://converter-eval.plutext.com/

Sep 26 2015

Aspose.Confusion in Words

In June, Aspose’s Shoaib Khan published a blog post purporting to cover features available in Aspose.Words for Java but not docx4j.

It is either breathtaking or amusing in its inaccuracy, depending on whether you think it was born of deceit or ineptitude.  Either way, it’s a caution to anyone considering drinking the Aspose.Kool-Aid!

Here I’ll go through his claims one by one.

As a general comment though, it is worth remembering that with docx4j, you can do pretty much anything the docx/pptx/xlsx file formats allow.  If docx4j doesn’t have a high level API for something you want to do, you can always implement it yourself, thanks to docx4j’s lower level JAXB-based APIs.  And docx4j is real ASLv2 open source, so you can use the source Luke!

Without further ado…

Set Page Borders.
Here Aspose seems to be talking about section properties (ie margins etc).

Shoaib implies you can’t control these in docx4j.  Of course you can!  You can add or remove sections, or modify the settings of an existing section.

Track Changes in Documents.
Their AcceptAllRevisions method is said to be similar to Word’s “Accept All Changes”.

Docx4j doesn’t provide a high level API for doing this (since users haven’t asked for it), but a user could implement this for his/herself easily enough using XSLT or docx4j’s TraversalUtil.  You could start with this XSLT
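To give a feel for the XSLT route, here is a minimal sketch using only the JDK’s built-in XSLT support.  It is not necessarily the stylesheet referred to above, and it is deliberately simplified: it unwraps w:ins and drops w:del, but ignores moves, formatting changes (w:rPrChange and friends) and deleted paragraph marks, which a real “Accept All Changes” would also have to handle.  The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: accepts simple insertions/deletions in WordprocessingML.
class AcceptRevisionsSketch {

    // Identity transform plus two overrides: unwrap w:ins, drop w:del.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Accept an insertion: keep what is inside, lose the wrapper.
        + "<xsl:template match='w:ins'><xsl:apply-templates select='node()'/></xsl:template>"
        // Accept a deletion: the deleted content goes away.
        + "<xsl:template match='w:del'/>"
        + "</xsl:stylesheet>";

    /** Returns the given XML (eg the contents of word/document.xml) with revisions accepted. */
    static String acceptAll(String documentXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(documentXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not apply the stylesheet", e);
        }
    }
}
```

You would run something like this over the main document part, and over headers, footers, footnotes etc if they can contain tracked changes too.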

Using Control Characters.
This example is a bit bizarre, because in a docx, specific elements w:br and w:cr are used for line breaks;  OpenXML follows the usual XML rules for whitespace.

Split Tables.
This example shows the steps a user would follow to split a table into two.  Basically, clone the existing table to make a new table with the same properties, then move rows from the first table to the second.

Of course you can do the same thing in docx4j!
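As a rough illustration of those steps, here is a sketch which works on the raw WordprocessingML using the JDK’s DOM API, so it runs without any libraries.  With docx4j you would do the equivalent on the JAXB objects (copy the Tbl, then move Tr objects between the two content lists).  The class and method names are invented for this example; it assumes the rows are direct children of w:tbl (not wrapped in content controls), and it puts an empty paragraph between the two tables, since Word tends to treat directly adjacent tables as one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: splits a w:tbl element into two tables.
class SplitTableSketch {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Namespace-aware parse, with checked exceptions wrapped for brevity. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Moves rows from index firstRowOfSecond onwards into a new table placed after tbl. */
    static Element split(Element tbl, int firstRowOfSecond) {
        Element second = (Element) tbl.cloneNode(false);   // same element, no children yet
        List<Element> rows = new ArrayList<>();
        for (Node n = tbl.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !W.equals(n.getNamespaceURI())) continue;
            String name = n.getLocalName();
            if ("tr".equals(name)) {
                rows.add((Element) n);
            } else if ("tblPr".equals(name) || "tblGrid".equals(name)) {
                second.appendChild(n.cloneNode(true));      // copy the table properties
            }
        }
        if (firstRowOfSecond <= 0 || firstRowOfSecond >= rows.size()) {
            throw new IllegalArgumentException("Split point must leave rows in both tables");
        }
        for (int i = firstRowOfSecond; i < rows.size(); i++) {
            second.appendChild(rows.get(i));                // appendChild moves the row
        }
        // Separate the two tables with an empty paragraph.
        String prefix = tbl.getPrefix();
        Element p = tbl.getOwnerDocument().createElementNS(W, prefix == null ? "p" : prefix + ":p");
        Node parent = tbl.getParentNode();
        parent.insertBefore(p, tbl.getNextSibling());
        parent.insertBefore(second, p.getNextSibling());
        return second;
    }
}
```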

Repeat Table Header Rows on Pages.
This example is just about setting the header row property.  See http://stackoverflow.com/questions/14605264/repeat-table-header
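In the file format, the header row property is a w:tblHeader element inside the row’s w:trPr.  The sketch below sets it using plain JDK DOM rather than docx4j’s object model, purely so it is self-contained; the class and method names are made up.  It assumes the row has no tracked row-level changes in its w:trPr (those have to come after w:tblHeader).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: marks a w:tr as a repeating header row.
class HeaderRowSketch {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** First direct child of parent in the w: namespace with the given local name, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Ensures the row has w:trPr/w:tblHeader; does nothing if it is already there. */
    static void markAsHeader(Element tr) {
        Document doc = tr.getOwnerDocument();
        String prefix = tr.getPrefix() == null ? "" : tr.getPrefix() + ":";
        Element trPr = child(tr, "trPr");
        if (trPr == null) {
            trPr = doc.createElementNS(W, prefix + "trPr");
            // w:trPr comes before the cells; only w:tblPrEx may precede it.
            Element tblPrEx = child(tr, "tblPrEx");
            Node ref = (tblPrEx != null) ? tblPrEx.getNextSibling() : tr.getFirstChild();
            tr.insertBefore(trPr, ref);
        }
        if (child(trPr, "tblHeader") == null) {
            trPr.appendChild(doc.createElementNS(W, prefix + "tblHeader"));
        }
    }
}
```

Word only repeats header rows which start at the top of the table, so you’d normally apply this to the first row (or the first few rows).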

Clone Documents.
Cloning a document.  Aspose suggest you can’t do this with docx4j? WTF?! OpcPackage’s clone() method has been there since 2.7.1

docx4j also includes code for making partial copies, where less than a full clone is required.

Moving the Cursor in Document.
OpenXML is of course XML, which is hierarchical.  docx4j uses JAXB to give you an object model representation of that.  The hierarchical structure is basically nested lists.  Lots of stuff boils down to finding a position in a list, and then inserting or deleting etc using the Java collections API.  To find that list/position, you’d typically use docx4j’s powerful traversal functionality.   Or you can use XPath (in JAXB, the objects are bound to the underlying XML).
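To make the “position in a list” point concrete, here is a tiny sketch using nothing but the Java collections API.  The strings stand in for the JAXB objects (paragraphs, tables etc) you would find in a docx4j content list such as the one returned by getContent(); the helper name is invented for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: "move to a position, then insert" is just list manipulation.
class InsertAtPosition {

    /**
     * Inserts newItem straight after the first element matching the predicate.
     * Returns the index it was inserted at, or -1 if nothing matched.
     */
    static int insertAfter(List<Object> content, Predicate<Object> match, Object newItem) {
        for (int i = 0; i < content.size(); i++) {
            if (match.test(content.get(i))) {
                content.add(i + 1, newItem);
                return i + 1;
            }
        }
        return -1;
    }
}
```

“Start of document” is then just index 0 of the body’s content list, and “end of document” is a plain add().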

Aspose has some notion of cursor position, so you can move to the start or end of the document.  This may appeal to people with a VBA background, but in practice it is of little use.

Protect Documents.
Whilst it is true that in docx4j 3.2.0 there was no high level API for the functionality Microsoft Word groups under Protect Document (Mark as Final, Encrypt with Password, Restrict Editing etc), this is available now in the 3.3.0 previews:

Working with Digital Signatures.
As with the other Protect Document features, there was no high level API in 3.2.0.  That’s not to say you couldn’t do it, but there’s a nice API for this in the commercial Enterprise Ed. (forthcoming v3.3.0)

Check Format Compatibility.
This seems to be restricted to knowing what type of document you are working with; Aspose says it doesn’t validate the file format.

Per the specs, an OPC package has a content type.  In docx4j, docx/dotx/docm etc are all represented by WordprocessingMLPackage, but you can distinguish between them by calling getContentType(); the value will be one of:


Load Text File.
Whoopey do.  Apparently you can import plain text using Aspose’s expensive software!

Of course that is trivial with docx4j.  Docx4j also supports converting altChunks to native WordML content.  For XHTML altChunks, you need docx4j-ImportXHTML;  support for altChunks of type docx is an Enterprise level feature.

Specify Default Fonts.
The way the default font works in WordprocessingML is more complicated than you might expect, in that there are a few different ways you could affect it (via the theme part or the styles part).

That said, in real life, this doesn’t tend to be a problem.  With docx4j, you can easily set  w:docDefaults/w:rPrDefault/w:rPr/w:rFonts in your styles part, if you want to.

Working with Tables:
Autofit Setting to Tables.
This is just about the tblLayout setting: w:tblPr/w:tblLayout/@w:type, which you can access via TblPr’s get/setTblLayout

Joining Tables in Document.
I can’t recall anyone ever asking for docx4j to provide a high level API to do this, but it could be added.  In the meantime, docx4j allows you to do anything with tables which the file format allows, including joining tables.

Mail Merge
Mail Merge from XML Data Source.
docx4j provides a high level API for working with legacy MERGEFIELD fields.

If you wanted to fill those fields with data from XML, you could do that easily enough.

Where docx4j really shines though, is in its support for content control data binding.  In that approach, introduced by Microsoft in 2007, you have a bidirectional XPath mapping between content controls in the document, and an XML file.
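If you haven’t seen this before: each bound content control carries a w:dataBinding element whose w:xpath attribute points into a custom XML part.  Word (and docx4j) resolve that XPath for you; the sketch below just illustrates what such a mapping means, using the JDK’s XPath API on some made-up invoice XML.  The class name and sample XPath are assumptions for the example, and for simplicity it assumes XML without namespaces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration of a bidirectional XPath mapping.
class BindingSketch {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Finds the node an XPath binding points at, or null if there is none. */
    static Node resolve(Document data, String xpath) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, data, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    /** XML to document direction: the value the content control would display. */
    static String read(Document data, String xpath) {
        Node n = resolve(data, xpath);
        return n == null ? null : n.getTextContent();
    }

    /** Document to XML direction: what happens when the user edits the control. */
    static boolean write(Document data, String xpath, String newValue) {
        Node n = resolve(data, xpath);
        if (n == null) return false;
        n.setTextContent(newValue);
        return true;
    }
}
```

For example, with the XML <invoice><customer><name>Joe</name></customer></invoice>, the XPath /invoice[1]/customer[1]/name[1] reads “Joe”, and writing through the same XPath updates the XML in place.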

If you are working with XML, and not forced to work with legacy MERGEFIELDs for some reason, content control data binding is the way to go.

Jun 16 2015

Off topic: Eclipse’s maven from a command line?

You’ve installed Eclipse.

Eclipse includes maven (m2e).

Can you use that Maven from outside Eclipse, or do you need to install maven again/separately?

It turns out you can use it.  Whether it’s worth the effort or not is another question…

You launch maven using the plexus classworlds launcher.

That needs a config file.

The config file (call it m2.conf) contains something like:

main is org.apache.maven.cli.MavenCli from plexus.core

set maven.home default /home/jason/.m2

load /home/jason/eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*.jar

With that, from a project dir, the following is the equivalent of ‘mvn install’:

java -cp ../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/plexus-classworlds-2.5.1.jar:../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar "-Dclassworlds.conf=m2.conf" org.codehaus.plexus.classworlds.launcher.Launcher install

You could make a shell script to do that.  And you could base the shell script on the one included in the maven distribution.  Or more sensibly, you’d just download, install and use maven proper.

That’s if you need it at all.  Right clicking on your project in Eclipse, then choosing Run As, gives you a handy UI for maven-related stuff:

Finally, note it is possible to avoid the plexus classworlds launcher by invoking MavenCli directly:

java -cp "../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar:../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*" org.apache.maven.cli.MavenCli install

If you are going to use that, you might also consider setting maven.home, by adding -Dmaven.home=/your/path/to/where immediately after the classpath.

Jun 16 2015

docx4j from GitHub in Eclipse – 3 years on

In May 2012 we posted docx4j-from-github-in-eclipse. That was more than 3 years ago now, so it’s about time to update that walkthrough :-)

This post is about getting the docx4j source code set up in Eclipse, so you can not just use it, but easily study it as well (and submit pull requests!).  If you have no interest or need to do that, please see hello-maven-central (if you’re already using a recent Eclipse, you can start at step 4) and/or docx4j-3-0-and-maven (but do use our current version 3.2.x).

Preliminaries – JDK

Make sure you have the JDK installed; Java 6 or later.  The JRE alone is not enough, since it doesn’t include a compiler (javac).

Preliminaries – Eclipse

Install Eclipse.  These days, the basic package has everything you need (ie git and maven support):

Git & GitHub

GitHub is docx4j’s authoritative source repository.  Eclipse now includes a git client.  (If you have an older Eclipse, you can install EGit.)  However, it is still handy to have other git clients installed:

  • on Linux (listed first, given git’s provenance), install git using your distribution’s package manager
  • on Windows, the Git BASH shell is handy; as is Atlassian’s SourceTree
  • on OSX, ditto

Clone or Fork?

With Git, there is a difference between cloning and forking.

  • Cloning gives you a copy of the source code you can work on, but without more, no easy way to contribute changes back.
  • Forking sets you up with the source code, and makes it easy to contribute changes back.

If you think you might be making changes to the docx4j source code, you’re probably best to create a fork on GitHub right from the start.

To create a fork, log in to GitHub, visit https://github.com/plutext/docx4j then press the “Fork” button.

Choose your poison

There are 3 steps to installing docx4j:

  1. clone the docx4j repo
  2. install its dependencies
  3. install docx4j project in Eclipse

You can do these 3 steps entirely within Eclipse, but Eclipse by default doesn’t give much feedback as to what it’s doing, so you might wonder whether it’s still working properly.

Since it’s just as easy (or easier) to use the command line, I’ll show that way first:-

Command Line Approach

To do it this way, you’ll need:

  • a git shell, and
  • Maven

Both of these are worth having in any case.

Step 1. To clone docx4j from your git shell, use the github URL for docx4j (your fork or Plutext’s):

$ git clone -b master --single-branch https://github.com/plutext/docx4j docx4j
Cloning into 'docx4j'...
remote: Counting objects: 42008, done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 42008 (delta 23), reused 7 (delta 0), pack-reused 41946
Receiving objects: 100% (42008/42008), 61.03 MiB | 128.00 KiB/s, done.
Resolving deltas: 100% (25108/25108), done.

You should now have a docx4j directory, containing the docx4j source code.

Step 2. Next, to get docx4j’s dependencies, you’ll need Maven.

So first, install Maven (if you don’t have it already).  Please see the instructions at maven.apache.org (actually, you’ve already got Maven in Eclipse, but it’s a bit hard to use from the command line).

Now you can go into your docx4j directory, and type:

mvn install -DskipTests=true

You’ll see Maven download docx4j’s dependencies.

Step 3. Now you are ready to start Eclipse.

Because docx4j includes Eclipse project definition files, you can import the docx4j project.

From the File menu, click Import, then Existing Projects into Workspace:

Browse to your docx4j directory:

Then click Finish.

Now the project should be set up correctly.  If you see errors, please refer further below for troubleshooting.

Eclipse only approach

Step 1. clone the docx4j git repo in Eclipse; for this you need the Git Repositories View:

Window > Show View > Other > Git > Git Repositories View

Click “Clone a Git repository” then enter the URI for docx4j (your fork, or Plutext’s), then click Next.

The master branch is probably all you need (though Eclipse will probably fetch all the others at some point anyway!)

Step 2. On the next screen, you can tick “Import all existing projects after clone finishes”.  (If you don’t do that, you’ll have to manually File > Import, then Existing Projects into Workspace, as explained above.)

Step 3. Eclipse will now start building the project; first Maven will get the dependencies.

This may take a while … to see what Eclipse is doing while it displays the status “Building workspace”, from the Console view, click the drop down to see the Maven Console:

There you can watch it downloading stuff.

You can also look at the Progress view.

When it’s all done, you should have a docx4j project there and ready to go!


Troubleshooting

I don’t cover issues with git clone or maven here; just issues with Eclipse.

If Eclipse has a problem with your docx4j project, you’ll see an exclamation mark:

You can see further info in the Problems view; the most likely problem is that your Java is misconfigured:-

To fix this, on the docx4j project, click Alt-Enter to go into its properties.

Then click Java Build Path, then the Libraries tab.

Do you see a red cross next to JRE System Library, as above?

If so click on the JRE System Library entry to select it, then click the Remove button.

Next click Add Library, then JRE System Library, then add one (1.6 or above).

Note the warning:

That’s OK, we changed the JRE on the Java Build Path up above.

Hello World

Now you are ready to run some docx4j code.

A good place to start is to run CreateWordprocessingMLDocument.

Use docx4j in your own project

To use docx4j in your own project, there are 2 approaches:

  • the Maven way.  If you’re planning to use Maven, you just specify docx4j as a dependency, and if the version matches (look in pom.xml), it’ll use your docx4j project (assuming workspace resolution is switched on).  Please see hello-maven-central (if you’re already using a recent Eclipse, you can start at step 4) and/or docx4j-3-0-and-maven (but do use the version specified in pom.xml)
  • or, via the Java Build Path > Projects tab.

Feb 10 2015

High fidelity PDF output

This post introduces our new commercial component for docx to PDF output.

The background is that docx4j’s standard method of producing PDF output has been via XSL FO, using Apache FOP.

This has worked well enough for some docx4j users, but it has certain limitations which can bite you, for example lack of tab and tab stop support in XSL FO.

And because there are differences between FOP’s layout engine and Word’s, page breaks may fall in different places.

This means the FO based PDF output in docx4j is about as good as it’s going to get (short of enhancing the FO renderer).

To do better, we’ve had to invest in a non-FO approach, using layout algorithms specifically designed to give the same results Word does.

You can try it now.

A side benefit is that this new approach is much faster than the FO approach.

The component is actually independent of docx4j.  This means it’ll also work great if you need to convert docx to PDF from C# (without Word), Python, PHP etc.

Pricing is at plutext.com

Jan 20 2015

Content controls for business data connectivity

Sometimes, Word is a natural way for people to interact with back end applications (eg SAP).

This is particularly so when:

  • business data will be output to a Word document,
  • the user is more familiar with Word than the other system,
  • certain data updates may be required (and are permitted)

So maybe there are 4 high level categories:

  • apps which support commercial transactions (a recipient will receive a Word document), eg
    • employment onboarding (letter of employment)
    • invoicing
  • apps with a reporting component:  is the report format a natural interface if it was made bi-directional?
  • workflow/BPM systems which present documents (work orders, proposals, approvals etc)
  • policy/procedures in regulated industries, where a worker must follow a series of steps.  Can present that in a docx; the worker can tick the steps off as they do them
    • related training scenarios?

Microsoft had an emphasis on what they then called “Office Business Applications” back around the Office 2007 launch.  Fast forward to today, and “business connectivity services” are part of SharePoint.

But you can achieve the same sort of thing without SharePoint, using docx4j and data bound content controls.

Once you have your back end data in an XML format (and there are many tools/techniques to help with this), you can use content controls to bind what the user sees in Word to elements in that XML.

The beauty of it is that the binding is bi-directional, so if the user edits the document, the XML is updated (ie it stays in synch).

After the user has made their edit, you can update the back end application.  Typically, you’d do this after they saved & closed the document (ie outside Word, using docx4j), but you could also do it from within Word (a less good approach, but still, an option).

What if there is some data which the user shouldn’t be able to edit?  You simply lock the content control to prevent editing.

To quickly try out this approach, put together some sample XML, then upload it as explained here, to get a docx you can experiment with.

We’d love to hear about how you might use this approach.

Jan 20 2015

I have my XML, now what?

A barrier to using content control data binding has been getting a feel for what the solution might look like.  You have your XML data, but what do you do next?

The audience for this post is broader than docx4j users; it’s for anyone wanting to easily set up a template docx using XML data binding, just to get a feel for what it is like from an authoring/Word perspective.

How do you add content controls to your Word document, and map them to the XML data?

For docx4j purposes, you use an OpenDoPE Word AddIn (of which there are two you can try).

Alternatively, Word 2013 introduced the XML mapping task pane (and some additional file format features).  The docx linked in my previous blog post highlighted some usability issues with the XML mapping pane.

This post presents a “fast start” way you can try.  Simply upload your XML file here, and it will give you back a docx, with content controls mapped to that sample XML.

To take the following XML for an invoice as an example:

You’d get back the following Word 2013 docx (note that Word 2010 silently strips the repeats):

In “design mode”, it looks like:

Whether you’re in design mode or not, you can edit the content of a content control.  To satisfy yourself that things are working correctly, save the docx, unzip it, and look for your altered data in /customXml/item1.xml.
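If you’d rather check that from code than unzip by hand, a docx is just a zip file, so the JDK can read the custom XML part directly.  This is a minimal sketch; the class and method names are invented, it assumes the part is UTF-8 encoded, and it matches the entry name case-insensitively.  The tinyZip helper is only there so you can try the reader without a real docx.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: pulls one part out of a docx (zip) as text.
class CustomXmlPeek {

    /** Returns the named entry as a string, or null if the zip doesn't contain it. */
    static String readEntry(byte[] zipBytes, String entryName) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.getName().equalsIgnoreCase(entryName)) continue;
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                // read() returns -1 at the end of the current entry
                for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Builds a one-entry zip in memory, as a stand-in for a real docx. */
    static byte[] tinyZip(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }
}
```

With a real file, you’d pass in the bytes of your saved docx and ask for customXml/item1.xml (zip entry names have no leading slash).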

The idea is that once you have the basic docx, you can quickly/easily edit it in Word to make it prettier:

The tool recognises and uses the following content control types:

  • repeats: automatically detect repeats and set up suitable content control structures (we use a table if there’d be more than one column) fully populated with the XML data
    • note that these are Word 2013 repeatingItem structures, not OpenDoPE repeats.  (We recommend you use OpenDoPE repeats, but the tool creates Word 2013 repeatingItems, so you can see what Word 2013 users get out of the box).
  • pictures/images: will use a picture content control if the field contains base64 encoded data, or the string PICTURE or string IMAGE
  • checkbox: generated if value is true or false
  • escaped Flat OPC XML: a bound rich text content control will be inserted if the element contains the string FLAT-OPC or the string ?mso-application progid="Word.Document
  • date control: used if the element content is the string DATE
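To show how rules like these might fit together, here is a rough sketch of the value-based ones as a single function.  It is a guess at the shape of the logic, not the tool’s actual code: the names, the order of the checks, the plain text fallback and the base64 test (at least 64 characters, decodable) are all assumptions.  Repeats are left out, since they depend on the structure of the XML rather than on a single value.

```java
import java.util.Base64;

// Hypothetical classifier, loosely following the rules listed above.
class ControlTypeGuess {

    enum Kind { CHECKBOX, DATE, FLAT_OPC_RICH_TEXT, PICTURE, PLAIN_TEXT }

    static Kind guess(String elementText) {
        String v = elementText == null ? "" : elementText.trim();
        if (v.equals("true") || v.equals("false")) return Kind.CHECKBOX;
        if (v.equals("DATE")) return Kind.DATE;
        if (v.contains("FLAT-OPC") || v.contains("?mso-application progid=\"Word.Document")) {
            return Kind.FLAT_OPC_RICH_TEXT;
        }
        if (v.equals("PICTURE") || v.equals("IMAGE") || looksLikeBase64(v)) return Kind.PICTURE;
        return Kind.PLAIN_TEXT;   // assumed fallback
    }

    /** Crude heuristic (an assumption): longish, a multiple of 4 characters, and decodable. */
    static boolean looksLikeBase64(String v) {
        String compact = v.replaceAll("\\s", "");
        if (compact.length() < 64 || compact.length() % 4 != 0) return false;
        try {
            Base64.getDecoder().decode(compact);
            return true;
        } catch (IllegalArgumentException notBase64) {
            return false;
        }
    }
}
```

A heuristic like this can misfire (a long run of letters and digits could pass for base64), which is one reason the placeholder strings PICTURE and IMAGE are handy in sample data.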

Have fun.

Jan 17 2015

Word 2013 repeatingSection content controls – ready for prime time?

For developers wondering about Microsoft’s commitment to content controls, Office 2013 was certainly good news.

In Microsoft’s “What’s new for Word 2013 developers”, 2 of the 4 items were about content controls:

  • Enhancements to content controls
  • UI for XML mappings

And in http://blogs.office.com/2012/10/17/top-5-reasons-developers-will-love-word-2013/ 4 of the 5 reasons relate to content controls!

The MSDN article “What’s new with content controls in Word 2013” describes the changes in detail, but one of them was the introduction of repeating section content controls, which are comparable to OpenDoPE repeats (which docx4j has supported for ages).

The question is whether the time is ripe to migrate from OpenDoPE repeats to Word 2013 repeatingSection content controls?

My suggested answer is “no, or at least, not yet”, because

  1. Word 2010 strips out Word 2013 repeating content controls, and does so without warning!  Compare OpenDoPE repeats, which work in Word 2007, 2010 and 2013.  So until Word 2010 becomes irrelevant (or support is back ported), Word 2013 repeating content controls can’t be used in a generic solution.
  2. Word 2013 doesn’t handle the case of repeat zero times as you’d expect; it leaves a single instance, which will cause problems in many applications.

For authoring, the XML Mapping Pane in Word 2013 also leaves a bit to be desired.  For more details, please see w15RepeatingSection_cf_OpenDoPE.docx

Even so, docx4j 3.2.2 will support processing Word 2013 repeating content controls, for those who still choose to use them.

Dec 04 2014

Docx4jHelper Word AddIn

The dream:

  • View Open XML right from within Word, and see what happens when you edit it.
  • Or generate corresponding docx4j Java code, with deep links into the corresponding docx4j source code and Open XML spec.

Regular users of docx4j will be aware of our webapp, which amongst other things, generates docx4j Java code for the specified Open XML in your sample docx/pptx/xlsx.

The webapp is useful, but it has a few drawbacks:

  • you have to upload your docx/pptx/xlsx, which takes time
  • if your docx/pptx/xlsx contains sensitive data, you probably want to remove that first
  • the webapp might be down

To address these issues, we’re now offering the code gen functionality as a Word AddIn.

If you install the Word AddIn, this means you can now generate code without your docx leaving your computer.

This is all feasible because docx4j can run as a DLL in a .NET project, thanks to IKVM!

Where to get it

You can download the installer.  After you complete the landing form (using your corporate email address, not gmail etc), you’ll be sent a download link.

Getting Started

After a successful installation and a restart of Word, you should see a “Docx4j” menu, containing:

To generate code, first press the “Load Helper” button.

You’ll see the following form:

It’s inviting you to start a local web server which will run the same code as the existing webapp.  Just choose a port you aren’t already using.  If for some reason you want to browse using Internet Explorer (as opposed to your default browser), check the box.

It’ll take a little while to start the server; you’ll see a dialog when it’s started.

Now you can generate code.  To do so, select something in your docx, then click the “Generate Code” button.

After a while, a window will open in your web browser, and you’ll see:

That’s the view of the docx package, which will be familiar if you’ve used the webapp.   For how to generate code from here, see our earlier post.

Code generating is done on your computer.  (But note, the links on that page to docx4j source code and the OpenXML spec are external links)

What about the “Edit OpenXML” button?

If you select something in your docx, then click that button, after a while (maybe 30 secs the first time!), you’ll see the corresponding XML in an editor window:

You can go ahead and edit it, then click the “Apply” button.

If Word likes your XML, you’ll see your changes on the document surface.  Ctrl Z should work for undo.

So there are 2 ways to see the underlying XML

The first way we described uses your web browser; the second is a Windows Form.

These two views have different features; maybe a later release will unify them?

What about pptx, xlsx?

There’s no reason in principle we couldn’t make a similar AddIn for PowerPoint and Excel.  In fact, we plan to make these, once any teething issues have been ironed out in the Word AddIn.

In the meantime, for pptx and xlsx, you can continue to use the webapp.

Help, Suggestions and other Discussion

If you are a Plutext customer experiencing an issue, please email support@plutext.com

Otherwise, please check the Docx4jHelper AddIn forum.

We’ve got some ideas for where the AddIn goes from here, but we’d love to hear yours.

Oct 04 2014

Web-based docx editing?

Following on from the previous post on content tracking, some people have been asking about how to edit a docx in a web browser.

So I thought I’d link to a proof of concept we did a year or so ago.

The idea is:

  • use docx4j to convert the docx to XHTML
  • use CKEditor to edit that XHTML in the web browser
  • on submit, convert the XHTML back to docx content

The general problem with converting to/from XHTML is the “impedance mismatch”.  That is, losing stuff during round trip.  This will be a familiar problem to anyone who has ever edited a docx in Google Docs or LibreOffice.

This demo addresses that problem by identifying docx content which CKEditor would mangle, and then on submit/save, using the original docx content for those bits.

In this demo, the problematic content is replaced with visual placeholders, so you can see it is there.

The intent is that you can add/edit text content in the browser, without other document content (headers/footers, text boxes etc) getting lost.

To give it a try, go to the upload page and choose a docx file from your computer

You should see your docx open with the CKEditor toolbars above it:

(In the demo and screenshot above, the grey “B” image represents a bookmark)

Make some edits, then hit the Submit button (at the bottom).

The docx will be streamed back to your computer as a download in your browser.

Now open it in Word, and compare it to the original.


If you want to add this type of functionality to your application, please let us know by emailing jharrop@plutext.com

We’d love to hear:

  • a bit more about your use case,
  • where you see your users doing their web-based editing:- on your intranet, extranet, or the web at large?
  • what kind of editing? is it proof reading,  customising particular sections, a step in a workflow..?
  • do you need to cater for iPads or Android tablets?  And if so, is a dedicated app on your roadmap?
  • any additional requirements you might have!

Update (Oct 2015)

Source code is available at https://github.com/plutext/docx-html-editor