Archive for the ‘docx’ Category

docx4j in a single page

May 15th, 2013 by Jason

Here’s a single A4 page reference/overview of docx4j (aka a cheat sheet), in PDF or PNG format.

This one is focused on docx files (WordprocessingML).

I’ll create something similar for pptx and xlsx over coming days.

docx4j/pptx/xlsx online code generation

May 15th, 2013 by Jason

Just launched is http://webapp.docx4java.org

You should be able to see it in the menu at the top right of this website (if not, reload the web page…).

There are three things you can do with it right now:

• Explore your docx/pptx/xlsx and its representation in docx4j

• Convert docx to PDF or XSL FO

• Merge docx files (eg cover letter plus contract) into a single docx, using Plutext’s MergeDocx. Or the same thing for pptx files, using MergePptx.

Here I want to focus on the first of these.

After you’ve uploaded your docx/pptx/xlsx, the first thing you see is like docx4j’s PartsList sample:

Here, I’ll click in the left hand column to look at the main document part, document.xml

When I do that, I see the XML:

No surprises there.

But notice the hyperlinks.  Here I’ll just click on the first w:p.

What you get back is Java source code to create that complete structure:

As you can see from the image above, both styles of code (as described in docx4j’s Getting Started document) are produced for you.  With a bit of luck, you can cut/paste either into your IDE (Eclipse or whatever), and just run with it!

To actually see the created object in an Office document, you’ll still need to add the created object to a part.  See Getting Started, or the cheat sheet for how to do that.

I hope this helps you to create/modify your Office documents more efficiently with docx4j!

Do let us know what you think in the comments, or in docx4j’s forums.

docx4j 2.7.0 released

July 8th, 2011 by Jason

I’m pleased to announce the release today of docx4j 2.7.0.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx.  It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET.  It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

  • Improvements to Maven build
  • ContentAccessor interface
  • AlteredParts: identify parts in this pkg which are new or altered; Patcher
    which adds new or altered parts.
  • Support for .glox SmartArt package (/src/glox/)
  • JAXB RI 2.2.3 compatibility
  • OpenDoPE support improvements

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Check out the source from SVN (use the pom.xml file to satisfy the dependencies, eg with m2eclipse, or download them from one of the links above).

Maven: Please see forum for details (since XML doesn’t paste nicely here right now).

Dependency changes

Antlr is now required for OpenDoPE processing; this gives us better XPath processing.  The required jars are:

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

A request to docx4j users

If you are happily using docx4j, it would be great if you could reply to this post with some words of recommendation for others who might be wondering whether docx4j is a good choice. I know there are thousands of you out there :-)

Some users have been kind enough to make such statements already; these may be found on the trac homepage.

Of course, there are a number of other ways you can contribute back.  Please consider doing so, especially if you think you might find yourself looking for support from volunteers in the docx4j forums.

Merging Word documents

November 14th, 2010 by Jason

I’ve written a utility to merge docx documents in Java.  “Merge” as in concatenate/join/append, as opposed to diff/merge (although docx4j does include code to do a diff, if you are looking for that instead).

With the utility, you can take 2 or more Word documents, and join them into one.

Edit Feb 2014. MergeDocx is now part of Plutext’s Docx4j Enterprise Edition.

As Eric White’s blog explained:

This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs. For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid. You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.
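To make the ‘referential integrity’ point concrete, here is a toy model in plain Java.  It is only a sketch, and it is not how MergeDocx is implemented: ToyDoc, merge and demo are hypothetical names invented for the illustration, and a real docx has many more kinds of cross-reference than comments.  The idea it shows is that two documents can both use comment id 0, so concatenating them means renumbering the comments and rewriting every paragraph reference to match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReferentialIntegritySketch {

    /** Toy stand-in for a docx: a comments "part" plus paragraphs that may reference a comment id. */
    static class ToyDoc {
        final Map<Integer, String> comments = new HashMap<>();
        // One entry per paragraph: the comment id it references, or null if none.
        final List<Integer> paragraphCommentRefs = new ArrayList<>();
    }

    /** Concatenates the documents, renumbering comment ids so references stay valid and unique. */
    public static ToyDoc merge(List<ToyDoc> docs) {
        ToyDoc result = new ToyDoc();
        int nextId = 0;
        for (ToyDoc doc : docs) {
            // Old id -> new id, for this source document only.
            Map<Integer, Integer> renumbered = new HashMap<>();
            for (Map.Entry<Integer, String> comment : doc.comments.entrySet()) {
                renumbered.put(comment.getKey(), nextId);
                result.comments.put(nextId, comment.getValue());
                nextId++;
            }
            // Rewrite each paragraph's reference using the mapping above.
            for (Integer ref : doc.paragraphCommentRefs) {
                result.paragraphCommentRefs.add(ref == null ? null : renumbered.get(ref));
            }
        }
        return result;
    }

    /** Two documents which both use comment id 0. */
    public static ToyDoc demo() {
        ToyDoc a = new ToyDoc();
        a.comments.put(0, "first");
        a.paragraphCommentRefs.add(0);

        ToyDoc b = new ToyDoc();
        b.comments.put(0, "second");
        b.paragraphCommentRefs.add(0);

        List<ToyDoc> docs = new ArrayList<>();
        docs.add(a);
        docs.add(b);
        return merge(docs);
    }

    public static void main(String[] args) {
        ToyDoc merged = demo();
        System.out.println(merged.comments + " " + merged.paragraphCommentRefs);
    }
}
```

Without the renumbering step, the second document’s paragraph would end up pointing at the first document’s comment.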

With this utility, merging/concatenating documents is as easy as invoking the method:

public  WordprocessingMLPackage merge(List<WordprocessingMLPackage> wmlPkgs)

In other words, you pass a list of docx, and get a single new docx back.

Edit March 2014. You can try the MergeDocx and/or MergePptx functionality via the demo webapp.

This utility takes care of the niggly edge cases for you:

You can also use my MergeDocx utility to process a docx which is embedded as an altChunk.

Without this utility, you had to rely on Word to convert the altChunk to normal content.

That meant you had to round trip your docx through Word, before docx4j could create a PDF or HTML out of it.

Now you don’t.

To process the w:altChunk elements in a docx, you invoke:

public WordprocessingMLPackage process(WordprocessingMLPackage srcPackage)

You pass in a docx containing altChunks, and get a new docx back which doesn’t.

But wait a minute .. if you can merge Word documents using this tool, why would you ever put an altChunk (containing a docx, as opposed to HTML) into the docx in the first place?

Ordinarily you wouldn’t, you’d just merge with this tool instead.  But there are at least 2 possibilities:

  • some upstream process put the altChunk there, and now you want to process it in docx4j
  • OpenDoPE.  The Open Document Processing Ecosystem convention is being extended in a v2.3 to allow other documents to be injected, and a natural thing is to convert an injection instruction to an altChunk.  Edit Feb 2014: docx4j 3.0.1 can also bind an XML element containing a base64 encoded docx, inserting it into the docx as an AltChunk.  MergeDocx can then convert that content into “real” docx content, suitable for including in a table of contents, or generating HTML or PDF.  The binding is two-way, so user edits in Word can be injected back into the XML (eg for persisting to a database).

There is one place my code differs significantly from how Word processes an altChunk, and that is in section handling.  When Word processes an altChunk, it seems to largely remove sectPr.  So for example, columns will disappear.  But it also might merge headers, so the resulting header contains stuff from the headers of both documents!  My code doesn’t do that: by default, it includes each section, and headers go with sections.

docx4j v2.3.0 released

February 23rd, 2010 by Jason

I’m pleased to announce the release of docx4j v2.3.0

docx4j is an open source (Apache license) project which facilitates the manipulation of Microsoft OpenXML docx (and now pptx) documents in Java, using JAXB.

The main features of this release are support for pptx files, and improvements to HTML export (via NG2), and PDF export (via XSL FO).

For further details, please see the release announcement.

How to try Plutext for yourself

March 3rd, 2009 by Jason

Here is a screencast which walks you through sharing your own document, and trying our collaboration features:


Of course, you can just play with one of the pre-existing shared documents.

The video width is 1280 pixels, so if you are browsing in a narrow window, you’ll need to expand your browser window to see it properly.  (Everybody has screens that wide these days, don’t they, unless they are mobile?)

For completeness:

Plutext collaboration for Word: new features

March 2nd, 2009 by Jason

We’ve just published a new build of the Word Add-In, which among other things, supports replication between users of images and comments.

For a good while now, with Plutext you’ve been able to be in a Word document at the same time as your co-workers – provided all you were doing was working on tables and paragraphs (editing them, inserting, deleting or moving them around).

With this latest release, you can add images and Word comments, and have them replicate properly between Word 2007 users.

Here is a screencast of this in action:


If you want to play with this yourself, you can download our Word Add-In and give it a shot!

For username & password, please see here. The password is “tester”.

For detailed instructions, see this PDF, or this earlier screencast.

If you’d like to chat about your own Plutext installation, please contact us using this form.

collaborate on a Word doc with docx4all

November 16th, 2008 by Jason

docx4all has now reached the point where you can collaborate happily with a Word user, both working on the document at the same time.

This screencast shows a docx4all user and a Word user doing that:


docx4all will work on any platform if you have Java 6 installed – including Windows, OSX, or Linux.

You can try collaborating now, in your web browser by clicking here (warning: ~10 MB).  The download is of course one-time.  Next time, it will start quicker.

That link takes you to the docx4all applet, which does collaboration in your web browser.

You can also run docx4all as a desktop application – the functionality is identical.

The nice thing about the docx4all experience is that with just one-click you can be collaborating. Ok, a couple of clicks – one to start docx4all, and another to do File > Open.

Because all changes are versioned, from the Plutext menu you can see:

  • a history of all the changes which have been made to a given content control
  • a version of the document showing the most recent change to each paragraph

docx4j v2.1.0 released

November 11th, 2008 by Jason

We’re pleased to announce that we’ve released v2.1.0 of docx4j.  Get it from our downloads page.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format in Word 2007 in Microsoft Office 2007, and part of an ISO standard (more or less unchanged).

v2.1.0 is mainly a maintenance release.

Attention has been paid to ease of use of hyperlinks, images, and headers/footers.

The HTML output has been redone to use the XSLT from the OpenXMLViewer project; it can be configured to save images as files, and automatic list numbers are handled.

This release should also work under Java 1.5, now that I have re-built fop-fonts.  I had contributed TTC (true type collection) handling code to FOP, and it was accepted, so fop-fonts now uses that (ie the patch which makes fop-fonts is that much smaller).

docx4j v2.0 released

July 22nd, 2008 by Jason

We’re pleased to announce that we’ve released v2.0 of docx4j.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format in Word 2007 in Microsoft Office 2007.

docx4j supports the following:

  • Open existing docx (from filesystem, SMB/CIFS, WebDAV using VFS)
  • Create new docx (just one line of code)
  • Programmatically manipulate the docx document (of course), including tables, images
  • Import a binary doc (proof of concept)
  • Import/export Word 2007’s xmlPackage (pkg) format
  • Save docx to filesystem as a docx (ie zipped), or to JCR (unzipped)
  • Apply transforms, including common filters
  • Export as HTML or PDF
  • Diff/compare paragraphs or sdt (content controls), outputting OpenXML with changes marked up
  • Font support (font substitution, and use of any fonts embedded in the document)
  • Use the power of JAXB to do other cool stuff

Get it from here.

What is it about this release that warrants being labeled v2.0?

The new features include image support, diff, and xmlPackage.  A factor is the version numbering convention Microsoft has chosen for their Open XML SDK: it’s v2.0 which will first contain an API for WordprocessingML.

So think of a “level 1” API as one which handles the Open Packaging conventions (basically, the unzipping step), but leaves you to handle the document (part) content using low level XML (DOM, SAX, etc).
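To give a feel for what working at “level 1” means, here is a rough sketch using only the JDK (java.util.zip and DOM), with no docx4j involved.  The class and method names are made up for the illustration, and the in-memory zip it builds holds only word/document.xml, so it is not a complete, valid docx.  The point is simply that after the unzipping step you are on your own with raw XML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class Level1Sketch {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Builds a tiny zip in memory containing only word/document.xml (a toy, not a full docx). */
    public static byte[] buildToyPackage() throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
            + "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** "Level 1": locate the part inside the zip, then use plain DOM to look at its content. */
    public static int countParagraphs(byte[] pkg) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                    dbf.setNamespaceAware(true); // needed to match on the w: namespace
                    Document doc = dbf.newDocumentBuilder().parse(zip);
                    // Count w:p elements by namespace and local name.
                    return doc.getElementsByTagNameNS(W_NS, "p").getLength();
                }
            }
        }
        throw new IllegalStateException("no word/document.xml in package");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countParagraphs(buildToyPackage()));
    }
}
```

Even just counting paragraphs means knowing the part name and the namespace URI; anything more ambitious (styles, tables, images) quickly becomes a lot of hand-written DOM code.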

A “level 2” API is one which gives you a higher level API to manipulate the part content.  At the very least, this would include objects to represent paragraphs, tables, styles etc.  But you’d also expect it to be easy, for example, to add a paragraph using a specified style (maybe this is “level 3”?  In any case, docx4j can do it).

Given that docx4j brought a “level 2” WordML API to the Java world 6 months ago, it is appropriate that it be labelled version 2.0.