docx4java aka docx4j – OpenXML office documents in Java

Archive for the ‘docx4j’ Category

docx4j v2.6.0 released

November 19th, 2010 by Jason

I published docx4j 2.6.0 yesterday.

For details, see the forum. This post introduces TraversalUtil, which makes it easier for you to find and change the bits of a docx you want to manipulate.

If you are working with an existing docx, you often need to get a particular bit of the document, and change it somehow.

If you know you want to change the 6th paragraph, say, that’s easy.

But if you want to find all occurrences of some item, which could occur at various different levels of the hierarchy (for example, paragraphs can appear not just in the document body, but also within table cells, and in content controls)?

docx4j offers a couple of different tools to make this easy.

XPath

XPath is a succinct way to select the things you need to change.

Happily, from docx4j 2.5.0, you can do use XPath to select JAXB nodes:

MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

String xpath = "//w:p";

List<Object> list = documentPart.getJAXBNodesViaXPath(xpath, <strong>false</strong>);

These JAXB nodes are live, in the sense that if you change them, your document changes.

There is a limitation however: the xpath expressions are evaluated against the XML document as it was when first opened in docx4j. You can update the associated XML document once only, by passing true into getJAXBNodesViaXPath. Updating it again(with current JAXB 2.1.x or 2.2.x) will cause an error.

To workaround this bug in JAXB, you can marshall it, and then unmarshall the result using either:

    public org.docx4j.wml.Document unmarshal( java.io.InputStream is ) 

    public org.docx4j.wml.Document unmarshal(org.w3c.dom.Element el)

Both of those will re-create the binder.

Not the most efficient, so consider voting for JAXB bug 459

But now we have an alternative…

TraversalUtil

New to docx4j 2.6.0 is a class TraversalUtil, which is a general approach for traversing the JAXB object tree in the main document part (though it can also be applied to headers, footers etc).

For example, to get a list of hyperlinks, you can do something like:

PHyperlinkFinder finder= new PHyperlinkFinder();
new TraversalUtil(paragraphs, finder);

static class PHyperlinkFinder extends CallbackImpl {
			
        List<P.Hyperlink> links = new ArrayList<P.Hyperlink>();  
        	
        @Override
		public List<Object> apply(Object o) {
				
			if (o instanceof P.Hyperlink)
				links.add((P.Hyperlink)o);
				
			return null;
		}
	}

This approach is used extensively in the MergeDocx extension I discussed in my previous post. It is now also the basis of the OpenMainDocumentAndTraverse sample, so see that for another example of how to use it.

The example above simply finds relevant bits of the docx; you could also modify the objects encountered if you want.

Posted in docx4j, jaxb, OpenXML | Comments Off on docx4j v2.6.0 released

Merging Word documents

November 14th, 2010 by Jason

I’ve written a utility to merge docx documents in Java. “Merge” as in concatenate/join/append, as opposed to diff/merge (although docx4j does include code to do a diff, if you are looking for that instead).

With the utility, you can take 2 or more Word documents, and join them into one.

Edit Feb 2014. MergeDocx is now part of Plutext’s Docx4j Enterprise Edition.

As Eric White’s blog explained:

This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs. For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid. You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.

With this utility, merging/concatenating documents is as easy as invoking the method:

public  WordprocessingMLPackage merge(List&lt;WordprocessingMLPackage&gt; wmlPkgs)

In other words, you pass a list of docx, and get a single new docx back.

Edit March 2014. You can try the MergeDocx and/or MergePptx functionality via the demo webapp.

This utility takes care of the niggly edge cases for you:

You can also use my MergeDocx utility to process a docx which is embedded as an altChunk.

Without this utility, you had to rely on Word to convert the altChunk to normal content.

That meant you had to round trip your docx through Word, before docx4j could create a PDF or HTML out of it.

Now you don’t.

To process the w:altChunk elements in a docx, you invoke:

public WordprocessingMLPackage process(WordprocessingMLPackage srcPackage)

You pass in a docx containg altChunks, and get a new docx back which doesn’t.

But wait a minute .. if you can merge Word documents using this tool, why would you ever put an altChunk (containing a docx, as opposed to HTML) into the docx in the first place?

Ordinarily you wouldn’t, you’d just merge with this tool instead. But there are at least 2 possibilities:

some upstream process put the altChunk there, and now you want to process it in docx4j
OpenDoPE. The Open Document Processing Ecosystem convention is being extended in a v2.3 to allow other documents to be injected, and a natural thing is to convert an injection instruction to an altChunk. Edit Feb 2014: docx4j 3.0.1 can also bind an XML element containing a base64 encoded docx, inserting it into the docx as an AltChunk. MergeDocx can then convert that content into “real” docx content, suitable for including in a table of contents, or generating HTML or PDF. The binding is two-way, so user edits in Word can be injected back into the XML (eg for persisting to a database).

There is one place my code differs significantly from how Word processes an altChunk, and that is in section handling. When Word processes an altChunk, it seems to largely remove sectPr. So for example, columns will disappear. But it also might merge headers, so the resulting header contains stuff from the headers of both documents! My code doesn’t do that: by default, it includes each section, and headers go with sections.

Posted in docx, docx4j, Microsoft Word, OpenXML | Comments Off on Merging Word documents

docx4j v2.3.0 released

February 23rd, 2010 by Jason

I’m pleased to announce the release of docx4j v2.3.0

docx4j is an open source (Apache license) project which facilitates the manipulation of Microsoft OpenXML docx (and now pptx) documents in Java, using JAXB.

The main features of this release are support for pptx files, and improvements to HTML export (via NG2), and PDF export (via XSL FO).

For further details, please see the release announcement.

Posted in docx, docx4j, jaxb, Microsoft Word, OpenXML | Comments Off on docx4j v2.3.0 released

docx4j v2.1.0 released

November 11th, 2008 by Jason

We’re pleased to announce that we’ve released v2.1.0 of docx4j. Get it from our downloads page.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format in Word 2007 in Microsoft Office 2007, and part of an ISO standard (more or less unchanged).

v2.1.0 is mainly a maintenance release.

Attention has been paid to ease of use of hyperlinks, images, and headers/footers.

The HTML output has been redone to use the XSLT from the OpenXMLViewer project; it can be configured to save images as files, and automatic list numbers are handled.

This release should also work under Java 1.5, now that I have re-built fop-fonts. I had contributed TTC (true type collection) handling code to FOP, and it was accepted, so fop-fonts now uses that (ie the patch which makes fop-fonts is that much smaller).

Posted in docx, docx4j, OpenXML | Comments Off on docx4j v2.1.0 released

docx4j v2.0 released

July 22nd, 2008 by Jason

We’re pleased to announce that we’ve released v2.0 of docx4j.

docx4j supports the following:

Open existing docx (from filesystem, SMB/CIFS, WebDAV using VFS)
Create new docx (just one line of code)
Programmatically manipulate the docx document (of course), including tables, images
Import a binary doc (proof of concept)
Import/export Word 2007’s xmlPackage (pkg) format
Save docx to filesystem as a docx (ie zipped), or to JCR (unzipped)
Apply transforms, including common filters
Export as HTML or PDF
Diff/compare paragraphs or sdt (content controls), outputting OpenXML with changes marked up
Font support (font substitution, and use of any fonts embedded in the document)
Use the power of JAXB to do other cool stuff

Get it from here.

What is it about this release that warrants being labeled v2.0?

The new features include image support, diff, and xmlPackage. A factor is the version numbering convention Microsoft has chosen for their Open XML SDK: its v2.0 which will first contain an API for WordprocessingML.

So think of a “level 1” API as one which handles the Open Packaging conventions (basically, the unzipping step), but leaves you to handle the document (part) content using low level XML (DOM, SAX, etc).

A “level 2” API is one which gives you a higher level API to manipulate the part content. At the very least, this would include objects to represent paragraphs, tables, styles etc. But you’d also expect it to be easy, for example, to add a paragraph using a specified style (maybe this is “level 3”? In any case, docx4j can do it)

Given that docx4j brought a “level 2” WordML API to the Java world 6 months ago, it is appropriate that it be labelled version 2.0.

Posted in docx, docx4j, jaxb | Comments Off on docx4j v2.0 released

docx4j now released under Apache License

April 10th, 2008 by Jason

We’re pleased to announce that docx4j is now available under the Apache License (v2).

This is a response to feedback on an earlier post. This is also the last license change we’ll be making to docx4j. Word documents are mostly manipulated in corporate environments. This change removes barriers to adoption of docx4j by business and institutions.

docx4j uses org.merlin.io to efficiently turn streams inside out. That package had been available under the GPL. Its author, Merlin Hughes, today kindly released it under v2 of the Apache License, so we now use it under that license.

There’s a new nightly build of docx4j available from the downloads page if you want to grab it. This build can load/save to/from a WebDAV server – more on that in another post.

Posted in docx, docx4j, OOXML | Comments Off on docx4j now released under Apache License

Click to try docx4all

February 21st, 2008 by Jason

We’ve now got a proof of concept of docx4all, our cross-platform WYSIWYG docx editor, ready for you to try. Here is the launch page to run it from your browser. Give it a try in Linux, XP, Vista, or (if you are game) OSX.

This proof of concept includes:

file: new | open | save
text formatting (font, size, bold, italics, underline)
paragraph formatting (alignment)
cut/copy/paste
styles
printing

There is still a lot to do, but with the introduction this week of support for styles, we’ve got a basic feature set which you can use to do actual work (not that we’d recommend that just yet) – assuming you can live without tables (for the moment at least).

It is set up to run offline – it should offer to create a shortcut or a menu item so you can run it again later.

Let us know what you think of it, here or in the forums. Cheers.

Posted in docx4j | Comments Off on Click to try docx4all

.docx to HTML or PDF using Java

January 13th, 2008 by Jason

Doug Mahugh recently mentioned someone using the DocX2Html.xsl that ships with SharePoint to preview DOCX files in HTML.

As it happens, we’ve just implemented HTML and PDF output in docx4j using a similar approach. We’re using the earlier WordML2HTML XSLT stylesheet available from Oleg Tkachenko. (It would be great if Microsoft also made the presumably newer DocX2Html.xsl that ships with SharePoint freely available).

To create the HTML, we use Sun’s xhtmlrenderer (thanks Sun!). See the obligatory tutorial.

To create the PDF, we take the HTML, and run it through Sun’s pdf-renderer (thanks again, Sun). And again, the tutorial.

The icing on the cake is the PDF Viewer which comes with pdf-renderer. That will give us print preview and printing in docx4all.

Finally, thanks Lars for bringing pdf-renderer to my attention.

Posted in docx, docx4j | Comments Off on .docx to HTML or PDF using Java

Howto: create a new document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically create a new document from scratch.

Posted in docx, docx4j | Comments Off on Howto: create a new document with docx4j

Tutorial: opening an existing document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically open and edit an existing document.

Posted in docx, docx4j, OOXML | Comments Off on Tutorial: opening an existing document with docx4j