Archive for December, 2007

docx4j trunk now uses JAXB

December 22nd, 2007 by Jason

10 days ago, we created a proof of concept for using JAXB on a subset of wml.xsd (one of the OpenXML schema files).

We’ve declared that a success, and moved it from a branch into the trunk of docx4j. Here be the generated classes.

plutext-server has now been migrated to use it.

And Jo is working with it as he codes docx4all.

So we’re pretty committed at this point!

We’re tidying up bits of the object model as we go (i.e. editing our xsd to generate Java that we like). So far, paragraphs (p, pPr, r, rPr, t) and structured document tags (sdt, sdtPr, sdtContent) have had our attention.

We’re also making a few changes to the generated classes, so we need to think about how best to prevent those changes from getting lost when the classes are re-generated. There’s a bit of support in XJC for this, and diff may come in handy, but I’d love to hear best practices.

What we have now is an object model for key pieces of the Main Document part (document.xml), in the package org.docx4.jaxb.document. Next cab off the rank is the Styles part, which we’ll put in org.docx4.jaxb.styles.

“View Page Source” from within Word 2007

December 17th, 2007 by Jason

When developing software which uses WordprocessingML, you often need to look at the XML.

Wouter’s Package Explorer is a great way to do this, particularly if you want to look at an existing file.
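If you would rather pull the XML out of an existing file from code, here is a minimal Java sketch. It relies on a docx being a ZIP package and assumes the Main Document part sits at the conventional entry name word/document.xml and is UTF-8 encoded (strictly, the location comes from the package relationships). The class and method names are hypothetical, made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: dumps one part of a docx package as text.
public class DocxPartDump {

    /** Returns the text of the named part, or null if the package has no such entry. */
    static String readPart(byte[] docxBytes, String partName) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!entry.getName().equals(partName)) {
                    continue; // not the part we are looking for
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                    out.write(buf, 0, n);
                }
                // Assumption: the part is UTF-8 encoded.
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file.
        byte[] docx = Files.readAllBytes(Paths.get(args[0]));
        // Assumption: the main document part lives at the conventional location.
        System.out.println(readPart(docx, "word/document.xml"));
    }
}
```

Run it with the path to a docx as the only argument. The XML comes out unformatted, so a viewer is still nicer for browsing.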

Wouldn’t it be great (well, at least a little bit useful), if you could look at the WordML for a document from within Word? Then you could quickly see the WordML produced when you do something in Word (format some text, create a table, add a comment etc).

ActiveDocument.WordOpenXML provides the OpenXML corresponding to the document. plutext-client-word2007 uses this extensively in C#.

We can also use it in VB from within Word to open the XML in an Internet Explorer window, syntax highlighted and with collapsible sections (similar to IE’s default stylesheet for XML documents).

The result:

[Screenshot: word2007-viewpagesource.png]

The very straightforward code to do this can be cut/pasted from here (use the “download in other formats” links at the bottom of the page). In Word, choose Developer menu > Visual Basic to open Word 2007’s Visual Basic IDE. You can then just paste the code into a new module. Create or open a document, then run the VB. That’s all there is to it.

I specifically chose to do it using VB and not VSTO, so you don’t need Visual Studio installed to get this running.

Also, I cobbled this code together quickly, and I know it can be improved. If you’d like me to incorporate your improvements, please feel free to send them in!

Running a community – lessons from jaxb.dev.java.net

December 12th, 2007 by Jason

As described in my last post, we’re experimenting with using JAXB to unmarshall/marshall docx documents.

The specification is thorough, and the reference implementation (v2.1.5) seems to work well.

Unfortunately, the same can’t be said of jaxb.dev.java.net.

Given that one of my hats is to develop a community around the plutext projects, I’m trying to be aware of what helps or hinders this process.

So in the spirit of constructive criticism (I’d really like to see momentum grow around JAXB-RI), here are some observations:

  1. there are at least two places to go to for discussion (the mailing lists, and the Metro and JAXB forum). Where should you post? Which is going to get the better response? Why two options? In this case, the forum seems more active.
  2. it’s much harder than it needs to be to get the source code. There is no anonymous CVS (or SVN) access. You need to be registered, and to have applied for the Observer role. Then the instructions omit the cvs login step. Eventually it worked, but in the meantime, it took a bit of digging to find a link to the zipped up sources. There are outdated blog entries to disregard along the way.
  3. once you do have the source code, and given that JDK 1.6 introduced JAXB 2.0 in rt.jar, there should be prominent instructions for using 2.1 in Eclipse (i.e. use JDK 1.5)
  4. I couldn’t find JAXB 2.1.5 in Maven repositories. Again, outdated blog entries.
  5. the website is pretty slow

Now, none of these problems will stop the determined user. But I’m sure their cumulative effect is to make many others give up.

For those like me who try to get a quick sense of how active a project is by looking at the volume of traffic on the mailing list or forum before making any further commitment, problem #1 above amounts to bad marketing if nothing else.

This is a pity, because as I said, JAXB 2.1.5 is good stuff.

Docx4j branch: Using JAXB to unmarshall OOXML to Java

December 12th, 2007 by Jason

docx4j contains classes which represent key parts of a WordprocessingML docx.

For example, we have a paragraph class to represent the p element; another to represent a run, etc.

Each class knows how to unmarshall its docx XML representation, and marshall it again.

It will create specialised objects for things it knows how to handle (for example the paragraph content collection contains run objects). For XML we don’t have strongly typed objects for, the class will simply store that XML node, so that it can be round tripped.
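To make the round-tripping idea concrete, here is a minimal sketch using only the DOM classes that ship with the JDK. It is not docx4j’s actual code: the class names (RoundTripSketch, Paragraph, Run) are hypothetical, and the run is simplified to carry text only, so a real run class would also need to keep things like rPr. Runs become typed objects; every other child of the paragraph is stored as a raw node and written back unchanged.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration of "typed where known, raw node otherwise".
public class RoundTripSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Typed object for w:r. Simplification: only the run's text is modelled. */
    static class Run {
        String text;
        Run(String text) { this.text = text; }
    }

    static class Paragraph {
        // Each entry is either a Run (understood) or a DOM Node (kept as-is).
        final List<Object> content = new ArrayList<>();

        static Paragraph unmarshal(Element p) {
            Paragraph para = new Paragraph();
            for (Node child = p.getFirstChild(); child != null; child = child.getNextSibling()) {
                boolean isRun = child.getNodeType() == Node.ELEMENT_NODE
                        && W_NS.equals(child.getNamespaceURI())
                        && "r".equals(child.getLocalName());
                if (isRun) {
                    para.content.add(new Run(child.getTextContent()));
                } else {
                    para.content.add(child); // unknown markup: store the node
                }
            }
            return para;
        }

        Element marshal(Document doc) {
            Element p = doc.createElementNS(W_NS, "w:p");
            for (Object item : content) {
                if (item instanceof Run) {
                    Element r = doc.createElementNS(W_NS, "w:r");
                    Element t = doc.createElementNS(W_NS, "w:t");
                    t.setTextContent(((Run) item).text);
                    r.appendChild(t);
                    p.appendChild(r);
                } else {
                    // Copy the stored node back so it survives the round trip.
                    p.appendChild(doc.importNode((Node) item, true));
                }
            }
            return p;
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed to match on namespace + local name
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:bookmarkStart w:id=\"0\" w:name=\"intro\"/>"
                + "<w:r><w:t>Hello</w:t></w:r></w:p>";
        Document doc = parse(xml);
        Paragraph para = Paragraph.unmarshal(doc.getDocumentElement());
        for (Object item : para.content) {
            System.out.println(item instanceof Run ? "typed run" : "raw node kept for round trip");
        }
        Element out = para.marshal(doc);
        System.out.println("children written back: " + out.getChildNodes().getLength());
    }
}
```

Keeping the unknown content as DOM nodes in the same list as the typed objects preserves document order, which matters when known and unknown elements are interleaved.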

Instead of coding these classes one by one by hand, we wanted to see whether one of the Java-XML data binding frameworks could make our lives easier.

Given there is a standard for doing this (JSR 222 – The Java Architecture for XML Binding), we tried the JAXB reference implementation (JAXB 2.1.5).

The JAXB web presence leaves a lot to be desired. I’ll write a post on that shortly.

Having said that, I’m quite impressed with the spec and the reference implementation.

You feed your schema into xjc, and it generates Java classes.

The @XmlAnyElement annotation allows unknown elements to be round tripped, mimicking our existing code.

Why would there be any unknown elements you ask?

The answer is that we are using a subset of the wml.xsd schema from TC45. So there can be a lot of stuff in a docx document which falls outside the subset.

There are a number of reasons we are using a subset:

  1. running the entire schema through XJC produces lots of errors, both in the parsing phase, and once you overcome those, in the compiling stage
  2. more importantly, we’re unlikely to ever implement the entire WordML spec. So it makes sense to work with the subset of key features which are on our roadmap.
  3. you have to add annotations to the schema to ensure the resulting Java classes use names which make sense (this is called customizing the binding).

Anyway, this approach seems to work well. That is:

  • the JAXB version can read a Word document, edit it, and save it again, and Word 2007 can consume the result. See sample.java
  • the resulting classes can be made quite intuitive (though there is more tweaking to do)
  • unknown elements can be round-tripped

The JAXB version of docx4j is in subversion at the following branch:

http://dev.plutext.org/trac/docx4j/browser/branches/jaxb

You can’t just check out the branch and use it right now, since classes need to be generated. There are maybe 50 generated, but I have only committed 3 of them.

Where to from here?

If this approach continues to look promising, we are likely to move the JAXB code into the trunk, and upgrade plutext-server and docx4all to use it.

docx4all now in subversion

December 6th, 2007 by Jason

I’m excited to say that today Jojada uploaded his work to date on docx4all to subversion.

Docx4all is our open source word processor which uses OOXML WordprocessingML as its native document format. Like our other projects, we’re releasing it under a GPL (in this case v3).

We intend it to run wherever Swing runs, and both from the desktop and within a web browser.

Docx4all is a thoroughly modern Swing application, in that it’s built on JavaFX Script and the Swing App Framework.

Here is a screenshot of a simple document rendered in it, running on Ubuntu:

docx4all v0.1 screenshot

It’s very early days yet. As you can see from the screenshot, docx4all can render simple paragraph content. But you can’t actually edit yet. That will change before Christmas.

The philosophy we’re taking is that if docx4all encounters any WordML markup which it doesn’t understand, it should preserve it (i.e. round trip it).

You can see in the screenshot that sectPr currently falls into that category. As I said, it’s very early days!

But we wanted to get docx4all out there, so that anyone who’d like to work on it is able to get started.

The wiki contains instructions for building docx4all. Let us know how you go in the forums.

Why are we doing this, anyway?

December 5th, 2007 by Jason

The plutext solution enables many users to work on the one Word document at the same time.

Why would you want to do that?

The way we put it in the Wiki:

  1. Get documents finished ahead of deadline. Sales proposals, contracts, reports. Our focus is real time simultaneous collaboration – two or more people working on the document at the same time.
  2. Plutext allows you to continue to use Microsoft Word as your editing environment. You know how to use Word (at least until you installed Office 2007 anyway..).
  3. So you can format the document using Microsoft Word. If you did your collaboration in Google Docs, chances are you’ll have to bring it back into Word to make it pretty. Our collision handling is nicer too.
  4. Work offline. It’s Word, after all.
  5. Word’s docx is our native document format. So there is 100% fidelity. No numbering going haywire.
  6. Complete version history / audit trail.
  7. Don’t have Word? Coming soon … Use docx4all, our WYSIWYG docx editor – on a Mac, on Linux etc.
  8. Oh, and it’s open source. All GPL 3 (Affero GPL 3 in the case of the server side bits). Use our server (developers only for now), or build your own.

wiki content – getting started with the Plutext Word 2007 add-in

December 5th, 2007 by Jason

A quick post to flag that there is some good content in the wiki now to help developers get started with the Word 2007 client: