Jun 16 2015

Off topic: Eclipse’s maven from a command line?

You’ve installed Eclipse.

Eclipse includes maven (m2e).

Can you use that Maven from outside Eclipse, or do you need to install maven again/separately?

It turns out you can use it.  Whether its worth the effort or not is another question…

You launch maven using the plexus classworlds launcher.

That needs a config file.

The config file (call it m2.conf) contains something like:

main is org.apache.maven.cli.MavenCli from plexus.core

set maven.home default /home/jason/.m2

[plexus.core]
load /home/jason/eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*.jar

With that, from a project dir, the following is the equivalent of ‘mvn install’:

java -cp ../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/plexus-classworlds-2.5.1.jar:../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar  “-Dclassworlds.conf=m2.conf” org.codehaus.plexus.classworlds.launcher.Launcher install

You could make a shell script to do that.  And you could base the shell script on the one included in the maven distribution.  Or more sensibly, you’d just download  install and use  maven proper.

If you need it.  Right clicking on your project in Eclipse then Run As gives you a handy UI for maven-related stuff:

Finally, note it is  possible to avoid the plexus classworlds launcher by invoking MavenCli directly:

java -cp “../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar:../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*” org.apache.maven.cli.MavenCli install

If you are going to use that, you might wonder about setting maven.home with -Dmaven.home=/your/path/to/where immediately after the classpath

Jun 16 2015

docx4j from GitHub in Eclipse – 5 years on

In May 2012 we posted docx4j-from-github-in-eclipse. That was 5 years ago now, so its about time to update that walkthrough :-)

This post is about getting the docx4j source code setup in Eclipse, so you can not just use it, but easily study it as well (and submit pull requests!).  If you have no interest or need to do that, please see hello-maven-central (if you’re already using a recent Eclipse, you can start at step 4) and/or docx4j-3-0-and-maven (but do use our current version 3.2.x)

Preliminaries – JDK

Make sure you have the JDK installed; Java 6 or later.  The JRE alone is not enough, since it doesn’t include a compiler (javac).

Preliminaries – Eclipse

Install Eclipse.  These days, the basic package has everything you need (ie git and maven support):

Git & GitHub

GitHub is docx4j’s authoritative source repository.  Eclipse now includes a git client. (If you have an older Eclipse, you can install eGit) However, it is still handy to have other git clients installed:

  • on Linux (listed first, given git’s provenance), install git using your distribution’s package manager
  • on Windows, the Git BASH shell is handy; as is Atlassian’s SourceTree
  • on OSX, ditto

Clone or Fork?

With Git, there is a difference between cloning and forking.

  • Cloning gives you a copy of the source code you can work on, but without more, no easy way to contribute changes back.
  • Forking sets you up with the source code, and makes it easy to contribute changes back.

If you think you might be making changes to the docx4j source code, you’re probably best to create a fork on GitHub right from the start.

To create a fork, log in to GitHub, visit https://github.com/plutext/docx4j then press the “Fork” button.

Choose your poison

There are 3 steps to installing docx4j:

  1. clone the docx4j repo
  2. install its dependencies
  3. install docx4j project in Eclipse

You can do these 3 steps entirely within Eclipse, but Eclipse by default doesn’t give much feedback as to what its doing, so you might wonder whether its still working properly.

Since its just as easy (or easier) to use the command line, I’ll show that way first:-

Command Line Approach

To do it this way, you’ll need:

  • a git shell, and
  • Maven

Both of these are worth having in any case.

Step 1. To clone docx4j from your git shell, use the github URL for docx4j (your fork or Plutext’s):

$ git clone -b master –single-branch https://github.com/plutext/docx4j docx4j
Cloning into ‘docx4j’…
remote: Counting objects: 42008, done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 42008 (delta 23), reused 7 (delta 0), pack-reused 41946
Receiving objects: 100% (42008/42008), 61.03 MiB | 128.00 KiB/s, done.
Resolving deltas: 100% (25108/25108), done.

You should now have a docx4j directory, containing the docx4j source code.

Step 2. Next, to get docx4j’s dependencies, you’ll need Maven.

So first, install Maven (if you don’t have it already).  Please see the instructions at maven.apache.org (actually, you’ve already got Maven in Eclipse, but its a bit hard to use from the command line).

Now you can go into your docx4j directory, and type:

mvn install -DskipTests=true

You’ll see Maven download docx4j’s dependencies.

Step 3. Now you are ready to start Eclipse.

Because docx4j includes Eclipse project definition files, you can import the docx4j project.

From the File menu, click Import, then Existing Projects into Workspace:

Browse to your docx4j directory:

Then click Finish.

Now the project should be set up correctly.  If you see errors, please refer further below for troubleshooting.

Eclipse only approach

Step 1. clone the docx4j git repo in Eclipse; for this you need the Git Repositories View:

Window > Show View > Other > Git > Git Repositories View

Click “Clone a Git repository” then enter the URI for docx4j (your fork, or Plutext’s), then click Next.

The master branch is probably all you need (though Eclipse will probably fetch all the others at some point anyway!)

Step 2. On the next screen, you can tick “Import all existing projects after clone finishes“.  (if you don’t do that, you’ll have to manually File > Import, then Existing Projects into Workspace, as explained above)

Step 3. Eclipse will now start building the project; first Maven will get the dependencies.

This may take a while … to see what Eclipse is doing while it displays the status “Building workspace”, from the Console view, click the drop down to see the Maven Console:

There you can watch it downloading stuff.

You can also look at the Progress view.

When its all done, you should have a docx4j project there and ready to go!

Troubleshooting

I don’t cover issues with git clone or maven here; just issues with Eclipse.

If Eclipse has a problem with your docx4j project, you’ll see an exclamation mark:

You can see further info in the Problems view; the most likely problem is that your Java is misconfigured:-

To fix this, on the docx4j project, click Alt-Enter to go into its properties.

Then click Java Build Path, then the Libraries tab.

Do you see a red cross next to JRE System Library, as above?

If so click on the JRE System Library entry to select it, then click the Remove button.

Next click Add Library, the JRE System Library, then add one (1.6 or above).

Note the warning:

That’s OK, we changed the JRE on the Java Build Path up above.

Hello World

Now you are ready to run some docx4j code.

A good place to start is to run  CreateWordprocessingMLDocument

Use docx4j in your own project

To use docx4j in your own project, there are 2 approaches:

  • the Maven way.  If you’re planning to use Maven, you just specify docx4j as a dependency, and if the version matches (look in pom.xml), it’ll use your docx4j project (assuming workspace resolution is switched on).  Please see hello-maven-central (if you’re already using a recent Eclipse, you can start at step 4) and/or docx4j-3-0-and-maven (but do use the version specified in pom.xml)
  • or, via the Java Build Path > Projects tab.

Feb 10 2015

High fidelity PDF output

This post introduces our new commercial component for docx to PDF output.

The background is that docx4j’s standard method of producing PDF output has been via XSL FO, using Apache FOP.

This has worked well enough for some docx4j users, but it has certain limitations which can bite you, for example lack of tab and tab stop support in XSL FO.

And because there are differences between FOP’s layout engine and Word’s, page breaks may fall in different places.

This means the FO based PDF output in docx4j is about as good as its going to get (short of enhancing the FO renderer).

To do better, we’ve had to invest in a non-FO approach, using layout algorithms specifically designed to give the same results Word does.

You can try it now.

A side benefit is that this new approach is much faster than the FO approach.

The component is actually independent of docx4j.  This means it’ll also work great  if you need to convert docx to PDF from C# (without Word), Python, PHP etc.

Pricing is at plutext.com

Jan 20 2015

Content controls for business data connectivity

Sometimes, Word is a natural way for people to interact with back end applications (eg SAP).

This is particularly so when:

  • business data will be output to a Word document,
  • the user is more familiar with Word than the other system,
  • certain data updates may be required (and are permitted)

So maybe there are 4 high level categories:

  • apps which support commercial transactions (a recipient will receive a Word document), eg
    • employment onboarding (letter of employment)
    • invoicing
  • apps with a reporting component:  is the report format a natural interface if it was made bi-directional?
  • workflow/BPM systems which present documents (work orders, proposals, approvals etc)
  • policy/procedures in regulated industries, where a worker must follow a series of steps.  Can present that in a docx; the worker can tick the steps off as they do them
    • related training scenarios?

Microsoft had an emphasis on what they then called “Office Business Applications” back around the Office 2007 launch.  Fast forward to today, and “business connectivity services” are part of SharePoint.

But you can achieve the same sort of thing without SharePoint, using docx4j and data bound content controls.

Once you have your back end data in an XML format (and there are many tools/techniques to help with this), you can use content controls to bind what the user sees in Word to elements in that XML.

The beauty of it is that the binding is bi-directional, so if the user edits the document, the XML is updated (ie it stays in synch).

After the user has made their edit, you can update the back end application.  Typically, you’d do this after they saved & closed the document (ie outside Word, using docx4j), but you could also do it from within Word (a less good approach, but still, an option).

What if there is some data which the user shouldn’t be able to edit?  You simply lock the content control to prevent editing.

To quickly try out this approach, put together some sample XML, then upload it as explained here, to get a docx you can experiment with.

We’d loved to hear about how you might use this approach?

Jan 20 2015

I have my XML, now what?

A barrier to using  content control data binding has been getting a feel for what the solution might look like.  You have your XML data, but what do you do next?

The audience for this post is broader than docx4j users; its for anyone wanting to easily set up a template docx using XML data binding, just to get a feel for what it is like from an authoring/Word perspective.

How do you add content controls to your Word document, and map them to the XML data?

For docx4j purposes, you use an OpenDoPE Word AddIn (of which there are two you can try).

Alternatively, Word 2013 introduced the XML mapping task pane (and some additional file format features) .  The docx linked in my previous blog post highlighted some usability issues with the XML mapping pane.

This post presents a “fast start” way you can try.  Simply upload your XML file here, and it will give you back a docx, with content controls mapped to that sample XML.

To take the following XML for an invoice as an example:

You’d get back the following Word 2013 docx (note that Word 2010 silently strips the repeats without warning):

In “design mode”, it looks like:

Whether you’re in design mode or not, you can edit the content of a content control, then to satisfy yourself that things are working correctly, save the docx, then unzip it and see your altered data in /customxml/item1.xml

This idea is that once you have the basic docx, you can quickly/easily edit it in Word to make it prettier:

The tool recognises and uses the following content control types:

  • repeats: automatically detect repeats and set up suitable content control structures (we use a table if there’d be more than one column) fully populated with the XML data
    • note that these are Word 2013 repeatingItem structures, not OpenDoPE repeats.  (We recommend you use OpenDoPE repeats, but the tool creates Word 2013 repeatingItems, so you can see what Word 2013 users get out of the box).
  • pictures/images: will use a picture content control if the field contains base64 encoded data, or the string PICTURE or string IMAGE
  • checkbox: generated if value is true or false
  • escaped Flat OPC XML: a bound rich text content control will be inserted if the element contains the string FLAT-OPC or the string   ?mso-application progid=”Word.Document
  • date control: used if the element content is the string DATE

Have fun.

Jan 17 2015

Word 2013 repeatingSection content controls – ready for prime time?

For developers wondering about Microsoft’s commitment to content controls, Office 2013 was certainly good news.

In Microsoft’s “What’s new for Word 2013 developers”, 2 of the 4 items were about content controls:

  • Enhancements to content controls
  • UI for XML mappings

And in http://blogs.office.com/2012/10/17/top-5-reasons-developers-will-love-word-2013/ 4 of the 5 reasons relate to content controls!

The MSDN article “What’s new with content controls in Word 2013” describes the changes in detail, but one of them was the introduction of repeating section content controls, which are comparable to OpenDoPE repeats (which docx4j has supported for ages).

The question is whether the time is ripe to migrate from OpenDoPE repeats to Word 2013 repeatingSection content controls?

My suggested answer is “no, or at least, not yet”, because

  1. Word 2010 strips out Word 2013 repeating content controls, and does so without warning!  Compare OpenDoPE repeats, which work in Word 2007, 2010 and 2013.  So until Word 2010 becomes irrelevant (or support is back ported), Word 2013 repeating content controls can’t be used in a generic solution.
  2. Word 2013 doesn’t handle the case of repeat zero times as you’d expect; it leaves a single instance, which will cause problems in many applications.

For authoring, the XML Mapping Pane in Word 2013 also leaves a bit to be desired.  For more details, please see w15RepeatingSection_cf_OpenDoPE.docx

Even so, docx4j 3.2.2 will support processing Word 2013 repeating content controls, for those who still choose to use them.

Dec 04 2014

Docx4jHelper Word AddIn

The dream:

  • View Open XML right from within Word, and see what happens when you edit it.
  • Or generate corresponding docx4j Java code, with deep links into the corresponding docx4j source code and Open XML spec.

Regular users of docx4j will be aware of our webapp, which amongst other things, generates docx4j Java code for the specified Open XML in your sample docx/pptx/xlsx.

The webapp is useful, but it has a few draw backs:

  • you have to upload your docx/pptx/xlsx, which takes time
  • if your docx/pptx/xlsx contains sensitive data, you probably want to remove that first
  • the webapp might be down

To address these issues, we’re now offering the code gen functionality as a Word AddIn.

If you install the Word AddIn, this means you can now generate code without your docx leaving your computer.

This is all feasible because docx4j can run as a DLL in a .NET project, thanks to IKVM!

Where to get it

You can download the installer.  After you complete the landing form (using your corporate email address, not gmail etc), you’ll be sent a download link.

Getting Started

After a successful installation, after restarting Word, you should see a “Docx4j” menu, containing:

To generate code, first press the “Load Helper” button.

You’ll see the following form:

Its inviting you to start a local web server which will run the same code as the existing webapp.  Just choose a port you aren’t already using.  If for some reason you want to browse using Internet Explorer (as opposed to your default browser), check the box.

It’ll take a little while to start the server; you’ll see a dialog when its started.

Now you can generate code.  To do so, select something in your docx, then click the “Generate Code” button.

After a while, a window will open in your web browser, and you’ll see:

That’s the view of the docx package, which will be familiar if you’ve used the webapp.   For how to generate code from here, see our earlier post.

Code generating is done on your computer.  (But note, the links on that page to docx4j source code and the OpenXML spec are external links)

What about the “Edit OpenXML” button?

If you select something in your docx, then click that button, after a while (maybe 30 secs the first time!), you’ll see the corresponding XML in an editor window:

You can go ahead and edit it, then click the “Apply” button.

If Word likes your XML, you’ll see your changes on the document surface.  Ctrl Z should work for undo.

So there are 2 ways to see the underlying XML

The first way we described uses your web browser; the second is a Windows Form.

These two views have different features; maybe a later release will unify them?

What about pptx, xlsx?

There’s no reason in principle we couldn’t make a similar AddIn for Powerpoint and Excel.  In fact, we plan to make these, once any teething issues have been ironed out in the WordAddIn.

In the meantime, for pptx and xlsx, you can continue to use the webapp.

Help, Suggestions and other Discussion

If you are a Plutext customer experiencing an issue, please email support@plutext.com

Otherwise, please check the Docx4jHelper AddIn forum.

We’ve got some ideas for where the AddIn goes from here, but we’d love to hear yours.

Oct 04 2014

Web-based docx editing?

Following on from the previous post on content tracking, some people have been asking about how to edit a docx in a web browser.

So I thought I’d link to a proof of concept we did a year or so ago.

The idea is:

  • use docx4j to convert the docx to XHTML
  • use CKEditor to edit that XHTML in the web browser
  • on submit, convert the XHTML back to docx content

The general problem with converting to/from XHTML is the “impendance mismatch”.  That is, losing stuff during round trip.  This will be a familiar problem to anyone who has ever edited a docx in Google Docs or LibreOffice.

This demo addresses that problem by identifying docx content which CKEditor would mangle, and then on submit/save, using the original docx content for those bits.

In this demo, the problematic content is replaced with visual placeholders, so you can see it is there.

The intent is that you can add/edit text content in the browser, without other document content (headers/footers, text boxes etc) getting lost.

To give it a try, go to the upload page and choose a docx file from your computer

You should see your docx open with the CKEditor toolbars above it:

(In the demo and screenshot above, the grey “B” image represents a bookmark)

Make some edits, then hit the Submit button (at the bottom).

The docx will be streamed back to your computer as a download in your browser.

Now open it in Word, and compare it to the original.

Feedback

If you want to add this type of functionality to your application, please let us know by emailing jharrop@plutext.com

We’d love to hear:

  • a bit more about your use case,
  • where you see your users doing their web-based editing:- on your intranet, extranet, or the web at large?
  • what kind of editing? is it proof reading,  customising particular sections, a step in a workflow..?
  • do you need to cater for iPads or Android tablets?  And if so, is a dedicated app on your roadmap?
  • any additional requirements you might have!

Sep 08 2014

XHTML-docx roundtrip: content tracking

There are a couple of common use cases for docx4j’s XHTML import capability:

The first is enabling a webapp with HTML reporting to output/export reports in Word’s docx format.  With docx4j, you can get really nice results doing this, especially if your XHTML has @class which map to Word styles.

The second – to support web based editing – is the subject of this post.  In a full incarnation, the vision is:

  • be able to edit the content in Word or in the web browser (using an XHTML editor such as CKEditor)
  • track chunks of content, perhaps for workflow/approval processes, version control, or re-use

docx4j can help you with this vision in a Java or .NET (eg C#) environment.

Web based XHTML editing is well understood, so here I’ll focus on tracking chunks of content.

In XHTML, its straightforward.  You can add div elements (eg <div id=”contentXYZ”>) to your heart’s content.  And you can nest them (think book, chapter, section, sub-section).

How to track that ID to or from docx format?

The answer: content controls.

Bookmarks are another possibility, but I wouldn’t recommend them for this purpose, because it is easy for a user to delete them, or inadvertently insert extra bookmarks.  They lack the rich features of content controls (eg locking), and aren’t very “XMLy” (they are pairs of start and end point tags which create additional challenges).

So, back to content controls.

Content controls are analogous to divs.  They have IDs; you can nest them; etc.

Content controls aside, the docx file format is flat.  Its a sequence of paragraphs and tables.  Its only inside tables that paragraphs also appear (and nested tables).

So, all we need to do convert divs to content controls, and vice versa.

This post tells you how to do that with docx4j.

XHTML to docx (div to content control)

For XHTML to docx, you use docx4j-ImportXHTML

div to content control support was added after 3.2.0’s release, in this commit.  So for now, you need to build from source, or use a nightly build.

Once you have that, to use it, do something like:

XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
XHTMLImporter.setDivHandler(new DivToSdt());

That implementation will convert div elements to content controls, and place @id and @class values into the content control’s w:tag, for example “class=class1&id=myid”

You can extend DivToSdt with any extra functionality/logic you might require, such as locking the content control for editing/deletion.

docx to XHTML (content control to div)

The content control to div functionality has been present for a lot longer.

For that, you use docx4j to generate XHTML output in the usual way, but first you invoke SdtWriter.registerTagHandler

See the sample DivRoundtrip.java for a fully worked example of divs to content controls, then back to divs again.

The tag handler concept is to treat the content of the w:tag like an HTTP query string (key value pairs).

A tag handler is registered for a specific key (eg ‘id’, ‘class’) or the wildcards (‘*’, ‘**’), and will only execute if the key is found in the w:tag.

For this example, we want our tag handler to insert a div depending on both class and id keys, so we register it as ‘*’ (we don’t want 2 handlers, which might result in 2 divs).

A tag handler with double asterisk ‘**’ will always be applied if you need that.  See the SdtWriter source code for definitive behaviour.

Sep 05 2014

C#/.NET: Import XHTML into docx without Word

How to convert import HTML into a Word document without using Microsoft Word?

Honouring the CSS, so the Word document looks similar to the input XHTML.  Alternatively, converting @class values to Word styles.

Its a common requirement in our increasingly web-centric world.

docx4j-ImportXHTML.NET is open source (LGPL v2.1 or later), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; with very little effort, you can run it from a sample project in Visual Studio.  Its very easy, because docx4j-ImportXHTML.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, its installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j-ImportXHTML.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the docx4j-ImportXHTML DLL, and the file src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs.  Most of the rest of the stuff comes from the docx4j dependency, which NuGet fetches.

If you have a look at ConvertInXHTMLFragment.cs, you’ll see it contains

Let’s run it, to convert that xhtml to docx content.

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in something like:

You can see there the WordML equivalent for the tail of the XHTML list we were converting.

Obviously, you can modify src/samples/c_sharp/Docx4NET/DocxToPDF.cs to read your own XHTML.

A few comments.

Well formed XML! Only well formed XML works, ie XHTML, not tag-soup HTML.  If you have tag soup, its your responsibility to convert that to XHTML with some tidy tool.   You’ll get a SAXParseException if your input is not well formed.

Word styles: if the target docx contains a style matching @class, it can be used.  This’ll be the subject of a separate blog post.

Other examples: the Java repository on GitHub contains examples for reading from a file etc.  Converting these to C# is left as an exercise for the reader.  If you do that, we’d be delighted to receive a pull request on https://github.com/plutext/docx4j-ImportXHTML.NET

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in  DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving XHTML import support. To implement a new feature in the XHTML import, typically you’d make the improvement to docx4j-ImportXHTML first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j-ImportXHTML is on GitHub, and is most easily setup using Maven (see earlier blog post).

Alternatives. There are a couple of projects on CodePlex you could try:

I’d be interested in feedback on how they compare.

Help/support/discussion. You can post in the docx4j XHTML import forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, xhtml etc as you think appropriate).  Please don’t cross post at both!