Oct 04 2014

Web-based docx editing?

Following on from the previous post on content tracking, some people have been asking about how to edit a docx in a web browser.

So I thought I’d link to a proof of concept we did a year or so ago.

The idea is:

  • use docx4j to convert the docx to XHTML
  • use CKEditor to edit that XHTML in the web browser
  • on submit, convert the XHTML back to docx content

The general problem with converting to/from XHTML is the “impedance mismatch”: content that one format can’t represent gets lost during the round trip.  This will be a familiar problem to anyone who has ever edited a docx in Google Docs or LibreOffice.

This demo addresses that problem by identifying docx content which CKEditor would mangle, and then on submit/save, using the original docx content for those bits.

In this demo, the problematic content is replaced with visual placeholders, so you can see it is there.

The intent is that you can add/edit text content in the browser, without other document content (headers/footers, text boxes etc) getting lost.
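
To make the idea concrete, here is a minimal Java sketch of the placeholder approach.  It is not the demo’s code: the class name PlaceholderRoundTrip, the [[keep:N]] token format and the string-based handling are all assumptions made for illustration (the demo itself uses visual placeholders in the XHTML).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of docx4j or the demo.
final class PlaceholderRoundTrip {

    // Assumed token format: [[keep:0]], [[keep:1]], ...
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\[\\[keep:(\\d+)\\]\\]");

    // Original docx fragments, keyed by placeholder id.
    private final Map<String, String> originals = new LinkedHashMap<>();

    /** Stores a fragment the editor would mangle; returns the token to show instead. */
    String protect(String originalFragment) {
        String id = String.valueOf(originals.size());
        originals.put(id, originalFragment);
        return "[[keep:" + id + "]]";
    }

    /** On submit, swaps each known token back to the stored original fragment. */
    String restore(String edited) {
        Matcher m = PLACEHOLDER.matcher(edited);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String original = originals.get(m.group(1));
            // Unknown ids are left untouched rather than dropped.
            String replacement = (original != null) ? original : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        PlaceholderRoundTrip rt = new PlaceholderRoundTrip();
        String token = rt.protect("<w:bookmarkStart w:id=\"0\" w:name=\"b\"/>");
        String edited = "<p>Edited text " + token + "</p>";
        System.out.println(rt.restore(edited));
    }
}
```

The point of the design is that the editor only ever sees the token, so it has no opportunity to alter the content the token stands for.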

To give it a try, go to the upload page and choose a docx file from your computer.

You should see your docx open with the CKEditor toolbars above it:

(In the demo and screenshot above, the grey “B” image represents a bookmark)

Make some edits, then hit the Submit button (at the bottom).

The docx will be streamed back to your computer as a download in your browser.

Now open it in Word, and compare it to the original.

Feedback

If you want to add this type of functionality to your application, please let us know by emailing jharrop@plutext.com

We’d love to hear:

  • a bit more about your use case,
  • where you see your users doing their web-based editing:- on your intranet, extranet, or the web at large?
  • what kind of editing? is it proof reading,  customising particular sections, a step in a workflow..?
  • do you need to cater for iPads or Android tablets?  And if so, is a dedicated app on your roadmap?
  • any additional requirements you might have!

Sep 08 2014

XHTML-docx roundtrip: content tracking

There are a couple of common use cases for docx4j’s XHTML import capability:

The first is enabling a webapp with HTML reporting to output/export reports in Word’s docx format.  With docx4j, you can get really nice results doing this, especially if your XHTML has @class which map to Word styles.

The second – to support web based editing – is the subject of this post.  In a full incarnation, the vision is:

  • be able to edit the content in Word or in the web browser (using an XHTML editor such as CKEditor)
  • track chunks of content, perhaps for workflow/approval processes, version control, or re-use

docx4j can help you with this vision in a Java or .NET (eg C#) environment.

Web based XHTML editing is well understood, so here I’ll focus on tracking chunks of content.

In XHTML, it’s straightforward.  You can add div elements (eg <div id="contentXYZ">) to your heart’s content.  And you can nest them (think book, chapter, section, sub-section).

How to track that ID to or from docx format?

The answer: content controls.

Bookmarks are another possibility, but I wouldn’t recommend them for this purpose, because it is easy for a user to delete them, or inadvertently insert extra bookmarks.  They lack the rich features of content controls (eg locking), and aren’t very “XMLy” (they are pairs of start and end point tags which create additional challenges).

So, back to content controls.

Content controls are analogous to divs.  They have IDs; you can nest them; etc.

Content controls aside, the docx file format is flat.  It’s a sequence of paragraphs and tables.  The only nesting is inside tables, whose cells contain paragraphs (and possibly nested tables).

So, all we need to do is convert divs to content controls, and vice versa.

This post tells you how to do that with docx4j.

XHTML to docx (div to content control)

For XHTML to docx, you use docx4j-ImportXHTML.

div to content control support was added after the 3.2.0 release, in this commit.  So for now, you need to build from source, or use a nightly build.

Once you have that, to use it, do something like:

XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
XHTMLImporter.setDivHandler(new DivToSdt());

That implementation will convert div elements to content controls, and place @id and @class values into the content control’s w:tag, for example “class=class1&id=myid”
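
As a rough illustration of that w:tag format, here’s a small stand-alone Java sketch which builds such a value from a div’s @class and @id.  The class and method names are hypothetical, it doesn’t use docx4j at all, and it assumes the values need no escaping; DivToSdt’s actual implementation may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not docx4j's DivToSdt.
final class TagValueSketch {

    /** Joins the non-empty attributes as key=value pairs separated by '&'. */
    static String build(String cssClass, String id) {
        Map<String, String> pairs = new LinkedHashMap<>();
        if (cssClass != null && !cssClass.isEmpty()) {
            pairs.put("class", cssClass);
        }
        if (id != null && !id.isEmpty()) {
            pairs.put("id", id);
        }
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            // Assumption: values contain no '&' or '=' and so need no escaping.
            joiner.add(e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("class1", "myid")); // prints class=class1&id=myid
    }
}
```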

You can extend DivToSdt with any extra functionality/logic you might require, such as locking the content control for editing/deletion.

docx to XHTML (content control to div)

The content control to div functionality has been present for a lot longer.

For that, you use docx4j to generate XHTML output in the usual way, but first you invoke SdtWriter.registerTagHandler.

See the sample DivRoundtrip.java for a fully worked example of divs to content controls, then back to divs again.

The tag handler concept is to treat the content of the w:tag like an HTTP query string (key value pairs).
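
To show what “like an HTTP query string” means in practice, here is a minimal Java sketch which splits a w:tag value into key/value pairs.  It is only an illustration: TagQuerySketch is a made-up name, and the code is not taken from SdtWriter, which remains the definitive reference for how docx4j actually parses the tag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; see SdtWriter for docx4j's real behaviour.
final class TagQuerySketch {

    /** Parses "k1=v1&k2=v2" into an ordered map; a key with no '=' maps to "". */
    static Map<String, String> parse(String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        if (tag == null || tag.isEmpty()) {
            return result;
        }
        for (String pair : tag.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray separators such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                result.put(pair, "");
            } else {
                result.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = parse("class=class1&id=myid");
        System.out.println(pairs.get("class")); // prints class1
        System.out.println(pairs.get("id"));    // prints myid
    }
}
```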

A tag handler is registered for a specific key (eg ‘id’, ‘class’) or the wildcards (‘*’, ‘**’), and will only execute if the key is found in the w:tag.

For this example, we want our tag handler to insert a div depending on both class and id keys, so we register it as ‘*’ (we don’t want 2 handlers, which might result in 2 divs).

A tag handler with double asterisk ‘**’ will always be applied if you need that.  See the SdtWriter source code for definitive behaviour.

Sep 05 2014

C#/.NET: Import XHTML into docx without Word

How do you import HTML into a Word document without using Microsoft Word?

Ideally you’d honour the CSS, so the Word document looks similar to the input XHTML.  Alternatively, you could convert @class values to Word styles.

It’s a common requirement in our increasingly web-centric world.

docx4j-ImportXHTML.NET is open source (LGPL v2.1 or later), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test: with very little effort, you can run it from a sample project in Visual Studio, because docx4j-ImportXHTML.NET is in the NuGet.org repository.

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j-ImportXHTML.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the docx4j-ImportXHTML DLL, and the file src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs.  Most of the rest of the stuff comes from the docx4j dependency, which NuGet fetches.

If you have a look at ConvertInXHTMLFragment.cs, you’ll see it contains

Let’s run it, to convert that xhtml to docx content.

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in something like:

You can see there the WordML equivalent for the tail of the XHTML list we were converting.

Obviously, you can modify src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs to read your own XHTML.

A few comments.

Well formed XML! Only well formed XML works, ie XHTML, not tag-soup HTML.  If you have tag soup, it’s your responsibility to convert that to XHTML with a tidy tool first.   You’ll get a SAXParseException if your input is not well formed.
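
If you want to check for well-formedness before handing XHTML to the importer, something like the following Java sketch would do it using the JDK’s own XML parser.  The class name WellFormedCheck is made up for this example, and it is written in Java rather than C#, since that is what docx4j itself is written in.  It also assumes your input has no DOCTYPE or named entities (such as &nbsp;) which would need a DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical pre-check; not part of docx4j-ImportXHTML.
final class WellFormedCheck {

    /** Returns true if the string parses as well formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // This is the exception type mentioned above for malformed input.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not well-formedness errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<ul><li>one</li></ul>")); // prints true
        System.out.println(isWellFormed("<ul><li>one</ul>"));      // prints false
    }
}
```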

Word styles: if the target docx contains a style matching @class, it can be used.  This’ll be the subject of a separate blog post.

Other examples: the Java repository on GitHub contains examples for reading from a file etc.  Converting these to C# is left as an exercise for the reader.  If you do that, we’d be delighted to receive a pull request on https://github.com/plutext/docx4j-ImportXHTML.NET

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in  DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving XHTML import support. To implement a new feature in the XHTML import, typically you’d make the improvement to docx4j-ImportXHTML first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j-ImportXHTML is on GitHub, and is most easily setup using Maven (see earlier blog post).

Alternatives. There are a couple of projects on CodePlex you could try:

I’d be interested in feedback on how they compare.

Help/support/discussion. You can post in the docx4j XHTML import forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, xhtml etc as you think appropriate).  Please don’t cross post at both!


Sep 05 2014

docx to PDF in C#/.NET

How to convert docx to PDF without using Microsoft Word?

If your docx is mainly text, tables and images, docx4j.NET may work well for you.  docx4j.NET is open source (Apache software license v2), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; you can upload your docx to the docx4j demo webapp.

Or with very little effort, you can run it from a sample project in Visual Studio.  It’s very easy, because docx4j.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the file src/samples/c_sharp/Docx4NET/DocxToPDF.cs

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in “done! Press any key to continue..”

What just happened?  All being well, the sample docx “src\samples\resources\sample-docx.docx” was saved as a PDF “OUT_sample-docx.pdf” in your project directory.

You can modify src/samples/c_sharp/Docx4NET/DocxToPDF.cs to read your own test docx.

A few comments.

XSL FO; Apache FOP. docx4j creates PDF via XSL FO.  It generates XSL FO, then uses Apache FOP (v1.1) to convert the XSL FO to PDF.  FOP also supports other output formats (the subject of another blog post).

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in  DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving PDF support. To improve the quality of the PDF output, typically you’d make the improvement to docx4j first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j is on GitHub, and is most easily setup using Maven (see earlier blog post).

Help/support/discussion. You can post in the docx4j PDF output forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, pdf, fop, xslfo as you think appropriate).  Please don’t cross post at both!


May 12 2014

SQL Server Reporting Services (SSRS) emits dodgy Word docx documents

By now we’re used to products which emit docx files which are umm, not .. quite .. right.

But it’s more noteworthy when the product in question is from Microsoft.  After all, it’s their file format (ECMA etc standardisation notwithstanding).

The product in question here is SQL Server Reporting Services 2012 and its Word export.

It seems they didn’t bother to validate their documents (eg using Open XML SDK 2.0 Productivity Tool):

Apparently there’s a reason for this:

“Word and SSRS treat page headers and footers differently. Word actually positions them inside the page margins, whereas SSRS positions them inside the area that the margins surround. As a result, in Word, the page margins do not control the distance between the top edge of the page and that of the page header (or similarly for the page footer). Instead, Word has separate “Header from Top” and “Footer from Bottom” properties to control those distances. Since RDL does not have equivalent properties, the Word renderer sets these properties to zero.”
But the problem is that it is actually setting them to blank (as opposed to zero), which is not valid.

Another problem:

JAXB doesn’t like invalid documents, so docx4j has to fix these sorts of things before it can construct a content model.  (Maybe that’s why SSRS calls it Word export, not docx export:- they just check Word can open the document, then call it job done)

There are other problems with SSRS docx which the Productivity Tool doesn’t report.

Take a look at the styles part:

Notice anything wrong?  It’d be better if the EmptyCellLayoutStyle had @w:styleId and @w:type, like so:

It’d also be nice if it defined the “Normal” style it is basedOn!

docx4j and other consumers could/should detect such problems and degrade gracefully in the face of them, but Microsoft (of all companies!) should exercise better quality control.

Mar 16 2014

docx4j and Google Drive

Given the news this week about Google lowering prices per GB on Google Drive, I thought it would be timely to explore interop with docx4j.

https://github.com/plutext/docx4j-cloud-GoogleDrive is a small project which demonstrates:

Clone the project, and set it up using Maven in your IDE.  I’m not going to tell you how to do that.

Enabling the Drive API

From there, it is fairly straightforward (assuming you have a Google account); you just need to enable the Drive API.  Set up a project and application in the Developers Console:
  • press the red “CREATE NEW CLIENT ID” button, then choose application type “Installed Application”; I then chose subtype “Other”
  • hit the “Download JSON” button; save it as client_secret.json in your project dir

Run our code

OK, now try running Docx4jUploadToGoogleDrive

It ought to say something like:

Please open the following URL in your browser then type the authorization code:
https://accounts.google.com/o/oauth2/auth?access_type=online&client_id=622239…

Paste the auth code into your IDE’s console (System.in, probably the same place which displayed the above message) then press enter.  If you aren’t logged into your Google account in your browser, it’s at this point that you’ll be asked to log in.

The code will create a new docx file, and after uploading it, if successful, report the File ID allocated by Google Drive:

File ID: 0CyHdofN18p16OF9YWWNFUFdmTjg

The other 2 samples require you to provide an auth code the same way (each time you run them).  Obviously, you’d be more sophisticated than this in a production application.  See further https://developers.google.com/drive/web/about-auth

Nov 28 2013

docx4j 3.0 and Maven

blog/2011/10/hello-maven-central/ walks you through the basics of using docx4j in an Eclipse project with the help of m2eclipse.

This post is about the different ways you can set up docx4j 3.0 with the help of Maven.

We’ll be using the following skeleton pom.xml:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>your.group</groupId>
	<artifactId>your.artifact</artifactId>
	<name>nameless</name>
	<version>0.0.1-SNAPSHOT</version>
	<description>
		some description
	</description>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-dependency-plugin</artifactId>
				<version>2.0</version>
			</plugin>
		</plugins>
	</build>

	<dependencies>

		<!-- dependencies go here -->

	</dependencies>

</project>

Adding the core dependency

To use docx4j, including its LGPL XHTML import capability, just include the following dependency in your pom.xml:


		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-ImportXHTML</artifactId>
			<version>3.0.0</version>
		</dependency>
That’ll drag in docx4j, and all the other dependencies (you should be able to see them in Eclipse under Maven Dependencies, or by running mvn dependency:tree at a command prompt).

If you don’t want the XHTML import stuff, just use:

		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j</artifactId>
			<version>3.0.0</version>
		</dependency>
(You should consider adding a docx4j.properties to your classpath)

Logging

Both of the above default to using log4j.  If you are happy with log4j, you’ll want a log4j.xml file, unless you already have one on your classpath.  If you don’t, you can configure https://github.com/plutext/docx4j/blob/master/src/samples/_resources/log4j.xml to suit.

If you want to use something other than log4j for logging, you can, since docx4j uses slf4j.

First you need to exclude the log4j stuff.

		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-ImportXHTML</artifactId>
			<version>3.0.0</version>
			<exclusions>
				<exclusion>
					  <groupId>org.slf4j</groupId>
					  <artifactId>slf4j-log4j12</artifactId>
				</exclusion>
				<exclusion>
					<groupId>log4j</groupId>
					<artifactId>log4j</artifactId>
				</exclusion>
			</exclusions>
		</dependency>

Then you add in the dependencies for your other logging frameworks.   See further http://www.slf4j.org/ and slf4j in search.maven.org

JAXB

docx4j relies very heavily on JAXB.  With Java 6 or 7, usually it’ll use the JAXB included in that (though things can be different with application servers – see the deployment forums for details).

The point here is that there is an alternative JAXB implementation, called EclipseLink MOXy (see http://www.eclipse.org/eclipselink/moxy.php), which is very well supported by its developers.  You can try it with docx4j.  To do so, just include the following additional dependencies:
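
		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-MOXy-JAXBContext</artifactId>
			<version>3.0.0</version>
		</dependency>
		<dependency>
			<groupId>org.eclipse.persistence</groupId>
			<artifactId>org.eclipse.persistence.moxy</artifactId>
			<version>2.5.1</version>
		</dependency>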

Since using MOXy with docx4j is all quite new, you may run into some minor issues.  If you do, please let us know in the docx4j forums (with sufficient info for us to reproduce what you are seeing!).  Thanks.

Nov 26 2013

docx4j 3.0 released

On behalf of everyone who has contributed to docx4j, Plutext is pleased to announce that version 3 was released today.

You can get it from Maven Central, or from http://www.docx4java.org/docx4j/ (the jar, the dependencies, or everything including documentation zipped up)

Source code is available at GitHub or from the Maven Central link above.  Javadoc is at Maven Central.

For what you need to know about docx4j 3.0, please see this post.

The XHTML Import stuff is now a separate project (since it and its dependencies are LGPL, not ASLv2 like docx4j).

  • the three jars you need (docx4j-ImportXHTML, xhtmlrenderer, and iText) are included for convenience in the zip file above.  You can delete them if you don’t need or want XHTML import.
  • or you can get it from Maven Central

docx4j 3.0 uses slf4j for logging.  For convenience, log4j is the default implementation.  A follow-up post will explain more about logging config.

Thanks to everyone who has helped to make this release our best yet!

If you have questions pertaining to the use of docx4j, please post them in our forum, or on StackOverflow (rather than in comments to this post).

Nov 07 2013

docx4j 3.0 beta

A beta of docx4j 3.0 is now available, at:

http://www.docx4java.org/docx4j/docx4j-3_0-beta2.zip [link updated 15 Nov]

That zip file contains docx4j, and all its dependencies.  To use it, add all the jars to your classpath.

Alternatively, Maven users can get the beta from our staging repo on GitHub.

<repositories>
    <repository>
        <id>docx4j-mvn-repo</id>
        <url>https://raw.github.com/plutext/docx4j/mvn-repo/</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>
</repositories>

docx4j 3.0 beta is:


<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j</artifactId>
    <version>3.0.0-SNAPSHOT</version>
</dependency>

Our last blog post outlines the major things to be aware of in v3.

Additional notes:

  • For convenience, the zip file also contains docx4j-ImportXHTML, and its dependencies, which are LGPL.  You can delete these if you wish.  They aren’t in the mvn staging repo.
  • To see any logging, you’ll need to add an slf4j implementation.
  • You might want to add a docx4j.properties file

You can find an updated Getting Started guide in docx|pdf formats at http://www.docx4java.org/docx4j/.

Feedback welcome.  You can reply here, or to the post in the docx4j forums.

All going smoothly, we’ll progress to final release over the next couple of weeks, so the sooner your feedback, the better!

Oct 18 2013

docx4j 3.0 – what you need to know

docx4j 3.0 (a beta of which will be available shortly) contains a lot of changes, some big, some small.

Here are the most visible (see our changelog for the rest):

Logging

docx4j 3.0 uses slf4j, instead of log4j.

As the slf4j website puts it:

The Simple Logging Facade for Java (SLF4J) serves as a simple facade or abstraction for various logging frameworks (e.g. java.util.logging, logback, log4j) allowing the end user to plug in the desired logging framework at deployment time.

So you need the slf4j api jar on your classpath:

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.5</version>
</dependency>

If you want to use log4j, then include it, along with the slf4j binding for it:

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.5</version>
</dependency>

XHTML Import

The XHTML Import functionality is now a separate project on GitHub.

The reason being that its main dependency – Flying Saucer - is licensed under LGPL v2.1 (as opposed to ASL v2, which docx4j’s other dependencies use).

If you want this functionality, you have to add these jars to your classpath.  We’ll update this post with their coordinates once they are in Maven Central.

Docx4j facade

3.0 contains a facade providing clean access to some typical uses of docx4j:
  • Loading a document
  • Saving a document
  • Binding xml to content controls in a document
  • Exporting the document (to HTML, or PDF and other formats supported by the FO renderer)

You don’t have to use this (existing code should continue to work), but the facade is the right way to do things.  Behind the facade is a major rethink/cleanup of the export architecture/implementation, contributed by Alberto.

MOXy

The key technology underlying docx4j – and a major differentiator from Apache POI – is JAXB.

There is a JAXB reference implementation; the JAXB baked into Java 6 and 7 is based on it.

Prior to v3, you had to use the reference implementation, or the implementation included in the JDK.

With v3, you can choose to use EclipseLink MOXy instead.  To do so, simply include docx4j-MOXy-JAXBContext-3.0.0.jar and the MOXy jars on your classpath.

Sample code

The docx4j samples have relocated to src/samples