Sep 05 2014

C#/.NET: Import XHTML into docx without Word

How do you import HTML into a Word document without using Microsoft Word?

Honouring the CSS, so the Word document looks similar to the input XHTML.  Alternatively, converting @class values to Word styles.

It’s a common requirement in our increasingly web-centric world.

docx4j-ImportXHTML.NET is open source (LGPL v2.1 or later), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test: with very little effort, you can run it from a sample project in Visual Studio, because docx4j-ImportXHTML.NET is in the NuGet repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j-ImportXHTML.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the docx4j-ImportXHTML DLL, and the file src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs.  Most of the rest of the stuff comes from the docx4j dependency, which NuGet fetches.

If you have a look at ConvertInXHTMLFragment.cs, you’ll see it contains

Let’s run it, to convert that XHTML to docx content.

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in something like:

You can see there the WordML equivalent for the tail of the XHTML list we were converting.

You can, of course, modify src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs to read your own XHTML.
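For instance, reading the XHTML from a file and saving the result as a docx might look roughly like the sketch below. It assumes the docx4j Java API as IKVM exposes it to C# (WordprocessingMLPackage.createPackage, XHTMLImporterImpl.convert, save), and that the Java package org.docx4j.convert.in.xhtml needs the @in escape in C#. The class name and both default file paths are hypothetical. It only compiles inside a project that has the NuGet package installed.

```csharp
using System;
using System.IO;
using org.docx4j.convert.@in.xhtml;        // "in" is a C# keyword, hence the @ escape (assumed mapping)
using org.docx4j.openpackaging.packages;

class ImportXhtmlFile
{
    static void Main(string[] args)
    {
        // Hypothetical default paths, for illustration only.
        string inputPath = args.Length > 0 ? args[0] : "input.xhtml";
        string outputPath = Path.GetFullPath(args.Length > 1 ? args[1] : "OUT_from_XHTML.docx");

        // The content must be well-formed XML (XHTML), not tag soup.
        string xhtml = File.ReadAllText(inputPath);

        // Start from an empty docx package and attach an importer to it.
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
        XHTMLImporterImpl importer = new XHTMLImporterImpl(wordMLPackage);

        // The second argument is a base URL for resolving relative references; none is given here.
        wordMLPackage.getMainDocumentPart().getContent().addAll(importer.convert(xhtml, null));

        // Write the docx to disk.
        wordMLPackage.save(new java.io.File(outputPath));
        Console.WriteLine("Wrote " + outputPath);
    }
}
```

If you try this, remember to set the startup object to this class, just as for the bundled sample.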

A few comments.

Well-formed XML! Only well-formed XML works, ie XHTML, not tag-soup HTML.  If you have tag soup, it’s your responsibility to convert that to XHTML with some tidy tool.  You’ll get a SAXParseException if your input is not well-formed.
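If you would rather catch the problem before it reaches the importer, a quick pre-check with the .NET XML reader is one option. This is only a sketch using System.Xml, not part of docx4j; the class and method names are made up for illustration. Note that with default reader settings a DOCTYPE declaration is rejected too, so it suits fragments like the sample’s rather than full documents.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XhtmlCheck
{
    // Returns null when the input parses as a well-formed XML document,
    // otherwise a short message with the position of the first problem.
    public static string FindWellFormednessError(string xhtml)
    {
        try
        {
            // Default settings: one root element required, DOCTYPE not allowed.
            using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
            {
                // Reading to the end forces the parser over the whole input.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException e)
        {
            return "line " + e.LineNumber + ", position " + e.LinePosition + ": " + e.Message;
        }
    }
}
```

Calling XhtmlCheck.FindWellFormednessError(xhtml) before the conversion and logging any non-null result gives a .NET-side error message instead of a Java stack trace.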

Word styles: if the target docx contains a style matching @class, it can be used.  This’ll be the subject of a separate blog post.
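In the meantime, if you want to know which styles your target docx would need, you can list the distinct @class values in your XHTML. The sketch below uses only System.Xml.Linq; the class and method names are hypothetical, and it says nothing about how the importer itself does the matching.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class ClassLister
{
    // Returns the distinct class names used in the XHTML, in document order.
    public static List<string> DistinctClassNames(string xhtml)
    {
        XElement root = XElement.Parse(xhtml);   // throws XmlException if not well-formed
        List<string> names = new List<string>();
        char[] whitespace = new char[] { ' ', '\t', '\r', '\n' };

        foreach (XElement element in root.DescendantsAndSelf())
        {
            XAttribute attr = element.Attribute("class");
            if (attr == null) continue;

            // A class attribute may hold several space-separated names.
            foreach (string name in attr.Value.Split(whitespace, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!names.Contains(name)) names.Add(name);
            }
        }
        return names;
    }
}
```

You could then compare that list against the styles defined in your target docx.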

Other examples: the Java repository on GitHub contains examples for reading from a file etc.  Converting these to C# is left as an exercise for the reader.  If you do that, we’d be delighted to receive a pull request on

Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving XHTML import support. To implement a new feature in the XHTML import, typically you’d make the improvement to docx4j-ImportXHTML first (ie the Java version), then create a new DLL using the ant build target dist.NET.  docx4j-ImportXHTML is on GitHub, and is most easily set up using Maven (see earlier blog post).

Alternatives. There are a couple of projects on CodePlex you could try:

I’d be interested in feedback on how they compare.

Help/support/discussion. You can post in the docx4j XHTML import forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, xhtml etc as you think appropriate).  Please don’t cross-post to both!
