Feb 10 2015

High fidelity PDF output

This post introduces our new commercial component for docx to PDF output.

The background is that docx4j’s standard method of producing PDF output has been via XSL FO, using Apache FOP.

This has worked well enough for some docx4j users, but it has certain limitations which can bite you, for example lack of tab and tab stop support in XSL FO.

And because there are differences between FOP’s layout engine and Word’s, page breaks may fall in different places.

This means the FO based PDF output in docx4j is about as good as it’s going to get (short of enhancing the FO renderer).

To do better, we’ve had to invest in a non-FO approach, using layout algorithms specifically designed to give the same results Word does.

You can try it now.

A side benefit is that this new approach is much faster than the FO approach.

The component is actually independent of docx4j.  This means it’ll also work great if you need to convert docx to PDF from C# (without Word), Python, PHP etc.

Pricing is at plutext.com

Jan 20 2015

Content controls for business data connectivity

Sometimes, Word is a natural way for people to interact with back end applications (eg SAP).

This is particularly so when:

  • business data will be output to a Word document,
  • the user is more familiar with Word than the other system,
  • certain data updates may be required (and are permitted)

So maybe there are 4 high level categories:

  • apps which support commercial transactions (a recipient will receive a Word document), eg
    • employment onboarding (letter of employment)
    • invoicing
  • apps with a reporting component:  is the report format a natural interface if it was made bi-directional?
  • workflow/BPM systems which present documents (work orders, proposals, approvals etc)
  • policy/procedures in regulated industries, where a worker must follow a series of steps.  You can present the steps in a docx; the worker can tick them off as they do them
    • related training scenarios?

Microsoft had an emphasis on what they then called “Office Business Applications” back around the Office 2007 launch.  Fast forward to today, and “business connectivity services” are part of SharePoint.

But you can achieve the same sort of thing without SharePoint, using docx4j and data bound content controls.

Once you have your back end data in an XML format (and there are many tools/techniques to help with this), you can use content controls to bind what the user sees in Word to elements in that XML.

The beauty of it is that the binding is bi-directional, so if the user edits the document, the XML is updated (ie it stays in synch).
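To make the binding idea a little more concrete, here is a minimal sketch using only the JDK’s XPath API.  A bound content control is, in essence, an XPath pointing into the XML; reading through the XPath gives what Word displays, and writing through it is what a user edit amounts to.  The invoice XML, the element names and the class/method names are made up for illustration; this is not docx4j’s (or Word’s) binding code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BindingSketch {

    // Hypothetical sample data; your back end XML will differ.
    static final String SAMPLE_XML =
        "<invoice><customer><name>Acme Pty Ltd</name></customer>"
      + "<total>120.50</total></invoice>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Find the single node a binding XPath points at (null if none).
    static Node resolve(Document data, String xpath) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return (Node) xp.evaluate(xpath, data, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    // Data -> document: the text a bound content control would display.
    static String read(Document data, String xpath) {
        Node n = resolve(data, xpath);
        return n == null ? null : n.getTextContent();
    }

    // Document -> data: the effect of the user editing the control.
    static void write(Document data, String xpath, String newValue) {
        Node n = resolve(data, xpath);
        if (n == null) {
            throw new IllegalArgumentException("Nothing at " + xpath);
        }
        n.setTextContent(newValue);
    }

    public static void main(String[] args) {
        Document data = parse(SAMPLE_XML);
        String binding = "/invoice/customer/name";
        System.out.println("Shown in Word: " + read(data, binding));
        write(data, binding, "Acme Holdings");
        System.out.println("After edit:    " + read(data, binding));
    }
}
```

In a real docx the XML lives in a custom XML part inside the package, and Word keeps it in synch for you; the sketch only shows why a single XPath is enough to go in both directions.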

After the user has made their edit, you can update the back end application.  Typically, you’d do this after they saved & closed the document (ie outside Word, using docx4j), but you could also do it from within Word (a less good approach, but still, an option).

What if there is some data which the user shouldn’t be able to edit?  You simply lock the content control to prevent editing.

To quickly try out this approach, put together some sample XML, then upload it as explained here, to get a docx you can experiment with.

We’d love to hear how you might use this approach.

Jan 20 2015

I have my XML, now what?

A barrier to using content control data binding has been getting a feel for what the solution might look like.  You have your XML data, but what do you do next?

The audience for this post is broader than docx4j users; it’s for anyone wanting to easily set up a template docx using XML data binding, just to get a feel for what it is like from an authoring/Word perspective.

How do you add content controls to your Word document, and map them to the XML data?

For docx4j purposes, you use an OpenDoPE Word AddIn (of which there are two you can try).

Alternatively, Word 2013 introduced the XML mapping task pane (and some additional file format features).  The docx linked in my previous blog post highlighted some usability issues with the XML mapping pane.

This post presents a “fast start” way you can try.  Simply upload your XML file here, and it will give you back a docx, with content controls mapped to that sample XML.

To take the following XML for an invoice as an example:

You’d get back the following Word 2013 docx (note that Word 2010 strips the repeats without warning):

In “design mode”, it looks like:

Whether you’re in design mode or not, you can edit the content of a content control.  To satisfy yourself that things are working correctly, save the docx, unzip it, and look for your altered data in /customXml/item1.xml

The idea is that once you have the basic docx, you can quickly/easily edit it in Word to make it prettier:
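If you would rather not unzip by hand, the same check can be scripted.  The following is a minimal sketch using only java.util.zip; the class and method names are our own invention, and it assumes the part is UTF-8 encoded.  The entry name is matched ignoring case and any leading slash.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class CustomXmlPeek {

    // Returns the text of the first zip entry whose name matches
    // (ignoring case and a leading slash), or null if there is none.
    static String readEntry(InputStream docx, String entryName) {
        String wanted = entryName.startsWith("/") ? entryName.substring(1) : entryName;
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equalsIgnoreCase(wanted)) {
                    continue;
                }
                // Copy the bytes of this entry only.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Assumption: the custom XML part is UTF-8.
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java CustomXmlPeek.java your.docx");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(readEntry(in, "customXml/item1.xml"));
        }
    }
}
```

Run it before and after editing a content control in Word, and compare the two outputs.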

The tool recognises and uses the following content control types:

  • repeats: automatically detect repeats and set up suitable content control structures (we use a table if there’d be more than one column) fully populated with the XML data
    • note that these are Word 2013 repeatingItem structures, not OpenDoPE repeats.  (We recommend you use OpenDoPE repeats, but the tool creates Word 2013 repeatingItems, so you can see what Word 2013 users get out of the box).
  • pictures/images: will use a picture content control if the field contains base64 encoded data, or the string PICTURE or string IMAGE
  • checkbox: generated if value is true or false
  • escaped Flat OPC XML: a bound rich text content control will be inserted if the element contains the string FLAT-OPC or the string   ?mso-application progid=”Word.Document
  • date control: used if the element content is the string DATE
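To show how the rules in the list above might fit together, here is a rough sketch of a classifier for a single element value.  It is not the tool’s actual code: the order of the checks, the exact-match tests, the base64 heuristic and its length threshold are all our assumptions.  The ordering matters, because short marker strings such as DATE or true also happen to be valid base64.

```java
import java.util.Base64;

class ControlTypeGuess {

    enum ControlType { PICTURE, CHECKBOX, RICH_TEXT, DATE, PLAIN_TEXT }

    // Hypothetical threshold: shorter values are never treated as base64,
    // since ordinary words such as "DATE" are also valid base64.
    static final int MIN_BASE64_LENGTH = 16;

    static ControlType guess(String value) {
        String v = value == null ? "" : value.trim();
        if (v.equals("true") || v.equals("false")) {
            return ControlType.CHECKBOX;
        }
        if (v.equals("DATE")) {
            return ControlType.DATE;
        }
        // Loose test for escaped Flat OPC content (an assumption).
        if (v.contains("FLAT-OPC")
                || (v.contains("mso-application") && v.contains("Word.Document"))) {
            return ControlType.RICH_TEXT;
        }
        if (v.equals("PICTURE") || v.equals("IMAGE") || looksLikeBase64(v)) {
            return ControlType.PICTURE;
        }
        return ControlType.PLAIN_TEXT;
    }

    // Heuristic only: long enough, and decodes without error.
    static boolean looksLikeBase64(String v) {
        String compact = v.replaceAll("[\\r\\n]", "");
        if (compact.length() < MIN_BASE64_LENGTH) {
            return false;
        }
        try {
            Base64.getDecoder().decode(compact);
            return true;
        } catch (IllegalArgumentException notBase64) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"true", "DATE", "IMAGE", "FLAT-OPC", "Acme Pty Ltd"};
        for (String s : samples) {
            System.out.println(s + " -> " + guess(s));
        }
    }
}
```

A long run of letters and digits with no spaces would be misread as a picture by this heuristic, so treat it as a starting point only.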

Have fun.

Jan 17 2015

Word 2013 repeatingSection content controls – ready for prime time?

For developers wondering about Microsoft’s commitment to content controls, Office 2013 was certainly good news.

In Microsoft’s “What’s new for Word 2013 developers”, 2 of the 4 items were about content controls:

  • Enhancements to content controls
  • UI for XML mappings

And in http://blogs.office.com/2012/10/17/top-5-reasons-developers-will-love-word-2013/ 4 of the 5 reasons relate to content controls!

The MSDN article “What’s new with content controls in Word 2013” describes the changes in detail, but one of them was the introduction of repeating section content controls, which are comparable to OpenDoPE repeats (which docx4j has supported for ages).

The question is whether the time is ripe to migrate from OpenDoPE repeats to Word 2013 repeatingSection content controls?

My suggested answer is “no, or at least, not yet”, because

  1. Word 2010 strips out Word 2013 repeating content controls, and does so without warning!  Compare OpenDoPE repeats, which work in Word 2007, 2010 and 2013.  So until Word 2010 becomes irrelevant (or support is back ported), Word 2013 repeating content controls can’t be used in a generic solution.
  2. Word 2013 doesn’t handle the case of repeat zero times as you’d expect; it leaves a single instance, which will cause problems in many applications.

For authoring, the XML Mapping Pane in Word 2013 also leaves a bit to be desired.  For more details, please see w15RepeatingSection_cf_OpenDoPE.docx

Even so, docx4j 3.2.2 will support processing Word 2013 repeating content controls, for those who still choose to use them.

Dec 04 2014

Docx4jHelper Word AddIn

The dream:

  • View Open XML right from within Word, and see what happens when you edit it.
  • Or generate corresponding docx4j Java code, with deep links into the corresponding docx4j source code and Open XML spec.

Regular users of docx4j will be aware of our webapp, which amongst other things, generates docx4j Java code for the specified Open XML in your sample docx/pptx/xlsx.

The webapp is useful, but it has a few drawbacks:

  • you have to upload your docx/pptx/xlsx, which takes time
  • if your docx/pptx/xlsx contains sensitive data, you probably want to remove that first
  • the webapp might be down

To address these issues, we’re now offering the code gen functionality as a Word AddIn.

If you install the Word AddIn, this means you can now generate code without your docx leaving your computer.

This is all feasible because docx4j can run as a DLL in a .NET project, thanks to IKVM!

Where to get it

You can download the installer.  After you complete the landing form (using your corporate email address, not gmail etc), you’ll be sent a download link.

Getting Started

After a successful installation and a restart of Word, you should see a “Docx4j” menu, containing:

To generate code, first press the “Load Helper” button.

You’ll see the following form:

It’s inviting you to start a local web server which will run the same code as the existing webapp.  Just choose a port you aren’t already using.  If for some reason you want to browse using Internet Explorer (as opposed to your default browser), check the box.

It’ll take a little while to start the server; you’ll see a dialog when it’s started.

Now you can generate code.  To do so, select something in your docx, then click the “Generate Code” button.

After a while, a window will open in your web browser, and you’ll see:

That’s the view of the docx package, which will be familiar if you’ve used the webapp.   For how to generate code from here, see our earlier post.

Code generation is done on your computer.  (But note, the links on that page to docx4j source code and the OpenXML spec are external links.)

What about the “Edit OpenXML” button?

If you select something in your docx, then click that button, after a while (maybe 30 secs the first time!), you’ll see the corresponding XML in an editor window:

You can go ahead and edit it, then click the “Apply” button.

If Word likes your XML, you’ll see your changes on the document surface.  Ctrl Z should work for undo.

So there are 2 ways to see the underlying XML

The first way we described uses your web browser; the second is a Windows Form.

These two views have different features; maybe a later release will unify them?

What about pptx, xlsx?

There’s no reason in principle we couldn’t make a similar AddIn for PowerPoint and Excel.  In fact, we plan to make these, once any teething issues have been ironed out in the Word AddIn.

In the meantime, for pptx and xlsx, you can continue to use the webapp.

Help, Suggestions and other Discussion

If you are a Plutext customer experiencing an issue, please email support@plutext.com

Otherwise, please check the Docx4jHelper AddIn forum.

We’ve got some ideas for where the AddIn goes from here, but we’d love to hear yours.

Oct 04 2014

Web-based docx editing?

Following on from the previous post on content tracking, some people have been asking about how to edit a docx in a web browser.

So I thought I’d link to a proof of concept we did a year or so ago.

The idea is:

  • use docx4j to convert the docx to XHTML
  • use CKEditor to edit that XHTML in the web browser
  • on submit, convert the XHTML back to docx content

The general problem with converting to/from XHTML is the “impedance mismatch”: content gets lost during the round trip.  This will be a familiar problem to anyone who has ever edited a docx in Google Docs or LibreOffice.

This demo addresses that problem by identifying docx content which CKEditor would mangle, and then on submit/save, using the original docx content for those bits.

In this demo, the problematic content is replaced with visual placeholders, so you can see it is there.

The intent is that you can add/edit text content in the browser, without other document content (headers/footers, text boxes etc) getting lost.

To give it a try, go to the upload page and choose a docx file from your computer

You should see your docx open with the CKEditor toolbars above it:

(In the demo and screenshot above, the grey “B” image represents a bookmark)

Make some edits, then hit the Submit button (at the bottom).

The docx will be streamed back to your computer as a download in your browser.

Now open it in Word, and compare it to the original.

Feedback

If you want to add this type of functionality to your application, please let us know by emailing jharrop@plutext.com

We’d love to hear:

  • a bit more about your use case,
  • where you see your users doing their web-based editing:- on your intranet, extranet, or the web at large?
  • what kind of editing? is it proof reading,  customising particular sections, a step in a workflow..?
  • do you need to cater for iPads or Android tablets?  And if so, is a dedicated app on your roadmap?
  • any additional requirements you might have!

Sep 08 2014

XHTML-docx roundtrip: content tracking

There are a couple of common use cases for docx4j’s XHTML import capability:

The first is enabling a webapp with HTML reporting to output/export reports in Word’s docx format.  With docx4j, you can get really nice results doing this, especially if your XHTML has @class values which map to Word styles.

The second – to support web based editing – is the subject of this post.  In a full incarnation, the vision is:

  • be able to edit the content in Word or in the web browser (using an XHTML editor such as CKEditor)
  • track chunks of content, perhaps for workflow/approval processes, version control, or re-use

docx4j can help you with this vision in a Java or .NET (eg C#) environment.

Web based XHTML editing is well understood, so here I’ll focus on tracking chunks of content.

In XHTML, it’s straightforward.  You can add div elements (eg <div id="contentXYZ">) to your heart’s content.  And you can nest them (think book, chapter, section, sub-section).

How to track that ID to or from docx format?

The answer: content controls.

Bookmarks are another possibility, but I wouldn’t recommend them for this purpose, because it is easy for a user to delete them, or inadvertently insert extra bookmarks.  They lack the rich features of content controls (eg locking), and aren’t very “XMLy” (they are pairs of start and end point tags which create additional challenges).

So, back to content controls.

Content controls are analogous to divs.  They have IDs; you can nest them; etc.

Content controls aside, the docx file format is flat.  It’s a sequence of paragraphs and tables; the only nesting is inside tables, whose cells contain paragraphs (and possibly nested tables).

So, all we need to do is convert divs to content controls, and vice versa.

This post tells you how to do that with docx4j.

XHTML to docx (div to content control)

For XHTML to docx, you use docx4j-ImportXHTML

div to content control support was added after 3.2.0’s release, in this commit.  So for now, you need to build from source, or use a nightly build.

Once you have that, to use it, do something like:

XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
XHTMLImporter.setDivHandler(new DivToSdt());

That implementation will convert div elements to content controls, and place @id and @class values into the content control’s w:tag, for example “class=class1&id=myid”

You can extend DivToSdt with any extra functionality/logic you might require, such as locking the content control for editing/deletion.

docx to XHTML (content control to div)

The content control to div functionality has been present for a lot longer.

For that, you use docx4j to generate XHTML output in the usual way, but first you invoke SdtWriter.registerTagHandler

See the sample DivRoundtrip.java for a fully worked example of divs to content controls, then back to divs again.

The tag handler concept is to treat the content of the w:tag like an HTTP query string (key value pairs).
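As a rough illustration of the query string idea, the sketch below splits a w:tag value into key/value pairs and builds the opening div a handler keyed on class and id might emit.  It uses only the JDK; the class and method names are ours, it does no URL decoding or attribute escaping, and it is not the SdtWriter implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TagQuerySketch {

    // Splits "class=class1&id=myid" into ordered key/value pairs.
    static Map<String, String> parseTag(String tag) {
        Map<String, String> pairs = new LinkedHashMap<>();
        if (tag == null || tag.isEmpty()) {
            return pairs;
        }
        for (String pair : tag.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                pairs.put(pair, "");      // key with no value
            } else {
                pairs.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return pairs;
    }

    // Builds the opening div tag from the id and class keys, if present.
    // Assumption: values need no escaping.
    static String toDivOpenTag(Map<String, String> pairs) {
        StringBuilder div = new StringBuilder("<div");
        for (String key : new String[] {"id", "class"}) {
            String value = pairs.get(key);
            if (value != null && !value.isEmpty()) {
                div.append(' ').append(key).append("=\"").append(value).append('"');
            }
        }
        return div.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> pairs = parseTag("class=class1&id=myid");
        System.out.println(pairs);               // {class=class1, id=myid}
        System.out.println(toDivOpenTag(pairs)); // <div id="myid" class="class1">
    }
}
```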

A tag handler is registered for a specific key (eg ‘id’, ‘class’) or the wildcards (‘*’, ‘**’), and will only execute if the key is found in the w:tag.

For this example, we want our tag handler to insert a div depending on both class and id keys, so we register it as ‘*’ (we don’t want 2 handlers, which might result in 2 divs).

A tag handler with double asterisk ‘**’ will always be applied if you need that.  See the SdtWriter source code for definitive behaviour.

Sep 05 2014

C#/.NET: Import XHTML into docx without Word

How do you import HTML into a Word document without using Microsoft Word?

Ideally, you’d honour the CSS, so the Word document looks similar to the input XHTML.  Alternatively, you’d convert @class values to Word styles.

It’s a common requirement in our increasingly web-centric world.

docx4j-ImportXHTML.NET is open source (LGPL v2.1 or later), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; with very little effort, you can run it from a sample project in Visual Studio.  It’s very easy, because docx4j-ImportXHTML.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j-ImportXHTML.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the docx4j-ImportXHTML DLL, and the file src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs.  Most of the rest of the stuff comes from the docx4j dependency, which NuGet fetches.

If you have a look at ConvertInXHTMLFragment.cs, you’ll see it contains

Let’s run it, to convert that xhtml to docx content.

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in something like:

You can see there the WordML equivalent for the tail of the XHTML list we were converting.

Obviously, you can modify src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs to read your own XHTML.

A few comments.

Well formed XML! Only well formed XML works, ie XHTML, not tag-soup HTML.  If you have tag soup, it’s your responsibility to convert that to XHTML with some tidy tool.   You’ll get a SAXParseException if your input is not well formed.
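If you want to find out whether your input is well formed before handing it to the importer, you can pre-check it with the JDK’s own XML parser.  The sketch below is in Java (the same check is straightforward in C#); the class and method names are ours, and it only tests well-formedness, not whether the content is valid XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns null if the input parses as XML, otherwise a short
    // description of the first error the parser reports.
    static String firstError(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xhtml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column "
                    + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        String good = "<p>Hello<br/>world</p>";
        String tagSoup = "<p>Hello<br>world</p>";   // br is never closed
        System.out.println("good:     " + firstError(good));
        System.out.println("tag soup: " + firstError(tagSoup));
    }
}
```

The first input yields null; the second yields the parser’s complaint about the unclosed br element, which is the kind of input a tidy tool needs to fix first.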

Word styles: if the target docx contains a style matching @class, it can be used.  This’ll be the subject of a separate blog post.

Other examples: the Java repository on GitHub contains examples for reading from a file etc.  Converting these to C# is left as an exercise for the reader.  If you do that, we’d be delighted to receive a pull request on https://github.com/plutext/docx4j-ImportXHTML.NET

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in the sample code).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving XHTML import support. To implement a new feature in the XHTML import, typically you’d make the improvement to docx4j-ImportXHTML first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j-ImportXHTML is on GitHub, and is most easily setup using Maven (see earlier blog post).

Alternatives. There are a couple of projects on CodePlex you could try:

I’d be interested in feedback on how they compare.

Help/support/discussion. You can post in the docx4j XHTML import forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, xhtml etc as you think appropriate).  Please don’t cross post at both!


Sep 05 2014

docx to PDF in C#/.NET

How do you convert docx to PDF without using Microsoft Word?

If your docx is mainly text, tables and images, docx4j.NET may work well for you.  Edit (Feb 2015): if not, you may be interested in our new commercial high fidelity PDF renderer.

docx4j.NET is open source (Apache software license v2), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; you can upload your docx to the docx4j demo webapp

Or with very little effort, you can run it from a sample project in Visual Studio.  It’s very easy, because docx4j.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the file src/samples/c_sharp/Docx4NET/DocxToPDF.cs

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in “done! Press any key to continue..”

What just happened?  All being well, the sample docx “src\samples\resources\sample-docx.docx” was saved as a PDF “OUT_sample-docx.pdf” in your project directory.

You can modify src/samples/c_sharp/Docx4NET/DocxToPDF.cs to read your own test docx.

A few comments.

XSL FO; Apache FOP. docx4j creates PDF via XSL FO.  It generates XSL FO, then uses Apache FOP (v1.1) to convert the XSL FO to PDF.  FOP also supports other output formats (the subject of another blog post).

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving PDF support. To improve the quality of the PDF output, typically you’d make the improvement to docx4j first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j is on GitHub, and is most easily setup using Maven (see earlier blog post).

Help/support/discussion. You can post in the docx4j PDF output forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, pdf, fop, xslfo as you think appropriate).  Please don’t cross post at both!


May 12 2014

SQL Server Reporting Services (SSRS) emits dodgy Word docx documents

By now we’re used to products which emit docx files which are umm, not .. quite .. right.

But it’s more noteworthy when the product in question is from Microsoft.  After all, it’s their file format (ECMA etc standardisation notwithstanding).

The product in question here is SQL Server Reporting Services 2012 and its Word export.

It seems they didn’t bother to validate their documents (eg using Open XML SDK 2.0 Productivity Tool):

Apparently there’s a reason for this:

“Word and SSRS treat page headers and footers differently. Word actually positions them inside the page margins, whereas SSRS positions them inside the area that the margins surround. As a result, in Word, the page margins do not control the distance between the top edge of the page and that of the page header (or similarly for the page footer). Instead, Word has separate “Header from Top” and “Footer from Bottom” properties to control those distances. Since RDL does not have equivalent properties, the Word renderer sets these properties to zero.”

But the problem is that it is actually setting them to blank (as opposed to zero), which is not valid.

Another problem:

JAXB doesn’t like invalid documents, so docx4j has to fix these sorts of things before it can construct a content model.  (Maybe that’s why SSRS calls it Word export, not docx export:- they just check Word can open the document, then call it job done)

There are other problems with SSRS docx which the Productivity Tool doesn’t report.

Take a look at the styles part:

Notice anything wrong?  It’d be better if the EmptyCellLayoutStyle had @w:styleId and @w:type, like so:

It’d also be nice if it defined the “Normal” style it is basedOn!
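A consumer can spot these styles part problems with a few lines of DOM code.  The sketch below reports styles lacking @w:styleId or @w:type, and basedOn references to styles which aren’t defined.  It uses only the JDK; the class name and the sample XML are ours (the sample is a guess at the shape of the SSRS output, not a copy of it), and this is not how docx4j does it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StylesPartCheck {

    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical reconstruction of a problematic style definition.
    static final String SAMPLE = "<w:styles xmlns:w='" + W + "'>"
        + "<w:style><w:name w:val='EmptyCellLayoutStyle'/>"
        + "<w:basedOn w:val='Normal'/></w:style></w:styles>";

    static List<String> problems(String stylesXml) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            doc = f.newDocumentBuilder()
                   .parse(new InputSource(new StringReader(stylesXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse styles part", e);
        }
        NodeList styles = doc.getElementsByTagNameNS(W, "style");

        // First pass: collect the style ids which are actually defined.
        Set<String> definedIds = new HashSet<>();
        for (int i = 0; i < styles.getLength(); i++) {
            String id = ((Element) styles.item(i)).getAttributeNS(W, "styleId");
            if (!id.isEmpty()) {
                definedIds.add(id);
            }
        }

        // Second pass: report missing attributes and dangling basedOn.
        List<String> found = new ArrayList<>();
        for (int i = 0; i < styles.getLength(); i++) {
            Element style = (Element) styles.item(i);
            String label = label(style, i);
            if (style.getAttributeNS(W, "styleId").isEmpty()) {
                found.add(label + ": missing w:styleId");
            }
            if (style.getAttributeNS(W, "type").isEmpty()) {
                found.add(label + ": missing w:type");
            }
            NodeList basedOn = style.getElementsByTagNameNS(W, "basedOn");
            if (basedOn.getLength() > 0) {
                String parent = ((Element) basedOn.item(0)).getAttributeNS(W, "val");
                if (!definedIds.contains(parent)) {
                    found.add(label + ": basedOn undefined style '" + parent + "'");
                }
            }
        }
        return found;
    }

    // Uses the w:name value if there is one, else the style's position.
    static String label(Element style, int index) {
        NodeList names = style.getElementsByTagNameNS(W, "name");
        if (names.getLength() > 0) {
            return ((Element) names.item(0)).getAttributeNS(W, "val");
        }
        return "style #" + index;
    }

    public static void main(String[] args) {
        for (String problem : problems(SAMPLE)) {
            System.out.println(problem);
        }
    }
}
```

Reporting is the easy half; a consumer still has to decide how to degrade gracefully, for example by treating an undefined basedOn as if it were absent.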

docx4j and other consumers could/should detect such problems and degrade gracefully in the face of them, but Microsoft (of all companies!) should exercise better quality control.