Archive for the ‘Uncategorized’ Category

Aspose.Confusion in Words

September 26th, 2015 by Jason

In June, Aspose’s Shoaib Khan published a blog post purporting to cover features available in Aspose.Words for Java but not docx4j.

It is either breathtaking or amusing in its inaccuracy, depending on whether you think it was born of deceit or ineptitude.  Either way, it’s a caution to anyone considering drinking the Aspose.Kool-Aid!

Here I’ll go through his claims one by one.

As a general comment though, it is worth remembering that with docx4j, you can do pretty much anything the docx/pptx/xlsx file formats allow.  If docx4j doesn’t have a high level API for something you want to do, you can always implement it yourself, thanks to docx4j’s lower level JAXB-based APIs.  And docx4j is real ASLv2 open source, so you can use the source Luke!

Without further ado…

Set Page Borders.
Here Aspose seems to be talking about section properties (ie margins etc).

Shoaib implies you can’t control these in docx4j.  Of course you can!  You can add or remove sections, or modify the settings of an existing section.

Track Changes in Documents.
Their AcceptAllRevisions method is said to be similar to Word’s “Accept All Changes”.

Docx4j doesn’t provide a high level API for doing this (since users haven’t asked for it), but you could implement it yourself easily enough using XSLT or docx4j’s TraversalUtil.  You could start with this XSLT.
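To give a feel for the XSLT route, here is a minimal sketch using only the JDK’s built-in XSLT support.  It is not the stylesheet linked above, and the class and method names are made up for illustration.  It assumes the simplest case only: w:ins wrappers are unwrapped and w:del elements are dropped.  A real "accept all" would also need to deal with moves, formatting changes and inserted/deleted paragraph marks.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class AcceptRevisionsSketch {

    // Hypothetical, deliberately simplified stylesheet.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
        // Identity template: copy anything not matched more specifically below.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Accept an insertion: keep the content, drop the w:ins wrapper.
        + "<xsl:template match='w:ins'><xsl:apply-templates/></xsl:template>"
        // Accept a deletion: drop the w:del element and everything in it.
        + "<xsl:template match='w:del'/>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to a WordprocessingML string and returns the result. */
    static String acceptAll(String wordMl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(wordMl)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String p = "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
                 + "<w:ins w:id='1'><w:r><w:t>new</w:t></w:r></w:ins>"
                 + "<w:del w:id='2'><w:r><w:delText>old</w:delText></w:r></w:del>"
                 + "</w:p>";
        System.out.println(acceptAll(p));
    }
}
```

In a real project you would run something like this over the main document part’s XML, rather than over a hand-written string.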

Using Control Characters.
This example is a bit bizarre, because in a docx, specific elements w:br and w:cr are used for line breaks;  OpenXML follows the usual XML rules for whitespace.

Split Tables.
This example shows the steps a user would follow to split a table into two.  Basically, clone the existing table to make a new table with the same properties, then move rows from the first table to the second.

Of course you can do the same thing in docx4j!
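The steps themselves are simple enough to sketch.  The code below deliberately avoids docx4j and uses only the JDK’s DOM API on a WordprocessingML fragment, so it runs standalone; with docx4j you would do the equivalent on its JAXB objects.  The class and method names are hypothetical, the w prefix is assumed, and the empty paragraph between the tables is there on the assumption that Word would otherwise display two directly adjacent tables as one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SplitTableSketch {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Namespace-aware parse of an XML string. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Moves the rows from index firstRowOfSecond onwards into a new table
     * placed after tbl, and returns that new table.
     */
    static Element split(Element tbl, int firstRowOfSecond) {
        // Shallow clone of w:tbl, then copy everything which isn't a row
        // (w:tblPr, w:tblGrid), so the new table has the same properties.
        Element second = (Element) tbl.cloneNode(false);
        List<Node> rows = new ArrayList<Node>();
        for (Node n = tbl.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (W.equals(n.getNamespaceURI()) && "tr".equals(n.getLocalName())) {
                rows.add(n);
            } else {
                second.appendChild(n.cloneNode(true));
            }
        }
        // appendChild moves a node, so this removes the rows from the first table.
        for (int i = firstRowOfSecond; i < rows.size(); i++) {
            second.appendChild(rows.get(i));
        }
        // Insert an empty paragraph, then the new table, after the original.
        Node parent = tbl.getParentNode();
        Node next = tbl.getNextSibling();
        parent.insertBefore(tbl.getOwnerDocument().createElementNS(W, "w:p"), next);
        parent.insertBefore(second, next);
        return second;
    }
}
```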

Repeat Table Header Rows on Pages.
This example is just about setting the header row property.  See http://stackoverflow.com/questions/14605264/repeat-table-header

Clone Documents.
Cloning a document.  Aspose suggest you can’t do this with docx4j? WTF?! OpcPackage’s clone() method has been there since 2.7.1

docx4j also includes code for making partial copies, where less than a full clone is required.

Moving the Cursor in Document.
OpenXML is of course XML, which is hierarchical.  docx4j uses JAXB to give you an object model representation of that.  The hierarchical structure is basically nested lists.  Lots of stuff boils down to finding a position in a list, and then inserting or deleting etc using the Java collections API.  To find that list/position, you’d typically use docx4j’s powerful traversal functionality.   Or you can use XPath (in JAXB, the objects are bound to the underlying XML).

Aspose has some notion of cursor position, so you can move to the start or end of the document.  This may appeal to people with a VBA background, but in practice it is of little use.

Protect Documents.
Whilst it is true that in docx4j 3.2.0 there was no high level API for the functionality Microsoft Word groups under Protect Document (Mark as Final, Encrypt with Password, Restrict Editing etc), this is available now in the 3.3.0 previews:

Working with Digital Signatures.
As with the other Protect Document features, there was no high level API in 3.2.0.  That’s not to say you couldn’t do it, but there’s a nice API for this in the commercial Enterprise Ed. (forthcoming v3.3.0)

Check Format Compatibility.
This seems to be restricted to knowing what type of document you are working with; Aspose says it doesn’t validate the file format.

Per the specs, an OPC package has a content type.  In docx4j, docx/dotx/docm etc are all represented by WordprocessingMLPackage, but you can distinguish between them by calling getContentType(); the value will be one of:

  • WORDPROCESSINGML_DOCUMENT
  • WORDPROCESSINGML_DOCUMENT_MACROENABLED
  • WORDPROCESSINGML_TEMPLATE
  • WORDPROCESSINGML_TEMPLATE_MACROENABLED

Load Text File.
Whoopey do.  Apparently you can import plain text using Aspose’s expensive software!

Of course that is trivial with docx4j.  Docx4j also supports converting altChunks to native WordML content.  For XHTML altChunks, you need docx4j-ImportXHTML;  support for altChunks of type docx is an Enterprise level feature.

Specify Default Fonts.
The way the default font works in WordprocessingML is more complicated than you might expect, in that there are a few different ways you could affect it (via the theme part or the styles part).

That said, in real life, this doesn’t tend to be a problem.  With docx4j, you can easily set  w:docDefaults/w:rPrDefault/w:rPr/w:rFonts in your styles part, if you want to.

Working with Tables:
Autofit Setting to Tables.
This is just about the tblLayout setting: w:tblPr/w:tblLayout/@w:type, which you can access via TblPr’s get/setTblLayout

Joining Tables in Document.
I can’t recall anyone ever asking for docx4j to provide a high level API to do this, but it could be added.  In the meantime, docx4j allows you to do anything with tables which the file format allows, including joining tables.

Mail Merge
Mail Merge from XML Data Source.
docx4j provides a high level API for working with legacy MERGEFIELD fields.

If you wanted to fill those fields with data from XML, you could do that easily enough.

Where docx4j really shines though, is in its support for content control data binding.  In that approach, introduced by Microsoft in 2007, you have a bidirectional XPath mapping between content controls in the document, and an XML file.

If you are working with XML, and not forced to work with legacy MERGEFIELDs for some reason, content control data binding is the way to go.

Off topic: Eclipse’s maven from a command line?

June 16th, 2015 by Jason

You’ve installed Eclipse.

Eclipse includes maven (m2e).

Can you use that Maven from outside Eclipse, or do you need to install maven again/separately?

It turns out you can use it.  Whether it’s worth the effort or not is another question…

You launch maven using the plexus classworlds launcher.

That needs a config file.

The config file (call it m2.conf) contains something like:

main is org.apache.maven.cli.MavenCli from plexus.core

set maven.home default /home/jason/.m2

[plexus.core]
load /home/jason/eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*.jar

With that, from a project dir, the following is the equivalent of ‘mvn install’:

java -cp ../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/plexus-classworlds-2.5.1.jar:../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar "-Dclassworlds.conf=m2.conf" org.codehaus.plexus.classworlds.launcher.Launcher install

You could make a shell script to do that, and you could base it on the one included in the Maven distribution.  Or more sensibly, you’d just download, install and use Maven proper.

If you need it, that is.  Right-clicking on your project in Eclipse and choosing Run As gives you a handy UI for Maven-related tasks.

Finally, note it is possible to avoid the plexus classworlds launcher by invoking MavenCli directly:

java -cp "../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar:../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*" org.apache.maven.cli.MavenCli install

If you are going to use that, you might consider setting maven.home, by adding -Dmaven.home=/your/path/to/where immediately after the classpath.

Content controls for business data connectivity

January 20th, 2015 by Jason

Sometimes, Word is a natural way for people to interact with back end applications (eg SAP).

This is particularly so when:

  • business data will be output to a Word document,
  • the user is more familiar with Word than the other system,
  • certain data updates may be required (and are permitted)

So maybe there are 4 high level categories:

  • apps which support commercial transactions (a recipient will receive a Word document), eg
    • employment onboarding (letter of employment)
    • invoicing
  • apps with a reporting component:  is the report format a natural interface if it was made bi-directional?
  • workflow/BPM systems which present documents (work orders, proposals, approvals etc)
  • policy/procedures in regulated industries, where a worker must follow a series of steps.  Can present that in a docx; the worker can tick the steps off as they do them
    • related training scenarios?

Microsoft had an emphasis on what they then called “Office Business Applications” back around the Office 2007 launch.  Fast forward to today, and “business connectivity services” are part of SharePoint.

But you can achieve the same sort of thing without SharePoint, using docx4j and data bound content controls.

Once you have your back end data in an XML format (and there are many tools/techniques to help with this), you can use content controls to bind what the user sees in Word to elements in that XML.

The beauty of it is that the binding is bi-directional, so if the user edits the document, the XML is updated (ie it stays in synch).
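To make the XPath mapping concrete: each bound content control carries an XPath pointing at a node in the XML part, and Word reads from and writes to that node.  The sketch below is neither Word nor docx4j code; it just uses the JDK’s XPath API to show a value being read through such an XPath, and then written back the way a user’s edit would be.  The invoice XML, and the class and method names, are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BindingSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** XML to document direction: the value a bound content control would display. */
    static String read(Document data, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, data);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Document to XML direction: what happens to the data when the user edits the control. */
    static void write(Document data, String xpath, String value) {
        try {
            Node n = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, data, XPathConstants.NODE);
            if (n == null) {
                throw new IllegalArgumentException("Nothing found at " + xpath);
            }
            n.setTextContent(value);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Document data = parse("<invoice><customer><name>Joe Bloggs</name></customer></invoice>");
        String xpath = "/invoice/customer/name";   // the binding's XPath
        System.out.println(read(data, xpath));     // value shown in the content control
        write(data, xpath, "Jane Bloggs");         // user edits the content control
        System.out.println(read(data, xpath));     // the XML has followed the edit
    }
}
```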

After the user has made their edit, you can update the back end application.  Typically, you’d do this after they saved & closed the document (ie outside Word, using docx4j), but you could also do it from within Word (a less good approach, but still, an option).

What if there is some data which the user shouldn’t be able to edit?  You simply lock the content control to prevent editing.

To quickly try out this approach, put together some sample XML, then upload it as explained here, to get a docx you can experiment with.

We’d love to hear how you might use this approach.

I have my XML, now what?

January 20th, 2015 by Jason

A barrier to using  content control data binding has been getting a feel for what the solution might look like.  You have your XML data, but what do you do next?

The audience for this post is broader than docx4j users; it’s for anyone wanting to easily set up a template docx using XML data binding, just to get a feel for what it is like from an authoring/Word perspective.

How do you add content controls to your Word document, and map them to the XML data?

For docx4j purposes, you use an OpenDoPE Word AddIn (of which there are two you can try).

Alternatively, Word 2013 introduced the XML mapping task pane (and some additional file format features).  The docx linked in my previous blog post highlighted some usability issues with the XML mapping pane.

This post presents a “fast start” way you can try.  Simply upload your XML file here, and it will give you back a docx, with content controls mapped to that sample XML.

To take the following XML for an invoice as an example:

You’d get back the following Word 2013 docx (note that Word 2010 silently strips the repeats):

In “design mode”, it looks like:

Whether you’re in design mode or not, you can edit the content of a content control.  To satisfy yourself that things are working correctly, save the docx, unzip it, and look for your altered data in /customXml/item1.xml

The idea is that once you have the basic docx, you can quickly/easily edit it in Word to make it prettier:

The tool recognises and uses the following content control types:

  • repeats: automatically detect repeats and set up suitable content control structures (we use a table if there’d be more than one column) fully populated with the XML data
    • note that these are Word 2013 repeatingItem structures, not OpenDoPE repeats.  (We recommend you use OpenDoPE repeats, but the tool creates Word 2013 repeatingItems, so you can see what Word 2013 users get out of the box).
  • pictures/images: will use a picture content control if the field contains base64 encoded data, or the string PICTURE or string IMAGE
  • checkbox: generated if value is true or false
  • escaped Flat OPC XML: a bound rich text content control will be inserted if the element contains the string FLAT-OPC or the string <?mso-application progid="Word.Document"
  • date control: used if the element content is the string DATE

Have fun.

Word 2013 repeatingSection content controls – ready for prime time?

January 17th, 2015 by Jason

For developers wondering about Microsoft’s commitment to content controls, Office 2013 was certainly good news.

In Microsoft’s “What’s new for Word 2013 developers”, 2 of the 4 items were about content controls:

  • Enhancements to content controls
  • UI for XML mappings

And in http://blogs.office.com/2012/10/17/top-5-reasons-developers-will-love-word-2013/ 4 of the 5 reasons relate to content controls!

The MSDN article “What’s new with content controls in Word 2013” describes the changes in detail, but one of them was the introduction of repeating section content controls, which are comparable to OpenDoPE repeats (which docx4j has supported for ages).

The question is whether the time is ripe to migrate from OpenDoPE repeats to Word 2013 repeatingSection content controls?

My suggested answer is “no, or at least, not yet”, because

  1. Word 2010 strips out Word 2013 repeating content controls, and does so without warning!  Compare OpenDoPE repeats, which work in Word 2007, 2010 and 2013.  So until Word 2010 becomes irrelevant (or support is back ported), Word 2013 repeating content controls can’t be used in a generic solution.
  2. Word 2013 doesn’t handle the case of repeat zero times as you’d expect; it leaves a single instance, which will cause problems in many applications.

For authoring, the XML Mapping Pane in Word 2013 also leaves a bit to be desired.  For more details, please see w15RepeatingSection_cf_OpenDoPE.docx

Even so, docx4j 3.2.2 will support processing Word 2013 repeating content controls, for those who still choose to use them.

XHTML-docx roundtrip: content tracking

September 8th, 2014 by Jason

There are a couple of common use cases for docx4j’s XHTML import capability:

The first is enabling a webapp with HTML reporting to output/export reports in Word’s docx format.  With docx4j, you can get really nice results doing this, especially if your XHTML has @class which map to Word styles.

The second – to support web based editing – is the subject of this post.  In a full incarnation, the vision is:

  • be able to edit the content in Word or in the web browser (using an XHTML editor such as CKEditor)
  • track chunks of content, perhaps for workflow/approval processes, version control, or re-use

docx4j can help you with this vision in a Java or .NET (eg C#) environment.

Web based XHTML editing is well understood, so here I’ll focus on tracking chunks of content.

In XHTML, it’s straightforward.  You can add div elements (eg <div id="contentXYZ">) to your heart’s content.  And you can nest them (think book, chapter, section, sub-section).

How to track that ID to or from docx format?

The answer: content controls.

Bookmarks are another possibility, but I wouldn’t recommend them for this purpose, because it is easy for a user to delete them, or inadvertently insert extra bookmarks.  They lack the rich features of content controls (eg locking), and aren’t very “XMLy” (they are pairs of start and end point tags which create additional challenges).

So, back to content controls.

Content controls are analogous to divs.  They have IDs; you can nest them; etc.

Content controls aside, the docx file format is flat.  It’s a sequence of paragraphs and tables.  The only nesting is inside tables, where cells contain paragraphs (and nested tables).

So, all we need to do is convert divs to content controls, and vice versa.

This post tells you how to do that with docx4j.

XHTML to docx (div to content control)

For XHTML to docx, you use docx4j-ImportXHTML

div to content control support was added after 3.2.0’s release, in this commit.  So for now, you need to build from source, or use a nightly build.

Once you have that, to use it, do something like:

XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
XHTMLImporter.setDivHandler(new DivToSdt());

That implementation will convert div elements to content controls, and place @id and @class values into the content control’s w:tag, for example “class=class1&id=myid”

You can extend DivToSdt with any extra functionality/logic you might require, such as locking the content control for editing/deletion.

docx to XHTML (content control to div)

The content control to div functionality has been present for a lot longer.

For that, you use docx4j to generate XHTML output in the usual way, but first you invoke SdtWriter.registerTagHandler

See the sample DivRoundtrip.java for a fully worked example of divs to content controls, then back to divs again.

The tag handler concept is to treat the content of the w:tag like an HTTP query string (key value pairs).

A tag handler is registered for a specific key (eg ‘id’, ‘class’) or the wildcards (‘*’, ‘**’), and will only execute if the key is found in the w:tag.

For this example, we want our tag handler to insert a div depending on both class and id keys, so we register it as ‘*’ (we don’t want 2 handlers, which might result in 2 divs).

A tag handler with double asterisk ‘**’ will always be applied if you need that.  See the SdtWriter source code for definitive behaviour.
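The query string idea is easy to picture in code.  The sketch below is not docx4j’s SdtWriter implementation; the class and method names are hypothetical, and it assumes plain key=value pairs separated by ampersands, with no URL-decoding or escaping.  It parses a w:tag value such as "class=class1&id=myid" and builds the opening div tag a handler might emit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TagQuerySketch {

    /** Splits a w:tag value like "class=class1&id=myid" into key/value pairs. */
    static Map<String, String> parse(String tag) {
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        if (tag == null || tag.isEmpty()) {
            return pairs;
        }
        for (String pair : tag.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                pairs.put(pair, "");    // a key with no value
            } else {
                pairs.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return pairs;
    }

    /** Builds one opening div from both keys, which is why a single handler is enough. */
    static String divOpenTag(Map<String, String> pairs) {
        StringBuilder sb = new StringBuilder("<div");
        if (pairs.containsKey("id")) {
            sb.append(" id=\"").append(pairs.get("id")).append('"');
        }
        if (pairs.containsKey("class")) {
            sb.append(" class=\"").append(pairs.get("class")).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(divOpenTag(parse("class=class1&id=myid")));
    }
}
```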

docx to PDF in C#/.NET

September 5th, 2014 by Jason

How to convert docx to PDF without using Microsoft Word?

If your docx is mainly text, tables and images, docx4j.NET may work well for you.  Edit (Feb 2015): if not, you may be interested in our new commercial high fidelity PDF renderer.

docx4j.NET is open source (Apache software license v2), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; you can upload your docx to the docx4j demo webapp

Or with very little effort, you can run it from a sample project in Visual Studio.  It’s very easy, because docx4j.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the file src/samples/c_sharp/Docx4NET/DocxToPDF.cs

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in “done! Press any key to continue..”

What just happened?  All being well, the sample docx “src\samples\resources\sample-docx.docx” was saved as a PDF “OUT_sample-docx.pdf” in your project directory.

You can modify src/samples/c_sharp/Docx4NET/DocxToPDF.cs to read your own test docx.

A few comments.

XSL FO; Apache FOP. docx4j creates PDF via XSL FO.  It generates XSL FO, then uses Apache FOP (v1.1) to convert the XSL FO to PDF.  FOP also supports other output formats (the subject of another blog post).

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in  DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving PDF support. To improve the quality of the PDF output, typically you’d make the improvement to docx4j first (ie the Java version), then create a new DLL using the ant build target dist.NET.  docx4j is on GitHub, and is most easily set up using Maven (see earlier blog post).

Help/support/discussion. You can post in the docx4j PDF output forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, pdf, fop, xslfo as you think appropriate).  Please don’t cross post at both!


docx4j and Google Drive

March 16th, 2014 by Jason

Given the news this week about Google lowering prices per GB on Google Drive, I thought it would be timely to explore interop with docx4j.

https://github.com/plutext/docx4j-cloud-GoogleDrive is a small project which demonstrates:

Clone the project, and set it up using Maven in your IDE.  I’m not going to tell you how to do that.
Enabling the Drive API
From there, it is fairly straightforward  (assuming you have a Google account); you just need to enable the Drive API: set up a project and application in the Developers Console:
  • press the red “CREATE NEW CLIENT ID” button, then choose application type “Installed Application”; I then chose subtype “Other”
  • hit the “Download JSON” button; save it as client_secret.json in your project dir

Run our code

OK, now try running Docx4jUploadToGoogleDrive

It ought to say something like:

Please open the following URL in your browser then type the authorization code:
https://accounts.google.com/o/oauth2/auth?access_type=online&client_id=622239…

Paste the auth code into your IDE’s console (System.in, probably the same place which displayed the above message) then press enter.  If you aren’t logged into your Google account in your browser, it’s at this point that you’ll be asked to log in.

The code will create a new docx file, and after uploading it, if successful, report the File ID allocated by Google Drive:

File ID: 0CyHdofN18p16OF9YWWNFUFdmTjg

The other 2 samples require you to provide an auth code the same way (each time you run them).  Obviously, you’d be more sophisticated than this in a production application.  See further https://developers.google.com/drive/web/about-auth

docx4j 3.0 and Maven

November 28th, 2013 by Jason

blog/2011/10/hello-maven-central/ walks you through the basics of using docx4j in an Eclipse project with the help of m2eclipse.

This post is about the different ways you can set up docx4j 3.0 with the help of Maven.

We’ll be using the following skeleton pom.xml:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
		
	<groupId>your.group</groupId>
	<artifactId>your.artifact</artifactId>
	<name>nameless</name>
	<version>0.0.1-SNAPSHOT</version>
	<description>
		some description
	</description>


	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-dependency-plugin</artifactId>
				<version>2.0</version>
			</plugin>
		</plugins>
	</build>
	
	<dependencies>
	
		<!-- dependencies go here -->

				 	
	</dependencies>

</project>

Adding the core dependency

To use docx4j, including its LGPL XHTML import capability, just include the following dependency in your pom.xml:


		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-ImportXHTML</artifactId>
			<version>3.0.0</version>
		</dependency>

That’ll drag in docx4j, and all the other dependencies (you should be able to see them in Eclipse under Maven Dependencies, or by running mvn dependency:tree at a command prompt).
If you don’t want the XHTML import stuff, just use:

		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j</artifactId>
			<version>3.0.0</version>
		</dependency>

(You should consider adding a docx4j.properties to your classpath)
Logging
Both of the above default to using log4j.  If you are happy with log4j, you’ll want a log4j.xml file unless you already have it on your classpath.  If you don’t, you can configure https://github.com/plutext/docx4j/blob/master/src/samples/_resources/log4j.xml to suit.
If you want to use something other than log4j for logging, well you can, since docx4j uses slf4j.
First you need to exclude the log4j stuff.

		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-ImportXHTML</artifactId>
			<version>3.0.0</version>
			<exclusions>
				<exclusion>
					  <groupId>org.slf4j</groupId>
					  <artifactId>slf4j-log4j12</artifactId>
				</exclusion>
				<exclusion>
					<groupId>log4j</groupId>
					<artifactId>log4j</artifactId>				
				</exclusion>
			</exclusions>
		</dependency>

Then you add in the dependencies for your other logging frameworks.   See further http://www.slf4j.org/ and slf4j in search.maven.org
JAXB
docx4j relies very heavily on JAXB.  With Java 6 or 7, usually it’ll use the JAXB included in that (though things can be different with application servers – see the deployment forums for details).
The point here is that there is an alternative JAXB implementation, called EclipseLink MOXy (see http://www.eclipse.org/eclipselink/moxy.php), which is very well supported by its developers.  You can try it with docx4j.  To do so, just include the following additional dependencies:


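		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-MOXy-JAXBContext</artifactId>
			<version>3.0.0</version>
		</dependency>

		<dependency>
			<groupId>org.eclipse.persistence</groupId>
			<artifactId>org.eclipse.persistence.moxy</artifactId>
			<version>2.5.1</version>
		</dependency>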

Since using MOXy with docx4j is all quite new, you may run into some minor issues.  If you do, please let us know in the docx4j forums (with sufficient info for us to reproduce what you are seeing!).  Thanks.

docx4j 3.0 released

November 26th, 2013 by Jason

On behalf of everyone who has contributed to docx4j, Plutext is pleased to announce that version 3 was released today.

You can get it from Maven Central, or from http://www.docx4java.org/docx4j/ (the jar, the dependencies, or everything including documentation zipped up)

Source code is available at GitHub or from the Maven Central link above.  Javadoc is at Maven Central.

For what you need to know about docx4j 3.0, please see this post.

The XHTML Import stuff is now a separate project (since it and its dependencies are LGPL, not ASLv2 like docx4j).

  • the three jars you need (docx4j-ImportXHTML, xhtmlrenderer, and iText) are included for convenience in the zip file above.  You can delete them if you don’t need or want XHTML import.
  • or you can get it from Maven Central

docx4j 3.0 uses slf4j for logging.  For convenience, log4j is the default implementation.  A follow-up post will explain more about logging config.

Thanks to everyone who has helped to make this release our best yet!

If you have questions pertaining to the use of docx4j, please post them in our forum, or on StackOverflow (rather than in comments to this post).