Mar 12 2018

Scaling the PDF Converter with AWS Fargate

This is a walkthrough of deploying the PDF Converter on Amazon’s FarGate.

What is Fargate?  New since November 2017,  its an easy way of deploying containers on AWS ECS.  You don’t have to manage the underlying EC2 instances, and the wizard takes care of the setup, so you can be up and running in less than 20 mins!

With FarGate, you make a “cluster” which you can easily size to suit a known conversion volume, or have it auto-scale with load.  Largely thanks to Docker!

This walkthrough assumes you already have an AWS login.

To getting things working:

  1. there’s 4 steps in Amazon’s firstRun wizard: https://console.aws.amazon.com/ecs/home?region=us-east-1#/firstRun
  2. then you configure the health check path

But first, check things are configured correctly for ECS in your Amazon account.  Since FarGate currently only works in N.Virginia, visit https://console.aws.amazon.com/ecs/home?region=us-east-1#/getStarted

ECS FirstRun Wizard

If you don’t already see the “Getting Started” wizard pictured below, click https://console.aws.amazon.com/ecs/home?region=us-east-1#/firstRun (this is easier than “create new cluster” at https://console.aws.amazon.com/ecs/home?region=us-east-1#/clusters/create/new since it also creates a Service and Task, but more importantly, your load balancer).

fargate-firstrun-step1

In the “Container definition” section, click the “configure” button on the “custom” image.

Type the following in image: plutext/plutext-document-services:2.1-0, and set the other values as per the image below:

 

container-settings-dockerhub

Next, in “Task definition”, edit the task definition name, to say: pds-task-definition

Click next.

Service

On the “service” screen, click “edit” to set the number of tasks to 2, and choose “Application Load Balancer”.service

Click next.

Cluster

On this screen, just change the cluster name to: plutext-document-services

When you click next, the review screen should show:

review

Click “Create”.  The wizard will perform various tasks; it might take 3 or 4 mins.

When it is done, you should see:

preparing-service

Click the “view service” button.

Health Check

You need to set the health check path in your load balancer.  (Unfortunately, FarGate currently doesn’t populate this from the HEALTHCHECK statement in your Dockerfile)

So in your cluster, click your service, where you’ll see the load balancer target group:

cluster-service

 

Click that.

Now, you’re in your load balancer, where you can click “edit health check” and enter path:  /v1/00000000-0000-0000-0000-000000000000/ping

Result should be:

health-check

Before you go back to your service, click on the load balancer itself, and make a note of its DNS name.   You’ll see the host name there in the basic configuration:

alb-hostname

 

Now if you go back to your service, on the “tasks” tab, you should see:

tasks-status

ie “RUNNING”

Try it out!

To convert a document, you need the DNS host name of the load balancer you made a note of above.  Now you can test with something like:

curl -v -X POST –data-binary @HelloWorld.docx -o out.pdf http://EC2Co-EcsEl-1N1ULP12K5TGG-2127307716.us-east-1.elb.amazonaws.com:80/v1/00000000-0000-0000-0000-000000000000/convert

Check for “200 OK” and try opening out.pdf.

Next steps

In our next post, we’ll configure HTTPS, and in the one after that, we’ll add a license key.

Jun 30 2017

JAXB RI new home on the web

https://github.com/javaee/jaxb-v2 is JAXB’s new home (jaxb.java.net now redirects there).

Issue tracker is at https://github.com/javaee/jaxb-v2/issues (though it seems some existing issues didn’t get migrated over, for example https://github.com/gf-metro/jaxb/issues/22)

I’m not sure where you are supposed to get official binary releases from; Maven I guess?

http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.sun.xml.bind%22%20AND%20a%3A%22jaxb-ri%22

http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22javax.xml.bind%22%20AND%20a%3A%22jaxb-api%22

Discussion group is apparently at https://javaee.groups.io/g/metro, but first you need to join the parent group https://javaee.groups.io/g/javaee

https://javaee.github.io/ has details of other the projects migrated from Java.net to GitHub

Jul 25 2016

Tool review: Merging Word documents on the desktop

Sometimes Microsoft Word users need to join several Word documents into a single file, without loss of formatting.

An example would be a cover letter, a quote, and a contract.

Or a proposal or contract, plus appendices.

(In legal industry parlance the finalised collection of documents is called a “closing binder”, “electronic bundle” or less commonly, a deal bible.  And its usually in PDF…)

Word itself doesn’t do this for you.  So this blog post is a review of tools you might download/install to try to get the job done.

TLDR: I don’t want to be negative, but bottom line is they won’t help … the 5 tools I found and tested each do a poor job at this.  If you, dear reader, know of a better tool, please share in a comment!   Most people seem to convert their documents to PDF first, then merge the PDFs – for good reason!

By way of background/disclosure, our Docx4j Enterprise product is good at this (if I may say so myself), but we don’t sell it to end users.  This background allowed me to make a couple of very simple documents to test the products.

Without further ado, the 5 products I tested were:

My first test was to merge 2 documents which define Heading 1 differently.  Does the merged document keep the distinct appearance of the 2 headings?

The only product which was able to pass this basic test was Icestand’s.

Unfortunately Icestand failed my second test: it lost section formatting (page orientation, headers/footers).

And my third test: list numbering.  If doc 1 contains “1,2,3” and so does doc 2, in the merged output, do you get “1,2,3  1,2,3” or “1,2,3 4,5,6”?

Since all the others failed test 1, I didn’t subject them to tests 2 or 3.  But I found:

  • most/all of these programs require Word to be installed.  That’s probably OK for a desktop utility.
  • with the exception of kutools, which appears in Word’s ribbon, each presents as standalone/free-standing software with its own UI.
  • you’d expect to be able to arrange your input documents in the order you want, but 3steps lacked even that!

Since these products all lose formatting, unless there’s a better product out there somewhere, you’d get better results by converting to PDF, then merging the PDF files.   In this respect, bundledocsCaseLines and others look interesting; they do the PDF conversion as well, removing a step from the process.

Using PDF might be fine, if the deal is done.  And interestingly, in some cases, PDF might even be a requirement.  For example, the UK Supreme Court “recommends” it.

But if the documents are still being finalised, Word is preferable, since its an editing format. Anyone tried converting the resulting merged PDF file back to Word?!

 

 

May 11 2016

Docx4j Helper Word AddIn: new version v3.3.0

We’ve just published a new version of our Helper AddIn for Word; you can download it from  http://www.plutext.com/dn/downloads/1472868282152/Docx4j_Helper_3-3-1-03.exe (link updated 3 Sept 2016).

Here’s what it looks like:

helper-ribbon

Its new features:

  • code generation from your selection (no need to use the webapp interface)
  • generate PDF output using Plutext’s commercial PDF Converter (either a locally installed instance, or http://converter-eval.plutext.com)
  • document sanitisation/anonymisation so you can safely publish it for support purposes
  • easy docDefaults manipulation

I’ll run through these one by one.

code generation from your selection

With the old version, you launched a local version of the docx4j webapp to generate code.

You can still do that (click the Load Helper button, then Parts List); its a useful way to see all the parts in your docx.

With this version, if you click the “Inspect selection” button, you’ll see the corresponding XML:

editor-hello-world

What’s new is the “Java” button.  If you click that, it’ll generate corresponding code, and display that in your web browser:

code-part1

:

code-part2

 

generate PDF output

The “PDF” button generates a PDF.  Either of your whole document, or your selection.

It uses our commercial PDF Converter so its probably most useful:

  1. if you want to evaluate that, or
  2. have found something which doesn’t convert correctly, and want technical support

It can use either a locally installed instance, or http://converter-eval.plutext.com

You configure that with the “Config” button.

The generated PDF will open in whatever Windows opens PDFs with for you.

document sanitisation/anonymisation

The idea of the “anonymise” button is to make it easy for you to email/publish a docx (or PDF) for support purposes, without giving away sensitive info.

Again, if nothing is selected, it’ll do your whole docx.  Otherwise, just your selection.

The results will be saved to a temporary local docx (so your source docx is unaffected), then opened.

If there is anything potentially sensitive the code can’t remove, it’ll let you know.

The code which does this is at https://github.com/plutext/docx4j/tree/master/src/main/java/org/docx4j/anon

easy docDefaults manipulation

Its sometimes useful to see/edit your doc defaults.

docdefaults

Your changes will open in a new temp docx.

For example, try changing font size (w:sz) from 22 to 42.

You can also make changes in Word’s Paragraph and Font dialogs by pressing the “Set as Default” button, then look here to see how that translates into XML.  Without this, its hard to be sure whether you’ve changed your default styles, or docDefaults!

Feedback and Comments

 

Any feedback, comments, requests for new features etc, please post in http://www.docx4java.org/forums/docx4jhelper-addin-f30/   Alternatively, Plutext customers can email support.

Apr 13 2016

PDF/A-2b compliant Word to PDF

Plutext’s commercial PDF Word/docx Converter now produces fully PDA/A-2b compliant PDF output.

We say this having tested its output using http://verapdf.org/ “a purpose-built, open source, file-format validator covering all PDF/A parts and conformance levels.”

You can try our PDF Converter now, at http://converter-eval.plutext.com/

Sep 26 2015

Aspose.Confusion in Words

In June, Aspose’s Shoaib Khan published a blog post purporting to cover features available in Aspose.Words for Java but not docx4j.

It is either breathtaking or amusing in its inaccuracy, depending on whether you think it was born of deceit or ineptitude.  Either way, its a caution to anyone considering drinking the Aspose.Kool-Aid!

Here I’ll go through his claims one by one.

As a general comment though, it is worth remembering that with docx4j, you can do pretty much anything the docx/pptx/xlsx file formats allow.  If docx4j doesn’t have a high level API for something you want to do, you can always implement it yourself, thanks to docx4j’s lower level JAXB-based APIs.  And docx4j is real ASLv2 open source, so you can use the source Luke!

Without further ado…

Set Page Borders.
Here Aspose seems to be talking about section properties (ie margins etc).

Shoaib implies you can’t control these in docx4j.  Of course you can!  You can add or remove sections, or modify the settings of an existing section.

Track Changes in Documents.
Their AcceptAllRevisions method is said to be similar to Word’s “Accept All Changes”.

Docx4j doesn’t provide a high level API for doing this (since users haven’t asked for it), but a user could implement this for his/herself easily enough using XSLT or docx4j’s TraversalUtil.  You could start with this XSLT

Using Control Characters.
This example is a bit bizarre, because in a docx, specific elements w:br and w:cr are used for line breaks;  OpenXML follows the usual XML rules for whitespace.

Split Tables.
This example shows the steps a user would follow to split a table into two.  Basically, clone the existing table to make a new table with the same properties, then move rows from the first table to the second.

Of course you can do the same thing in docx4j!

Repeat Table Header Rows on Pages.
This example is just about setting the header row property.  See http://stackoverflow.com/questions/14605264/repeat-table-header

Clone Documents.
Cloning a document.  Aspose suggest you can’t do this with docx4j? WTF?! OpcPackage’s clone() method has been there since 2.7.1

docx4j also includes code for making partial copies, where less than a full clone is required.

Moving the Cursor in Document.
OpenXML is of course XML, which is hierarchical.  docx4j uses JAXB to give you an object model representation of that.  The hierarchical structure is basically nested lists.  Lots of stuff boils down to finding a position in a list, and then inserting or deleting etc using the Java collections API.  To find that list/position, you’d typically use docx4j’s powerful traversal functionality.   Or you can use XPath (in JAXB, the objects are bound to the underlying XML).

Aspose has some notion of cursor position, so you can move to the start or end of the document.  This may appeal to people with a VBA background, but in practice it is of little use.

Protect Documents.
Whilst it is true that in docx4j 3.2.0 there was no high level API for the functionality Microsoft Word groups under Protect Document (Mark as Final, Encrypt with Password, Restrict Editing etc), this is available now in the 3.3.0 previews:

Working with Digital Signatures.
As with the other Protect Document features, there was no high level API in 3.2.0.  That’s not to say you couldn’t do it, but there’s a nice API for this in the commercial Enterprise Ed. (forthcoming v3.3.0)

Check Format Compatibility.
This seems to be restricted to knowing what type of document you are working with; Aspose says it doesn’t validate the file format.

Per the specs, an OPC package has a content type.  In docx4j, docx/dotx/docm etc are all represented by WordprocessingMLPackage, but you can distinguish between them by calling getContentType(); the value will be one of:

  • WORDPROCESSINGML_DOCUMENT
  • WORDPROCESSINGML_DOCUMENT_MACROENABLED
  • WORDPROCESSINGML_TEMPLATE
  • WORDPROCESSINGML_TEMPLATE_MACROENABLED

Load Text File.
Whoopey do.  Apparently you can import plain text using Aspose’s expensive software!

Of course that is trivial with docx4j.  Docx4j also supports converting altChunks to native WordML content.  For XHTML altChunks, you need docx4j-ImportXHTML;  support for altChunks of type docx is an Enterprise level feature.

Specify Default Fonts.
The way the default font works in WordprocessingML is more complicated than you might expect, in that there are a few different ways you could affect it (via the theme part or the styles part).

That said, in real life, this doesn’t tend to be a problem.  With docx4j, you can easily set  w:docDefaults/w:rPrDefault/w:rPr/w:rFonts in your styles part, if you want to.

Working with Tables:
Autofit Setting to Tables.
This is just about the tblLayout setting: w:tblPr/w:tblLayout/@w:type, which you can access via TblPr’s get/setTblLayout

Joining Tables in Document.
I can’t recall anyone ever asking for docx4j to provide a high level API to do this, but it could be added.  In the meantime, docx4j allows to you to do anything with tables which the file format allows, including joining tables.

Mail Merge
Mail Merge from XML Data Source.
docx4j provides a high level API for working with legacy MERGEFIELD fields.

If you wanted to fill those fields with data from XML, you could do that easily enough.

Where docx4j really shines though, is in its support for content control data binding.  In that approach, introduced by Microsoft in 2007, you have a bidirectional XPath mapping between content controls in the document, and an XML file.

If you are working with XML, and not forced to work with legacy MERGEFIELDs for some reason, content control data binding is the way to go.

Jun 16 2015

Off topic: Eclipse’s maven from a command line?

You’ve installed Eclipse.

Eclipse includes maven (m2e).

Can you use that Maven from outside Eclipse, or do you need to install maven again/separately?

It turns out you can use it.  Whether its worth the effort or not is another question…

You launch maven using the plexus classworlds launcher.

That needs a config file.

The config file (call it m2.conf) contains something like:

main is org.apache.maven.cli.MavenCli from plexus.core

set maven.home default /home/jason/.m2

[plexus.core]
load /home/jason/eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*.jar

With that, from a project dir, the following is the equivalent of ‘mvn install’:

java -cp ../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/plexus-classworlds-2.5.1.jar:../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar  “-Dclassworlds.conf=m2.conf” org.codehaus.plexus.classworlds.launcher.Launcher install

You could make a shell script to do that.  And you could base the shell script on the one included in the maven distribution.  Or more sensibly, you’d just download  install and use  maven proper.

If you need it.  Right clicking on your project in Eclipse then Run As gives you a handy UI for maven-related stuff:

Finally, note it is  possible to avoid the plexus classworlds launcher by invoking MavenCli directly:

java -cp “../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar:../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*” org.apache.maven.cli.MavenCli install

If you are going to use that, you might wonder about setting maven.home with -Dmaven.home=/your/path/to/where immediately after the classpath

Jun 16 2015

docx4j from GitHub in Eclipse – 3 years on

In May 2012 we posted docx4j-from-github-in-eclipse. That was more than 3 years ago now, so its about time to update that walkthrough :-)

This post is about getting the docx4j source code setup in Eclipse, so you can not just use it, but easily study it as well (and submit pull requests!).  If you have no interest or need to do that, please see hello-maven-central (if you’re already using a recent Eclipse, you can start at step 4) and/or docx4j-3-0-and-maven (but do use our current version 3.2.x)

Preliminaries – JDK

Make sure you have the JDK installed; Java 6 or later.  The JRE alone is not enough, since it doesn’t include a compiler (javac).

Preliminaries – Eclipse

Install Eclipse.  These days, the basic package has everything you need (ie git and maven support):

Git & GitHub

GitHub is docx4j’s authoritative source repository.  Eclipse now includes a git client. (If you have an older Eclipse, you can install eGit) However, it is still handy to have other git clients installed:

  • on Linux (listed first, given git’s provenance), install git using your distribution’s package manager
  • on Windows, the Git BASH shell is handy; as is Atlassian’s SourceTree
  • on OSX, ditto

Clone or Fork?

With Git, there is a difference between cloning and forking.

  • Cloning gives you a copy of the source code you can work on, but without more, no easy way to contribute changes back.
  • Forking sets you up with the source code, and makes it easy to contribute changes back.

If you think you might be making changes to the docx4j source code, you’re probably best to create a fork on GitHub right from the start.

To create a fork, log in to GitHub, visit https://github.com/plutext/docx4j then press the “Fork” button.

Choose your poison

There are 3 steps to installing docx4j:

  1. clone the docx4j repo
  2. install its dependencies
  3. install docx4j project in Eclipse

You can do these 3 steps entirely within Eclipse, but Eclipse by default doesn’t give much feedback as to what its doing, so you might wonder whether its still working properly.

Since its just as easy (or easier) to use the command line, I’ll show that way first:-

Command Line Approach

To do it this way, you’ll need:

  • a git shell, and
  • Maven

Both of these are worth having in any case.

Step 1. To clone docx4j from your git shell, use the github URL for docx4j (your fork or Plutext’s):

$ git clone -b master –single-branch https://github.com/plutext/docx4j docx4j
Cloning into ‘docx4j’…
remote: Counting objects: 42008, done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 42008 (delta 23), reused 7 (delta 0), pack-reused 41946
Receiving objects: 100% (42008/42008), 61.03 MiB | 128.00 KiB/s, done.
Resolving deltas: 100% (25108/25108), done.

You should now have a docx4j directory, containing the docx4j source code.

Step 2. Next, to get docx4j’s dependencies, you’ll need Maven.

So first, install Maven (if you don’t have it already).  Please see the instructions at maven.apache.org (actually, you’ve already got Maven in Eclipse, but its a bit hard to use from the command line).

Now you can go into your docx4j directory, and type:

mvn install -DskipTests=true

You’ll see Maven download docx4j’s dependencies.

Step 3. Now you are ready to start Eclipse.

Because docx4j includes Eclipse project definition files, you can import the docx4j project.

From the File menu, click Import, then Existing Projects into Workspace:

Browse to your docx4j directory:

Then click Finish.

Now the project should be set up correctly.  If you see errors, please refer further below for troubleshooting.

Eclipse only approach

Step 1. clone the docx4j git repo in Eclipse; for this you need the Git Repositories View:

Window > Show View > Other > Git > Git Repositories View

Click “Clone a Git repository” then enter the URI for docx4j (your fork, or Plutext’s), then click Next.

The master branch is probably all you need (though Eclipse will probably fetch all the others at some point anyway!)

Step 2. On the next screen, you can tick “Import all existing projects after clone finishes“.  (if you don’t do that, you’ll have to manually File > Import, then Existing Projects into Workspace, as explained above)

Step 3. Eclipse will now start building the project; first Maven will get the dependencies.

This may take a while … to see what Eclipse is doing while it displays the status “Building workspace”, from the Console view, click the drop down to see the Maven Console:

There you can watch it downloading stuff.

You can also look at the Progress view.

When its all done, you should have a docx4j project there and ready to go!

Troubleshooting

I don’t cover issues with git clone or maven here; just issues with Eclipse.

If Eclipse has a problem with your docx4j project, you’ll see an exclamation mark:

You can see further info in the Problems view; the most likely problem is that your Java is misconfigured:-

To fix this, on the docx4j project, click Alt-Enter to go into its properties.

Then click Java Build Path, then the Libraries tab.

Do you see a red cross next to JRE System Library, as above?

If so click on the JRE System Library entry to select it, then click the Remove button.

Next click Add Library, the JRE System Library, then add one (1.6 or above).

Note the warning:

That’s OK, we changed the JRE on the Java Build Path up above.

Hello World

Now you are ready to run some docx4j code.

A good place to start is to run  CreateWordprocessingMLDocument

Use docx4j in your own project

To use docx4j in your own project, there are 2 approaches:

  • the Maven way.  If you’re planning to use Maven, you just specify docx4j as a dependency, and if the version matches (look in pom.xml), it’ll use your docx4j project (assuming workspace resolution is switched on).  Please see hello-maven-central (if you’re already using a recent Eclipse, you can start at step 4) and/or docx4j-3-0-and-maven (but do use the version specified in pom.xml)
  • or, via the Java Build Path > Projects tab.

Feb 10 2015

High fidelity PDF output

This post introduces our new commercial component for docx to PDF output.

The background is that docx4j’s standard method of producing PDF output has been via XSL FO, using Apache FOP.

This has worked well enough for some docx4j users, but it has certain limitations which can bite you, for example lack of tab and tab stop support in XSL FO.

And because there are differences between FOP’s layout engine and Word’s, page breaks may fall in different places.

This means the FO based PDF output in docx4j is about as good as its going to get (short of enhancing the FO renderer).

To do better, we’ve had to invest in a non-FO approach, using layout algorithms specifically designed to give the same results Word does.

You can try it now.

A side benefit is that this new approach is much faster than the FO approach.

The component is actually independent of docx4j.  This means it’ll also work great  if you need to convert docx to PDF from C# (without Word), Python, PHP etc.

Pricing is at plutext.com

Jan 20 2015

Content controls for business data connectivity

Sometimes, Word is a natural way for people to interact with back end applications (eg SAP).

This is particularly so when:

  • business data will be output to a Word document,
  • the user is more familiar with Word than the other system,
  • certain data updates may be required (and are permitted)

So maybe there are 4 high level categories:

  • apps which support commercial transactions (a recipient will receive a Word document), eg
    • employment onboarding (letter of employment)
    • invoicing
  • apps with a reporting component:  is the report format a natural interface if it was made bi-directional?
  • workflow/BPM systems which present documents (work orders, proposals, approvals etc)
  • policy/procedures in regulated industries, where a worker must follow a series of steps.  Can present that in a docx; the worker can tick the steps off as they do them
    • related training scenarios?

Microsoft had an emphasis on what they then called “Office Business Applications” back around the Office 2007 launch.  Fast forward to today, and “business connectivity services” are part of SharePoint.

But you can achieve the same sort of thing without SharePoint, using docx4j and data bound content controls.

Once you have your back end data in an XML format (and there are many tools/techniques to help with this), you can use content controls to bind what the user sees in Word to elements in that XML.

The beauty of it is that the binding is bi-directional, so if the user edits the document, the XML is updated (ie it stays in synch).

After the user has made their edit, you can update the back end application.  Typically, you’d do this after they saved & closed the document (ie outside Word, using docx4j), but you could also do it from within Word (a less good approach, but still, an option).

What if there is some data which the user shouldn’t be able to edit?  You simply lock the content control to prevent editing.

To quickly try out this approach, put together some sample XML, then upload it as explained here, to get a docx you can experiment with.

We’d loved to hear about how you might use this approach?