Sep 05 2020

Fast PDF templating using XSL FO

If your objective is to generate PDF documents only (i.e. you have no need for docx or HTML output), then you might consider an XSL FO templating approach.

That is, you do variable replacement in the XSL FO document.

This could be faster than generating a docx file first each time, since the conversion to XSL FO is done once, when you create the template, rather than for every document.

You could use any of the templating libraries for this. Google “java xml templating -sap -hana”

Using a templating library is probably better than working with a lower level Java XML API so you don’t reinvent the wheel.

For example, https://www.thymeleaf.org/ (especially if you are using Spring).

Below is a quick demo of using a template library called Pebble to create an invoice. The Maven dependency is:

<dependency>
    <groupId>io.pebbletemplates</groupId>
    <artifactId>pebble</artifactId>
    <version>3.1.4</version>
</dependency>

This demo shows simple variable replacement, repeating content, and conditional content. Java and XSL FO template attached.

How do you get the XSL FO in the first place? You can create a docx document using Microsoft Word, then convert that to XSL FO using docx4j.

Then you add the templating commands to the XSL FO file using your favourite text editor.

Java code:

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.mitchellbosecke.pebble.PebbleEngine;
import com.mitchellbosecke.pebble.template.PebbleTemplate;

public class PebbleFoTemplating {

    public static void main(String[] args) throws IOException {

        PebbleEngine engine = new PebbleEngine.Builder().build();
        PebbleTemplate compiledTemplate = engine.getTemplate(System.getProperty("user.dir") + "/invoice_fo.xml");

        Map<String, Object> context = new HashMap<>();

        context.put("name", "Mitchell");

        // repeat demo, see https://pebbletemplates.io/wiki/tag/for/
        List<Fruit> repeat_fruit = new ArrayList<Fruit>();
        repeat_fruit.add(new Fruit("apples", "$20"));
        repeat_fruit.add(new Fruit("oranges", "$40"));
        context.put("fruitList", repeat_fruit);

        // condition, see https://pebbletemplates.io/wiki/tag/if/
        context.put("condition1", Boolean.FALSE);

        Writer writer = new StringWriter();
        compiledTemplate.evaluate(writer, context);

        String output = writer.toString();
        System.out.println(output);

    }

    static class Fruit {

        Fruit(String name, String price) {
            this.name = name;
            this.price = price;
        }

        public String name;
        public String price;
    }

}

Pebble XSL FO template:

<?xml version="1.0" encoding="utf-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
<layout-master-set
xmlns="http://www.w3.org/1999/XSL/Format">
<simple-page-master margin-bottom="12mm"
margin-left="1in" margin-right="1in" margin-top="12mm"
master-name="s1-simple" page-height="297mm" page-width="210mm">
<region-body column-count="1" column-gap="12mm"
margin-bottom="36.0pt" margin-left="0mm" margin-right="0mm"
margin-top="36.0pt" />
<region-before extent="0.0pt"
region-name="xsl-region-before-simple" />
<region-after extent="0.0pt"
region-name="xsl-region-after-simple" />
</simple-page-master>
<page-sequence-master master-name="s1">
<repeatable-page-master-alternatives>
<conditional-page-master-reference
master-reference="s1-simple" />
</repeatable-page-master-alternatives>
</page-sequence-master>
</layout-master-set>
<fo:page-sequence force-page-count="no-force"
id="section_s1" format="" master-reference="s1">
<fo:flow flow-name="xsl-region-body">

<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm" white-space-treatment="preserve">
</fo:block>
<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm" text-align="center">
<inline xmlns="http://www.w3.org/1999/XSL/Format"
font-family="Calibri">INVOICE</inline>
</fo:block>
<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm">
<inline xmlns="http://www.w3.org/1999/XSL/Format"
font-family="Calibri">{{ name }}</inline>
</fo:block>
<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm" white-space-treatment="preserve">
</fo:block>

<fo:table border-bottom-color="#000000"
border-bottom-style="solid" border-bottom-width="0.5pt"
border-collapse="collapse" border-left-color="#000000"
border-left-style="solid" border-left-width="0.5pt"
border-right-color="#000000" border-right-style="solid"
border-right-width="0.5pt" border-top-color="#000000"
border-top-style="solid" border-top-width="0.5pt"
display-align="before" start-indent="0in" table-layout="fixed"
width="159mm">
<fo:table-column column-number="1"
column-width="119mm" />
<fo:table-column column-number="2"
column-width="1.58in" />
<fo:table-body start-indent="0in">
<fo:table-row>
<fo:table-cell border-bottom-color="#000000"
border-bottom-style="solid" border-bottom-width="0.5pt"
border-left-color="#000000" border-left-style="solid"
border-left-width="0.5pt" border-right-color="#000000"
border-right-style="solid" border-right-width="0.5pt"
border-top-color="#000000" border-top-style="solid"
border-top-width="0.5pt" padding-bottom="0mm"
padding-left="1.91mm" padding-right="1.91mm" padding-top="0mm">
<block xmlns="http://www.w3.org/1999/XSL/Format"
font-size="11.0pt" line-height="100%" space-after="0in">
<inline font-family="Calibri">Item</inline>
</block>
</fo:table-cell>
<fo:table-cell border-bottom-color="#000000"
border-bottom-style="solid" border-bottom-width="0.5pt"
border-left-color="#000000" border-left-style="solid"
border-left-width="0.5pt" border-right-color="#000000"
border-right-style="solid" border-right-width="0.5pt"
border-top-color="#000000" border-top-style="solid"
border-top-width="0.5pt" padding-bottom="0mm"
padding-left="1.91mm" padding-right="1.91mm" padding-top="0mm">
<block xmlns="http://www.w3.org/1999/XSL/Format"
font-size="11.0pt" line-height="100%" space-after="0in">
<inline font-family="Calibri">Price</inline>
</block>
</fo:table-cell>
</fo:table-row>
{% for fruit in fruitList %}
<fo:table-row>
<fo:table-cell border-bottom-color="#000000"
border-bottom-style="solid" border-bottom-width="0.5pt"
border-left-color="#000000" border-left-style="solid"
border-left-width="0.5pt" border-right-color="#000000"
border-right-style="solid" border-right-width="0.5pt"
border-top-color="#000000" border-top-style="solid"
border-top-width="0.5pt" padding-bottom="0mm"
padding-left="1.91mm" padding-right="1.91mm" padding-top="0mm">
<block xmlns="http://www.w3.org/1999/XSL/Format"
font-size="11.0pt" line-height="100%" space-after="0in">
<inline font-family="Calibri">{{ fruit.name }}</inline>
</block>
</fo:table-cell>
<fo:table-cell border-bottom-color="#000000"
border-bottom-style="solid" border-bottom-width="0.5pt"
border-left-color="#000000" border-left-style="solid"
border-left-width="0.5pt" border-right-color="#000000"
border-right-style="solid" border-right-width="0.5pt"
border-top-color="#000000" border-top-style="solid"
border-top-width="0.5pt" padding-bottom="0mm"
padding-left="1.91mm" padding-right="1.91mm" padding-top="0mm">
<block xmlns="http://www.w3.org/1999/XSL/Format"
font-size="11.0pt" line-height="100%" space-after="0in">
<inline font-family="Calibri">{{ fruit.price }}</inline>
</block>
</fo:table-cell>
</fo:table-row>
{% endfor %}
</fo:table-body>
</fo:table>
<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm">
<inline xmlns="http://www.w3.org/1999/XSL/Format"
font-size="8.0pt" />
</fo:block>
<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm">
<inline xmlns="http://www.w3.org/1999/XSL/Format"
font-family="Calibri">Please remit funds to ABC Bank, account number 123 456
789. </inline>
<inline xmlns="http://www.w3.org/1999/XSL/Format"
font-size="8.0pt" />
</fo:block>
{% if condition1 %}
<fo:block font-size="11.0pt" line-height="115%"
space-after="4mm">
<inline xmlns="http://www.w3.org/1999/XSL/Format"
font-family="Calibri">This paragraph should be included.</inline>
</fo:block>
{% endif %}


</fo:flow>
</fo:page-sequence>
</fo:root>
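
One thing to watch with any text-based templating of XSL FO: a value containing a character such as & or < has to end up escaped in the output, otherwise the result is not well-formed XML. Templating libraries usually have an escaping option, so check what yours does by default, and don't escape by hand as well if it already does it for you. Purely as an illustration, a hand-rolled helper might look like the sketch below (the class and method names are made up for this example; they are not part of Pebble or docx4j):

```java
public class FoEscape {

    /** Escapes the five characters that have special meaning in XML. */
    static String escapeXml(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A product name like this would break the FO if inserted unescaped
        System.out.println(escapeXml("Fish & Chips <large>"));
    }
}
```

You would call something like this on each value before putting it into the context map, but only if nothing else in your pipeline is escaping the value.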

Sep 05 2020

Office pptx/xlsx/docx to PDF in docx4j 8.2.3

docx4j 8.2.3 facilitates 3 distinct ways to convert Microsoft Word docx documents to PDF. There are also possibilities for converting pptx or xlsx to PDF.

The three approaches:

  • export-fo: the content is converted to XSL FO, and from there, to PDF (or any of the other formats supported by Apache FOP)
  • documents4j: since 8.2.0, use Microsoft Word to do the conversion
  • via-Microsoft-Graph: new in 8.2.3, use java-docx-to-pdf-using-Microsoft-Graph to do the conversion

So which should you choose? The following table covers some of the things you might want to consider:

Overview:

  • export-FO: Conversion of docx to XSL FO, then uses Apache FOP to convert to PDF
  • Microsoft Graph: Uses Microsoft’s cloud
  • documents4j: Uses your Microsoft Office installation

Fidelity:

  • export-FO: Suitable for simple documents (text, tables, supported image types, header/footers)
  • Microsoft Graph: 100% (Microsoft’s fidelity)
  • documents4j: 100% (Microsoft’s fidelity)

Suitability:

  • export-FO: simple docx
  • Microsoft Graph: docx, pptx, xlsx
  • documents4j: docx, xlsx

License considerations:

  • export-FO: ASL v2
  • Microsoft Graph: Refer applicable Microsoft cloud terms
  • documents4j: Refer Microsoft EULA governing your Office install (increasingly restricted with each release)

Cost:

  • export-FO: Free
  • Microsoft Graph: Microsoft cloud costs
  • documents4j: (Microsoft Office)

Confidentiality:

  • export-FO: documents don’t leave your server
  • Microsoft Graph: documents go to Microsoft cloud
  • documents4j: documents need not leave your servers

Other advantages:

  • export-FO: Fast XSL FO/PDF templating for high volume PDF creation; open source, so can be extended
  • Microsoft Graph: Microsoft encourages this approach; Microsoft cloud handles scalability
  • documents4j: Can update a docx table of contents; can convert RTF and binary .doc

Other disadvantages:

  • export-FO: Two step (docx to XSL FO to PDF) processing is slower (except for XSL FO templating)
  • Microsoft Graph: Dependency on 3rd party cloud; currently can’t update docx table of contents (vote to fix)
  • documents4j: documents4j doesn’t support pptx; not supported by Microsoft

Mar 16 2020

documents4j for TOC update

documents4j can also be used to update the TOC page numbers in your docx file.

For this, there are 2 adjustments to our previous post.

The first is that you convert to(target).as(DocumentType.DOCX), not DocumentType.PDF, so you get docx output.

The second is that you need a customised word_convert.vbs containing, for example:

    ' Update TOC
    wordDocument.TablesOfContents(1).UpdatePageNumbers

This code will update the first TablesOfContents.

See further https://github.com/documents4j/documents4j/blob/master/documents4j-transformer-msoffice/documents4j-transformer-msoffice-word/src/main/resources/word_convert.vbs

word_convert.vbs is typically found in your documents4j-transformer-msoffice-word.jar

Mar 16 2020

documents4j for PDF output

Generating high fidelity PDF output from Office documents has always been a challenge, given the “long tail” of features which are possible in docx/pptx/xlsx files.

For Word documents, it is easy enough to output paragraphs of text, tables and images. But add in VML, DrawingML, equations, SmartArt, and fidelity becomes a challenge.

If your documents are constrained, you may be able to find a suitable conversion tool. Plutext’s PDF Converter was a good example of this. It worked well on a growing range of documents.

But ultimately, if you want great fidelity on an unconstrained set of files, you need to be using Microsoft’s own Office layout engine.

There are various ways to do that, for example https://developer.microsoft.com/en-us/graph/examples/document-conversion

For Java developers, a good solution is documents4j.

It uses Office and the Microsoft Scripting Host for VBS on the conversion machine, so that machine must run Microsoft Windows.

Documents4j can run either a “LocalConverter” or a “RemoteConverter”.

Using a LocalConverter is as simple as:

import java.io.File;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import com.documents4j.api.DocumentType;
import com.documents4j.api.IConverter;
import com.documents4j.job.LocalConverter;

public class ToPDF {

    public static void main(String[] args) {

        File wordFile = new File(System.getProperty("user.dir") + "/input.docx");
        File target = new File(System.getProperty("user.dir") + "/output.pdf");

        // working folder used by the converter
        IConverter converter = LocalConverter.builder()
                .baseFolder(new File("C:\\temp"))
                .workerPool(20, 25, 2, TimeUnit.SECONDS)
                .processTimeout(30, TimeUnit.SECONDS)
                .build();

        // schedule() returns a Future rather than blocking until the PDF exists
        Future<Boolean> conversion = converter
                .convert(wordFile).as(DocumentType.MS_WORD)
                .to(target).as(DocumentType.PDF)
                .prioritizeWith(1000) // optional
                .schedule();

    }

}
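
Note that schedule() hands back a Future<Boolean>, so the listing above never actually checks whether the conversion succeeded. Here is a minimal sketch of waiting for such a Future with a timeout. It uses a stand-in task from a plain ExecutorService rather than a real documents4j converter (so it runs anywhere), and the class and method names are just for illustration:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AwaitConversion {

    /** Waits for a scheduled job; returns false on timeout, failure or interruption. */
    static boolean await(Future<Boolean> conversion, long timeout, TimeUnit unit) {
        try {
            return Boolean.TRUE.equals(conversion.get(timeout, unit));
        } catch (TimeoutException e) {
            conversion.cancel(true); // give up on a job that is taking too long
            return false;
        } catch (ExecutionException e) {
            return false; // the job itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the Future returned by the converter's schedule() call
            Future<Boolean> conversion = pool.submit(() -> Boolean.TRUE);
            System.out.println("converted: " + await(conversion, 60, TimeUnit.SECONDS));
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real program you would pass the Future you got from schedule() to a helper like this before reading output.pdf.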

From Maven, you just need these dependencies:

		<dependency>
			<groupId>com.documents4j</groupId>
			<artifactId>documents4j-local</artifactId>
			<version>1.1.1</version>
		</dependency>

		<dependency>
			<groupId>com.documents4j</groupId>
			<artifactId>documents4j-transformer-msoffice-word</artifactId>
			<version>1.1.1</version>
		</dependency>

		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-simple</artifactId>
		</dependency>

For a successful conversion, your logs will contain:

[main] INFO com.documents4j.conversion.msoffice.MicrosoftWordBridge - From-Microsoft-Word-Converter was started successfully
[main] INFO com.documents4j.job.LocalConverter - The documents4j local converter has started successfully
[pool-1-thread-1] INFO com.documents4j.conversion.msoffice.MicrosoftWordBridge - Requested conversion from input.docx (application/vnd.com.documents4j.any-msword) to output.pdf (application/pdf)

Jan 03 2019

OpenDoPE and XPath 2.0/3.0

Docx4j generally uses Apache XPath (org.apache.xpath), from the Xalan 2.7.2 jar.  (docx4j uses Xalan plus Xalan-specific extension functions for XSLT in various places including HTML export and OpenDoPE processing).

There are 2 main places where docx4j uses XPath:

  1. JaxbXmlPartXPathAware contains method getJAXBNodesViaXPath, which – thanks to JAXB’s concept of a binder – you can use to select objects (say P objects) in your MainDocumentPart
  2. OpenDoPE content control data binding: XPath is central to content control data binding (binding document content to XML data via XPath).

XPath 2.0 became a W3C Rec in 2007;  XPath 3.0 became a W3C Rec in 2014.

Sadly, Apache XPath has languished at XPath 1.0 level: https://intellectualcramps.wordpress.com/2009/01/12/xerces-getting-xpath-2-0-support/ and http://apache-xml-project.6118.n7.nabble.com/XSLT-2-0-td20898.html

Saxon, in contrast, has supported XPath 2.0 for ages, and also supports 3.1.

In docx4j 6.1.0 we made it easy for you to try Saxon for case 1 (JaxbXmlPartXPathAware getJAXBNodesViaXPath):

Step 1: add Saxon to your classpath, for example (Maven):

<dependency>
  <groupId>net.sf.saxon</groupId>
  <artifactId>Saxon-HE</artifactId>
  <version>9.9.0-2</version>
</dependency>

Step 2: add the following early in your code:

XPathFactoryUtil.setxPathFactory(new net.sf.saxon.xpath.XPathFactoryImpl());

In docx4j 6.1.0, this only affects case 1.  OpenDoPE content control data binding would still use Apache XPath.

In docx4j 8.0.0, Saxon would also be used for OpenDoPE content control data binding.

An example: date comparison

You can add an OpenDoPE conditional content control, in which the content is inserted only if XPath "xs:date(/invoice/date) > xs:date('2018-12-31')" is true.  (Date comparison is harder in XPath 1.0: https://stackoverflow.com/questions/4347320/xpath-dates-comparison )

For this to work, you need the prefix mapping xmlns:xs='http://www.w3.org/2001/XMLSchema', so your XPath in the OpenDoPE XPaths part would look something like:

<xpath id="dateGt">
  <dataBinding xpath="xs:date(/invoice/date) &gt; xs:date('2018-12-31')" 
 prefixMappings="xmlns:xs='http://www.w3.org/2001/XMLSchema'" 
 storeItemID="{8B049945-9DFE-4726-9DE9-CF5691E53858}"/>
</xpath>

(for now, you need to manually edit the zipped docx to add that; I’ll update the authoring tools to do it in due course)

You can try this example right away:

Try changing the date in invoice-data.xml to, say, 2018-01-15, then observe the effect on the output docx.

Just to re-iterate, you need Saxon for this to work. Xalan’s XPath will cause an exception.
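
For comparison, here is roughly what the XPath 1.0 workaround looks like if you are stuck without Saxon. This is only a sketch using the JDK's built-in javax.xml.xpath: it strips the hyphens from an ISO date and compares the result as a number. The /invoice/date path comes from the example above, but the sample XML and the class and method names are assumptions made for the illustration:

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class XPath1DateCompare {

    /** True if /invoice/date (yyyy-MM-dd) is later than isoDate, using XPath 1.0 functions only. */
    static boolean dateIsAfter(String xml, String isoDate) {
        // 2018-12-31 becomes the number 20181231, so ordinary numeric comparison works
        String threshold = isoDate.replace("-", "");
        String expr = "number(translate(/invoice/date, '-', '')) > " + threshold;
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(expr,
                    new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<invoice><date>2019-01-15</date></invoice>";
        System.out.println(dateIsAfter(xml, "2018-12-31"));
    }
}
```

It works, but the xs:date version says what it means, which is the attraction of XPath 2.0 here.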

org.eclipse.wst.xml.xpath2.processor is an interesting possible alternative, but it is not in Maven Central, not as well-known as Saxon, and possibly not so easy to get support?

Jul 01 2018

markdown to docx

I was looking for a way to convert swagger2markup markdown output to docx using Java (as opposed to Pandoc).

I found flexmark-java which describes itself as:

Java Markdown parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.

Its MD to DOCX conversion is in the flexmark-docx-converter module, which, happily, uses docx4j under the covers.

I tried it, and found it works nicely, the exception being table rendering if you open the resulting docx in LibreOffice (Word automatically sets the column widths, but in LibreOffice 5.3 Writer, the column widths are bad and painful to fix).   The underlying issue is that markdown doesn’t specify column widths, and in docx4j, we don’t provide an algorithm to help the user set sensible values.

Development of flexmark-docx-converter was  sponsored by Johner Institut GmbH (medical device documentation)

The documentation for flexmark-docx-converter is at https://github.com/vsch/flexmark-java/wiki/Docx-Renderer-Extension

That page says:

Word does not handle inserted HTML very well.

It would be quite straightforward to use docx4j-ImportHTML to work around that :-)

Apr 28 2018

From VariableReplace to OpenDoPE data binding

This blog post is a walkthrough of how to easily move from variable replacement to OpenDoPE content control data binding.

Introduction

Variable replacement is quite a popular way to get started generating Word documents.

I guess that’s because developers expect this sort of approach to be available, and it’s easy: all you have to do is add your variables to the document, then bang, you replace them.

But it’s not all a bed of roses; there are some thorns in there too:

  1. the so-called “split run” problem, in which Word has split your variable name across more than one XML element, due to formatting, spelling/grammar etc
  2. variable replacement is great if you just want to replace text, but what if you want to replace images, conditionally include/exclude content, or repeat table rows or list entries?
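
To make the split run problem concrete, here is a toy sketch. Plain strings stand in for the text of consecutive Word runs, and the ${name} placeholder syntax and all the names in the code are assumptions made for the illustration; this is not docx4j's implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitRunDemo {

    /** Naive approach: look for the placeholder inside each run separately. */
    static List<String> replacePerRun(List<String> runs, String placeholder, String value) {
        return runs.stream()
                .map(run -> run.replace(placeholder, value))
                .collect(Collectors.toList());
    }

    /** Join the run text first, so a placeholder split across runs is found. */
    static String replaceAcrossRuns(List<String> runs, String placeholder, String value) {
        return String.join("", runs).replace(placeholder, value);
    }

    public static void main(String[] args) {
        // Word has split "${name}" across two runs
        List<String> runs = Arrays.asList("Dear ${na", "me},");

        System.out.println(replacePerRun(runs, "${name}", "Mitchell"));
        System.out.println(replaceAcrossRuns(runs, "${name}", "Mitchell"));
    }
}
```

The per-run version leaves the text untouched because neither run contains the whole placeholder. Joining the text finds it, but in a real document you then have to decide which run's formatting the result should keep.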

Content control data binding is a great solution to these problems.

Your data (provided in XML format) is bound to content controls using XPath, and with the OpenDoPE conventions, this approach offers:

Some users create very complex contracts and reports this way.

Automated Migration

The good news is that docx4j contains code to automatically migrate a document which has variables on its surface, to one which contains OpenDoPE content controls.

The code is in FromVariableReplacement.java

Have a look at the main method to see how to use it.

There have been some fixes recently, so you should use docx4j-nightly-20180428.jar (or 3.3.8 when released) or later.

OK, let’s assume you now have a docx file with content controls in it.  You may want to further develop your template.  For this you need an OpenDoPE Authoring tool.

OpenDoPE Authoring

The friendliest OpenDoPE authoring tool is the “No-XML” Word AddIn.

With this it is very easy to add conditions, but the limitation is that it assumes a fixed XML format.   If you want to use your own XML format (or to bind escaped XHTML I suspect), you’ll need to use one of the older add ins.

Here we’ll walk through adding a condition with “No XML” add in.

For this example, we’ll use the Commonwealth of Australia’s model Confidentiality Agreement, available at http://www.business.gov.au/IPToolkit

Here’s what the first few blanks look like, represented as content controls with the “No XML” AddIn’s ribbon showing in Word:

NoXMLAddIn1

I had pressed the “Show tags” button to be able to see the content controls in orange above.

Further down, there’s an optional Indemnity clause.

Since it’s optional, let’s wrap that in a conditional content control.  First, we need a question “Do you want the Indemnity”.  It works this way because this AddIn is aimed primarily at the interactive use case, that is, a user can answer questions in their web browser to generate an instance document.

But the resulting template can be used just as easily for the more common non-interactive / entirely programmatic case.

So click the “Insert Q/A” button.  I did this with my cursor somewhere in the middle of the Indemnity clause.

Fill in the form (for Multiple Choice choose yes):

qa_mcq

click next, then on the next page, choose type boolean (true/false), then ok.

You’ll see a content control inserted where your cursor was.  We don’t really want that, so it’s a bit annoying (you can/should delete it).  You’ll see why we did this just below.

Now select the clause heading and body, and click “Wrap with Condition”.  You’ll see something like:

condition1

In the condition builder, define the following condition:

indemnity_condition

then click OK. (Now you can see why we needed to set up that question first)

Our resulting conditional clause appears as follows:

indemnity_result

That’s all you need to do.  We can now try generating an instance document from this template.

Document generation runtime

To generate a document, use docx4j code to populate an Answers object, then call Docx4J.bind. For example:

WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(new java.io.File(inputfilepath));

answers = new Answers();

addAnswer("Sponsor_name_ACNABN_oW", "CSIRO of Some St, Sydney");
addAnswer("want__Indemnity_clause_K8", "true"); // or false
// etc

// the flags are combined with bitwise OR
Docx4J.bind(wordMLPackage, answers, Docx4J.FLAG_BIND_INSERT_XML | Docx4J.FLAG_BIND_BIND_XML);
Docx4J.save(wordMLPackage, new File(outputfilepath), Docx4J.FLAG_NONE);

where addAnswer is just:

private void addAnswer(String key, String value) {
Answer a = new Answer();
a.setId(key);
a.setValue(value);
answers.getAnswerOrRepeat().add(a);
}

How do you know what key to use?  Look in the answers part in the docx and use the corresponding ID.  (Yes, you should be able to see this in the AddIn; the reason you can’t is that for the interactive use case, you never need to know.)  Alternatively, you can just invoke Docx4J.bind with debug level logging enabled for org.docx4j.model.datastorage, and it will print out the relevant part.

answersPart

That’s about it.  If you have questions, they are probably best posted in the relevant docx4j forum or on StackOverflow.

Mar 27 2018

Docx4j and WebSphere 2018

TLDR

Current 3.3.x Docx4j works with WebSphere versions 8.5.5.9 and 9.0.0.5 in WebSphere’s default configuration (tested with IBM Java 8, which is not the default in WebSphere 8.5.5.9).

docx4j 3.3.7 contains an important fix for errorsCount where XLXP2 is in use with fallback JAXBContext of Sun/Oracle or reference implementation (see below for context).

Scope/Assumptions

Our testing was based around the following assumptions:

  • IBM JDK (not Sun/Oracle)
  • IBM JAXB (see below)
  • Xalan is available for use via System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl")
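
When checking assumptions like these on a particular server, it can help to print which TransformerFactory is actually in effect. A minimal sketch using only the JDK (the class name is made up for this example, and it deliberately does not set the property itself):

```java
import javax.xml.transform.TransformerFactory;

public class WhichTransformerFactory {

    /** Returns the class name of the TransformerFactory that JAXP resolves to. */
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // null here means the system property has not been set
        System.out.println("javax.xml.transform.TransformerFactory property = "
                + System.getProperty("javax.xml.transform.TransformerFactory"));
        System.out.println("factory in effect = " + factoryClassName());
    }
}
```

Run it inside the webapp (for example from a test servlet), not from the command line, since the answer depends on the class loader in use.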

Out of Scope of testing: OSGi. Others have done some work on OSGi in the past though; see https://github.com/uncleit/docx4j-osgi/blob/master/pom.xml or https://github.com/kimios

JAXB Background

IBM has their own proprietary JAXB implementation. By default, WebSphere uses com.ibm.xml.xlxp2.jaxb, which has the concept of fallback/ MarshallerProxy. The actual implementation it uses is in com.ibm.jaxb.tools.jar.

It is possible to configure WebSphere to instead use the JAXB implementation in the Sun/Oracle JRE, but usually you would not do this if you are using the IBM JDK.  Alternatively, your application could use MOXy JAXB (by including the relevant jars).

Here we tested with WebSphere’s default, namely:

Primary JAXBContext:
bundleresource://138.fwk797973828/com/ibm/xml/xlxp2/jaxb/JAXBContextImpl.class,
Version: 1.6.2-jaxb,
Fallback JAXBContext:
bundleresource://11.fwk797973828/com/ibm/jtc/jax/xml/bind/v2/runtime/JAXBContextImpl.class Build-Id: null

For more information, see https://stackoverflow.com/questions/48700004/does-webspheres-jaxb-marshallerproxy-use-the-reference-implementation

WebSphere has property: com.ibm.xml.xlxp.jaxb.opti.level (see https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.0.0/com.ibm.websphere.nd.doc/info/ae/ae/xrun_jvm.html#com.ibm.xml.xlxp.jaxb.opti.level ):

  • At level=0, optimization methods are not enabled;
  • At level=3 (default), both unmarshalling and marshalling optimization methods are enabled.

In our testing, we used values 0 and 3 (or not set).

WebSphere has several other JAXB related properties which we left at their default settings.

ErrorsCount

Docx4j contains a class JaxbValidationEventHandler, which is responsible for handling unexpected content (both mc:AlternateContent which is common, and certain other errors in an incoming docx).

In the JAXB reference implementation, there is a method shouldErrorBeReported(); see https://github.com/javaee/jaxb-v2/blob/master/jaxb-ri/runtime/impl/src/main/java/com/sun/xml/bind/v2/runtime/unmarshaller/UnmarshallingContext.java#L1350

Previously errors (ie unexpected content) were not ignored if UnmarshallingContext.getInstance().parent.hasEventHandler()

Some time around 2015, JAXB was changed so that after unexpected content has been encountered 10 times (ie in 10 docx parts), the error won’t be reported (ie docx4j’s JaxbValidationEventHandler won’t be invoked, so docx4j doesn’t have the opportunity to deal with the content error, with the result that content is silently dropped).

Recent versions of docx4j work around this, by resetting the error counter, and docx4j 3.3.7 builds on this with an important fix for errorsCount where XLXP2 is in use with fallback JAXBContext of Sun/Oracle or reference implementation

Test Results

With environment WebSphere 9.0.0.4, current docx4j/Plutext releases work well.

With environment: WebSphere 8.5.5.13 (WebSphere 8.5.5.9 upgraded in order to run IBM Java 8),  current docx4j/Plutext releases work well.

(Older Java should also be ok, but was outside the scope of testing)

Methodology Notes

In testing, there are several things to be aware of:

  1. WebSphere might re-use a jar in multiple webapps. In case of unexpected results, ensure you don’t have different versions of the same jar in other webapps, stop the server, clearClassCache, and restart.
  2. If you are looking for JaxbValidationEventHandler log entries but cannot see them, double check that your jar files do not contain another log4j.xml.
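
For the second point, one way to see whether more than one log4j.xml is visible is to ask the class loader for every copy of the resource. A small sketch using only the JDK (the class and method names are made up for this example):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class FindDuplicateResources {

    /** Lists every location from which the named resource can be loaded. */
    static List<URL> locate(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(cl.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> copies = locate("log4j.xml");
        System.out.println(copies.size() + " copies of log4j.xml visible:");
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

As with any class loader question in WebSphere, run this from inside the webapp concerned; more than one URL in the output suggests a stray configuration file in one of your jars.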

Java 2 Security

If you have Java 2 Security enabled in WebSphere, you will need certain permissions enabled in policy settings.

Mar 15 2018

PDF Converter task sizing and auto scaling

With FarGate, you have to specify a task size:

task-sizing

Load testing with JMeter, I have found that 2 vCPU works well for the Task CPU setting.  The minimum Task Memory you can set for 2 vCPUs is 4GB.  (The PDF Converter doesn’t use that much RAM, so it would be good to be able to specify just 1GB, particularly since FarGate pricing includes a cost per GB.)

For my load testing (32 parallel conversions), served by 2 tasks:

JMeter_2-tasks

So, an average of 9.8 sec per conversion (based on a range of documents, some short/quick, others long/slow).

With FarGate, you can set a service to auto-scale, under CPU load or based on incoming requests.

So let’s improve on those response times, by auto-scaling the number of tasks available for processing the incoming PDFs.

How to do this? FarGate tells me my CPU utilization was:

UtilizationCPU

So let’s “update” the service to set auto-scaling to happen at 40%:

auto-scaling-cpu40

Re-running the load test, here are the results:

JMeter_autoscaled

You can see the response time more than halved, and throughput doubled.

At the end of the test, I can see that it auto-scaled to 10 tasks:

tasks-status-scaled-cpu40

Looking at the load balancer target group, you can see it went from 2 tasks to 5 tasks to 10:

healthy hosts

(the test started at 23:13 and finished at 11:28; scaling in occurred some time after the test concluded).

You can see from the graph below that the average response times drop as these extra tasks become available.

response-times-over-time

Running the load test one last time, with 10 tasks in place from the start:

JMeter_10-tasks

we have an average response time of 2.2 seconds, and we’re converting 12.48 documents per second.

In summary, configuring the cluster so that each task has 2 vCPUs, and auto-scaling when CPU utilization hits 40%, looks like a good place to start tweaking your own instance.

Mar 12 2018

Using HTTPS on FarGate

This is the second post in a series on scaling the PDF Converter using Amazon’s FarGate service.

In the first post, we got the PDF Converter running across 2 instances behind a load balancer, in under 20 minutes.

Now, we want to use HTTPS.  The Amazon documentation is at https://docs.aws.amazon.com/elasticloadbalancing/latest/application/create-https-listener.html

First, go to the AWS Certificate Manager (ACM): https://console.aws.amazon.com/acm/home?region=us-east-1#/firstrun/ to request a certificate (for your domain).

Now go to your load balancer, and choose “create listener“. Choose HTTPS.  You should see something like:

alb-https-listener

(here I’ve used plutext.com, but obviously you’ll have substituted your own domain).

If/when you click “create”, you’ll probably get a warning saying your security group doesn’t allow HTTPS, so click on the security group to allow traffic on port 443.

We’re not quite there yet.  If you try converting using your load balancer endpoint (something like https://EC2Co-EcsEl-1GY7BNHSDU1HTH-1150934046.us-east-1.elb.amazonaws.com:443), you’ll get an error saying the certificate subject name does not match target host name.

To overcome this, you need to update your DNS records so you have a host with the right name resolving to the load balancer.

The recommended way to do this is to use Amazon’s Route53 DNS.

But just to prove what we’ve done so far works, it’s enough to put an entry in your /etc/hosts file mapping a host covered by your certificate, to the load balancer’s IP address.  Then:

$ curl -v -X POST --data-binary @HelloWorld.docx -o out.pdf https://fargate.plutext.com:443/v1/00000000-0000-0000-0000-000000000000/convert
Note: Unnecessary use of -X or --request, POST is already inferred.
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 54.89.45.53...
* Connected to fargate.plutext.com (54.89.45.53) port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 597 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: *.plutext.com (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=*.plutext.com
*        start date: Mon, 12 Mar 2018 00:00:00 GMT
*        expire date: Fri, 12 Apr 2019 12:00:00 GMT
*        issuer: C=US,O=Amazon,OU=Server CA 1B,CN=Amazon
*        compression: NULL
* ALPN, server accepted to use http/1.1
> POST /v1/00000000-0000-0000-0000-000000000000/convert HTTP/1.1
> Host: fargate.plutext.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 4082
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
0  4082    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0} [4082 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Date: Mon, 12 Mar 2018 05:01:52 GMT
< Content-Type: application/pdf
< Content-Length: 38507
< Connection: keep-alive
< access-control-allow-origin: *
<
{ [16384 bytes data]
100 42589  100 38507  100  4082  16903   1791  0:00:02  0:00:02 --:--:-- 16903
* Connection #0 to host fargate.plutext.com left intact

Now we know it works, you can add a CNAME record at your DNS provider, mapping your chosen host name to the load balancer’s host name.

Remove the entry we added to /etc/hosts, give your CNAME entry time to propagate, then verify the curl command works.
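
If you would rather make the same request from Java than from curl, a sketch using the JDK's java.net.http client (Java 11 or later) might look like this. The endpoint and file names are the ones from the curl example; the class and method names are made up for the illustration. No Content-Type header is set here (curl sent its default one above), so add one if your converter needs it:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConvertOverHttps {

    /** Builds a POST of the raw docx bytes, like curl's --data-binary. */
    static HttpRequest buildRequest(String endpoint, byte[] docx) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .POST(HttpRequest.BodyPublishers.ofByteArray(docx))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String endpoint = "https://fargate.plutext.com:443/v1/00000000-0000-0000-0000-000000000000/convert";
        Path input = Paths.get("HelloWorld.docx");
        if (!Files.exists(input)) {
            System.out.println(input + " not found; nothing sent");
            return;
        }
        // The response body is streamed straight to out.pdf
        HttpResponse<Path> response = HttpClient.newHttpClient().send(
                buildRequest(endpoint, Files.readAllBytes(input)),
                HttpResponse.BodyHandlers.ofFile(Paths.get("out.pdf")));
        System.out.println("HTTP " + response.statusCode() + " -> " + response.body());
    }
}
```

Because the JDK client verifies the certificate against the host name in the URL, this fails in the same way as curl if you point it at the raw load balancer endpoint instead of a host covered by your certificate.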