Archive for the ‘Uncategorized’ Category

markdown to docx

July 1st, 2018 by Jason

I was looking for a way to convert swagger2markup markdown output to docx using Java (as opposed to Pandoc).

I found flexmark-java which describes itself as:

Java Markdown parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.

Its MD to DOCX is in flexmark-docx-converter module, which happily, uses docx4j under the covers.

I tried it, and found it works nicely, the exception being table rendering if you open the resulting docx in LibreOffice (Word automatically sets the column widths, but in LibreOffice 5.3 Writer, the column widths are bad and painful to fix).   The underlying issue is that markdown doesn’t specify column widths, and in docx4j, we don’t provide an algorithm to help the user set sensible values.

Development of flexmark-docx-converter was  sponsored by Johner Institut GmbH (medical device documentation)

The documentation for flexmark-docx-converter is at https://github.com/vsch/flexmark-java/wiki/Docx-Renderer-Extension

That page says:

Word does not handle inserted HTML very well.

It would be quite straightforward to use docx4j-ImportHTML to work around that :-)

 

Docx4j and WebSphere 2018

March 27th, 2018 by Jason

TLDR

Current 3.3.x Docx4j works with WebSphere versions 8.5.5.9 and 9.0.0.5 in WebSphere’s default configuration (tested with IBM Java 8, which is not the default in WebSphere 8.5.5.9).

docx4j 3.3.7 contains an important fix for errorsCount where XLXP2 is in use with fallback JAXBContext of Sun/Oracle or reference implementation (see below for context).

Scope/Assumptions

Our testing was based around the following assumptions:

  • IBM JDK (not Sun/Oracle)
  • IBM JAXB (see below)
  • Xalan is available for use via System.setProperty(“javax.xml.transform.TransformerFactory”, org.apache.xalan.transformer.TransformerImpl)

Out of Scope of testing: OSGi. Others have done some work on OSGi in the past though; see https://github.com/uncleit/docx4j-osgi/blob/master/pom.xml or https://github.com/kimios

JAXB Background

IBM has their own proprietary JAXB implementation. By default, WebSphere uses com.ibm.xml.xlxp2.jaxb, which has the concept of fallback/ MarshallerProxy. The actual implementation it uses is in com.ibm.jaxb.tools.jar.

It is possible to configure WebSphere to instead use the JAXB implementation in the Sun/Oracle JRE, but usually you would not do this if you are using the IBM JDK.  Alternatively, your application could use MOXy JAXB (by including the relevant jars).

Here we tested with WebSphere’s default, namely:

Primary JAXBContext:
bundleresource://138.fwk797973828/com/ibm/xml/xlxp2/jaxb/JAXBContextImpl.class,
Version: 1.6.2-jaxb,
Fallback JAXBContext:
bundleresource://11.fwk797973828/com/ibm/jtc/jax/xml/bind/v2/runtime/JAXBContextImpl.class Build-Id: null

For more information, see https://stackoverflow.com/questions/48700004/does-webspheres-jaxb-marshallerproxy-use-the-reference-implementation

WebSphere has property: com.ibm.xml.xlxp.jaxb.opti.level (see https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.0.0/com.ibm.websphere.nd.doc/info/ae/ae/xrun_jvm.html#com.ibm.xml.xlxp.jaxb.opti.level ):

  • At level=0, optimization methods are not enabled;
  • At level=3 (default), both unmarshalling and marshalling optimization methods are enabled.

In our testing, we used values 0 and 3 (or not set).

WebSphere has several other JAXB related properties which we left at their default settings.

ErrorsCount

Docx4j contains a class JaxbValidationEventHandler, which is responsible for handling unexpected content (both mc:AlternateContent which is common, and certain other errors in an incoming docx).

In the JAXB reference implementation, there is a method shouldErrorBeReported(); see https://github.com/javaee/jaxb-v2/blob/master/jaxb-ri/runtime/impl/src/main/java/com/sun/xml/bind/v2/runtime/unmarshaller/UnmarshallingContext.java#L1350

Previously errors (ie unexpected content) were not ignored if UnmarshallingContext.getInstance().parent.hasEventHandler()

Some time around 2015, JAXB was changed so that after unexpected content has been encountered 10 times (ie in 10 docx parts), the error won’t be reported (ie docx4j’s JaxbValidationEventHandler won’t be invoked, so docx4j doesn’t have the opportunity to deal with the content error, with the result that content is silently dropped).

Recent versions of docx4j work around this, by resetting the error counter, and docx4j 3.3.7 builds on this with an important fix for errorsCount where XLXP2 is in use with fallback JAXBContext of Sun/Oracle or reference implementation

Test Results

With environment WebSphere 9.0.0.4, current docx4j/Plutext releases work well.

With environment: WebSphere 8.5.5.13 (WebSphere 8.5.5.9 upgraded in order to run IBM Java 8),  current docx4j/Plutext releases work well.

(Older Java should also be ok, but was outside the scope of testing)

Methodology Notes

In testing, there are several things to be aware of:

  1. WebSphere might re-use a jar in multiple webapps. In case of unexpected results, ensure you don’t have different versions of the same jar in other webapps, stop the server, clearClassCache, and restart.
  2. If you are looking for JaxbValidationEventHandler log entries but cannot see them, double check that your jar files do not contain another log4j.xml.

Java 2 Security

If you have Java 2 Security enabled in WebSphere, you will need certain permissions enabled in policy settings.

PDF Converter task sizing and auto scaling

March 15th, 2018 by Jason

With FarGate, you have to specify a task size:

task-sizing

Load testing with JMeter, I have found that 2 vCPU works well for the Task CPU setting.  The minimum Task Memory you can set for 2 vCPU’s is 4GB.  (The PDF Converter doesn’t use that much RAM, so it would be good to be able to specify just 1GB, particularly since FarGate pricing includes a cost per GB)

For my load testing (32 parallel conversions), served by 2 tasks:

JMeter_2-tasks

So, an average of 9.8 sec per conversion (based on a range of documents, some short/quick, others long/slow).

With FarGate, you can set a service to auto-scale, under CPU load or based on incoming requests.

So let’s improve on those response times, by auto-scaling the number of tasks available for processing the incoming PDFs.

How to do this? FarGate tells me my CPU utilization was:

UtilizationCPU

So let’s “update” the service to set auto-scaling to happen at 40%:

auto-scaling-cpu40

Re-running the load test, here are the results:

JMeter_autoscaled

You can see the response time better than halved, and throughput doubled.

At the end of the test, I can see that it auto-scaled to 10 tasks:

tasks-status-scaled-cpu40

Looking at the load balancer target group, you can see it went from 2 tasks to 5 tasks to 10:

healthy hosts

(the test sarted at 23:13 and finished at 11:28; scaling in occurred some time after the test concluded).

You can see from the graph below that the average response times drop as these extra tasks become available.

response-times-over-time

Running the load test one last time, with 8 tasks in place from the start:

JMeter_10-tasks

we have an average response time of 2.2 seconds, and we’re converting 12.48 documents per second.

In summary, configuring the cluster so that each task has 2 vCPUs, and auto-scaling when CPU utilization hits 40%, looks like a good place to start tweaking your own instance.

Using HTTPS on FarGate

March 12th, 2018 by Jason

This is the second post in a series on scaling the PDF Converter using Amazon’s FarGate service.

In the first post, we got the PDF Converter running across 2 instances behind a load balancer, in under 20 minutes.

Now, we want to use HTTPS.  The Amazon documentation is at https://docs.aws.amazon.com/elasticloadbalancing/latest/application/create-https-listener.html

First, go to the AWS Certificate Manager (ACM): https://console.aws.amazon.com/acm/home?region=us-east-1#/firstrun/ to request a certificate (for your domain).

Now go to your load balancer, and choose “create listener“. Choose HTTPS.  You should see something like:

alb-https-listener

(here I’ve used plutext.com, but obviously you’ll have substituted your own domain).

If/when you click “create”, you’ll probably get a warning saying your security group doesn’t allow HTTPS, so click on the security group to allow traffic on port 443.

We’re not quite there yet.  If you try converting using your load balancer endpoint (something like https://EC2Co-EcsEl-1GY7BNHSDU1HTH-1150934046.us-east-1.elb.amazonaws.com:443), you’ll get an error saying the certificate subject name does not match target host name.

To overcome this, you need to update your DNS records so you have a host with the right name resolving to the load balancer.

The recommended way to do this is to use Amazon’s Route53 DNS.

But just to prove what we’ve done so far works, its enough to put an entry in your /etc/hosts file mapping a host covered by your certificate, to the load balancer’s IP address.  Then:

$ curl -v -X POST --data-binary @HelloWorld.docx -o out.pdf https://fargate.plutext.com:443/v1/00
000000-0000-0000-0000-000000000000/convert
Note: Unnecessary use of -X or --request, POST is already inferred.
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 54.89.45.53...
* Connected to fargate.plutext.com (54.89.45.53) port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 597 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: *.plutext.com (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=*.plutext.com
*        start date: Mon, 12 Mar 2018 00:00:00 GMT
*        expire date: Fri, 12 Apr 2019 12:00:00 GMT
*        issuer: C=US,O=Amazon,OU=Server CA 1B,CN=Amazon
*        compression: NULL
* ALPN, server accepted to use http/1.1
> POST /v1/00000000-0000-0000-0000-000000000000/convert HTTP/1.1
> Host: fargate.plutext.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 4082
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
0  4082    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0} [4082 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Date: Mon, 12 Mar 2018 05:01:52 GMT
< Content-Type: application/pdf
< Content-Length: 38507
< Connection: keep-alive
< access-control-allow-origin: *
<
{ [16384 bytes data]
100 42589  100 38507  100  4082  16903   1791  0:00:02  0:00:02 --:--:-- 16903
* Connection #0 to host fargate.plutext.com left intact

Now we know it works, you can add a CNAME record at your DNS provider, mapping your chosen host name to the load balancer’s host name.

Remove the entry we added to /etc/hosts, give your CNAME entry time to propogate, then verify the curl command works.

JAXB RI new home on the web

June 30th, 2017 by Jason

https://github.com/javaee/jaxb-v2 is JAXB’s new home (jaxb.java.net now redirects there).

Issue tracker is at https://github.com/javaee/jaxb-v2/issues (though it seems some existing issues didn’t get migrated over, for example https://github.com/gf-metro/jaxb/issues/22)

I’m not sure where you are supposed to get official binary releases from; Maven I guess?

http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.sun.xml.bind%22%20AND%20a%3A%22jaxb-ri%22

http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22javax.xml.bind%22%20AND%20a%3A%22jaxb-api%22

Discussion group is apparently at https://javaee.groups.io/g/metro, but first you need to join the parent group https://javaee.groups.io/g/javaee

https://javaee.github.io/ has details of other the projects migrated from Java.net to GitHub

Docx4j Helper Word AddIn: new version v3.3.0

May 11th, 2016 by Jason

We’ve just published a new version of our Helper AddIn for Word; you can download it from  http://www.plutext.com/dn/downloads/1472868282152/Docx4j_Helper_3-3-1-03.exe (link updated 3 Sept 2016).

Here’s what it looks like:

helper-ribbon

Its new features:

  • code generation from your selection (no need to use the webapp interface)
  • generate PDF output using Plutext’s commercial PDF Converter (either a locally installed instance, or http://converter-eval.plutext.com)
  • document sanitisation/anonymisation so you can safely publish it for support purposes
  • easy docDefaults manipulation

I’ll run through these one by one.

code generation from your selection

With the old version, you launched a local version of the docx4j webapp to generate code.

You can still do that (click the Load Helper button, then Parts List); its a useful way to see all the parts in your docx.

With this version, if you click the “Inspect selection” button, you’ll see the corresponding XML:

editor-hello-world

What’s new is the “Java” button.  If you click that, it’ll generate corresponding code, and display that in your web browser:

code-part1

:

code-part2

 

generate PDF output

The “PDF” button generates a PDF.  Either of your whole document, or your selection.

It uses our commercial PDF Converter so its probably most useful:

  1. if you want to evaluate that, or
  2. have found something which doesn’t convert correctly, and want technical support

It can use either a locally installed instance, or http://converter-eval.plutext.com

You configure that with the “Config” button.

The generated PDF will open in whatever Windows opens PDFs with for you.

document sanitisation/anonymisation

The idea of the “anonymise” button is to make it easy for you to email/publish a docx (or PDF) for support purposes, without giving away sensitive info.

Again, if nothing is selected, it’ll do your whole docx.  Otherwise, just your selection.

The results will be saved to a temporary local docx (so your source docx is unaffected), then opened.

If there is anything potentially sensitive the code can’t remove, it’ll let you know.

The code which does this is at https://github.com/plutext/docx4j/tree/master/src/main/java/org/docx4j/anon

easy docDefaults manipulation

Its sometimes useful to see/edit your doc defaults.

docdefaults

Your changes will open in a new temp docx.

For example, try changing font size (w:sz) from 22 to 42.

You can also make changes in Word’s Paragraph and Font dialogs by pressing the “Set as Default” button, then look here to see how that translates into XML.  Without this, its hard to be sure whether you’ve changed your default styles, or docDefaults!

Feedback and Comments

 

Any feedback, comments, requests for new features etc, please post in http://www.docx4java.org/forums/docx4jhelper-addin-f30/   Alternatively, Plutext customers can email support.

PDF/A-2b compliant Word to PDF

April 13th, 2016 by Jason

Plutext’s commercial PDF Word/docx Converter now produces fully PDA/A-2b compliant PDF output.

We say this having tested its output using http://verapdf.org/ “a purpose-built, open source, file-format validator covering all PDF/A parts and conformance levels.”

You can try our PDF Converter now, at http://converter-eval.plutext.com/

Aspose.Confusion in Words

September 26th, 2015 by Jason

In June, Aspose’s Shoaib Khan published a blog post purporting to cover features available in Aspose.Words for Java but not docx4j.

It is either breathtaking or amusing in its inaccuracy, depending on whether you think it was born of deceit or ineptitude.  Either way, its a caution to anyone considering drinking the Aspose.Kool-Aid!

Here I’ll go through his claims one by one.

As a general comment though, it is worth remembering that with docx4j, you can do pretty much anything the docx/pptx/xlsx file formats allow.  If docx4j doesn’t have a high level API for something you want to do, you can always implement it yourself, thanks to docx4j’s lower level JAXB-based APIs.  And docx4j is real ASLv2 open source, so you can use the source Luke!

Without further ado…

Set Page Borders.
Here Aspose seems to be talking about section properties (ie margins etc).

Shoaib implies you can’t control these in docx4j.  Of course you can!  You can add or remove sections, or modify the settings of an existing section.

Track Changes in Documents.
Their AcceptAllRevisions method is said to be similar to Word’s “Accept All Changes”.

Docx4j doesn’t provide a high level API for doing this (since users haven’t asked for it), but a user could implement this for his/herself easily enough using XSLT or docx4j’s TraversalUtil.  You could start with this XSLT

Using Control Characters.
This example is a bit bizarre, because in a docx, specific elements w:br and w:cr are used for line breaks;  OpenXML follows the usual XML rules for whitespace.

Split Tables.
This example shows the steps a user would follow to split a table into two.  Basically, clone the existing table to make a new table with the same properties, then move rows from the first table to the second.

Of course you can do the same thing in docx4j!

Repeat Table Header Rows on Pages.
This example is just about setting the header row property.  See http://stackoverflow.com/questions/14605264/repeat-table-header

Clone Documents.
Cloning a document.  Aspose suggest you can’t do this with docx4j? WTF?! OpcPackage’s clone() method has been there since 2.7.1

docx4j also includes code for making partial copies, where less than a full clone is required.

Moving the Cursor in Document.
OpenXML is of course XML, which is hierarchical.  docx4j uses JAXB to give you an object model representation of that.  The hierarchical structure is basically nested lists.  Lots of stuff boils down to finding a position in a list, and then inserting or deleting etc using the Java collections API.  To find that list/position, you’d typically use docx4j’s powerful traversal functionality.   Or you can use XPath (in JAXB, the objects are bound to the underlying XML).

Aspose has some notion of cursor position, so you can move to the start or end of the document.  This may appeal to people with a VBA background, but in practice it is of little use.

Protect Documents.
Whilst it is true that in docx4j 3.2.0 there was no high level API for the functionality Microsoft Word groups under Protect Document (Mark as Final, Encrypt with Password, Restrict Editing etc), this is available now in the 3.3.0 previews:

Working with Digital Signatures.
As with the other Protect Document features, there was no high level API in 3.2.0.  That’s not to say you couldn’t do it, but there’s a nice API for this in the commercial Enterprise Ed. (forthcoming v3.3.0)

Check Format Compatibility.
This seems to be restricted to knowing what type of document you are working with; Aspose says it doesn’t validate the file format.

Per the specs, an OPC package has a content type.  In docx4j, docx/dotx/docm etc are all represented by WordprocessingMLPackage, but you can distinguish between them by calling getContentType(); the value will be one of:

  • WORDPROCESSINGML_DOCUMENT
  • WORDPROCESSINGML_DOCUMENT_MACROENABLED
  • WORDPROCESSINGML_TEMPLATE
  • WORDPROCESSINGML_TEMPLATE_MACROENABLED

Load Text File.
Whoopey do.  Apparently you can import plain text using Aspose’s expensive software!

Of course that is trivial with docx4j.  Docx4j also supports converting altChunks to native WordML content.  For XHTML altChunks, you need docx4j-ImportXHTML;  support for altChunks of type docx is an Enterprise level feature.

Specify Default Fonts.
The way the default font works in WordprocessingML is more complicated than you might expect, in that there are a few different ways you could affect it (via the theme part or the styles part).

That said, in real life, this doesn’t tend to be a problem.  With docx4j, you can easily set  w:docDefaults/w:rPrDefault/w:rPr/w:rFonts in your styles part, if you want to.

Working with Tables:
Autofit Setting to Tables.
This is just about the tblLayout setting: w:tblPr/w:tblLayout/@w:type, which you can access via TblPr’s get/setTblLayout

Joining Tables in Document.
I can’t recall anyone ever asking for docx4j to provide a high level API to do this, but it could be added.  In the meantime, docx4j allows to you to do anything with tables which the file format allows, including joining tables.

Mail Merge
Mail Merge from XML Data Source.
docx4j provides a high level API for working with legacy MERGEFIELD fields.

If you wanted to fill those fields with data from XML, you could do that easily enough.

Where docx4j really shines though, is in its support for content control data binding.  In that approach, introduced by Microsoft in 2007, you have a bidirectional XPath mapping between content controls in the document, and an XML file.

If you are working with XML, and not forced to work with legacy MERGEFIELDs for some reason, content control data binding is the way to go.

Off topic: Eclipse’s maven from a command line?

June 16th, 2015 by Jason

You’ve installed Eclipse.

Eclipse includes maven (m2e).

Can you use that Maven from outside Eclipse, or do you need to install maven again/separately?

It turns out you can use it.  Whether its worth the effort or not is another question…

You launch maven using the plexus classworlds launcher.

That needs a config file.

The config file (call it m2.conf) contains something like:

main is org.apache.maven.cli.MavenCli from plexus.core

set maven.home default /home/jason/.m2

[plexus.core]
load /home/jason/eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*.jar

With that, from a project dir, the following is the equivalent of ‘mvn install’:

java -cp ../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/plexus-classworlds-2.5.1.jar:../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar  “-Dclassworlds.conf=m2.conf” org.codehaus.plexus.classworlds.launcher.Launcher install

You could make a shell script to do that.  And you could base the shell script on the one included in the maven distribution.  Or more sensibly, you’d just download  install and use  maven proper.

If you need it.  Right clicking on your project in Eclipse then Run As gives you a handy UI for maven-related stuff:

Finally, note it is  possible to avoid the plexus classworlds launcher by invoking MavenCli directly:

java -cp “../eclipse/plugins/org.slf4j.api_1.7.2.v20121108-1250.jar:../eclipse/plugins/org.eclipse.m2e.maven.runtime_1.5.1.20150109-1819/jars/*” org.apache.maven.cli.MavenCli install

If you are going to use that, you might wonder about setting maven.home with -Dmaven.home=/your/path/to/where immediately after the classpath

Content controls for business data connectivity

January 20th, 2015 by Jason

Sometimes, Word is a natural way for people to interact with back end applications (eg SAP).

This is particularly so when:

  • business data will be output to a Word document,
  • the user is more familiar with Word than the other system,
  • certain data updates may be required (and are permitted)

So maybe there are 4 high level categories:

  • apps which support commercial transactions (a recipient will receive a Word document), eg
    • employment onboarding (letter of employment)
    • invoicing
  • apps with a reporting component:  is the report format a natural interface if it was made bi-directional?
  • workflow/BPM systems which present documents (work orders, proposals, approvals etc)
  • policy/procedures in regulated industries, where a worker must follow a series of steps.  Can present that in a docx; the worker can tick the steps off as they do them
    • related training scenarios?

Microsoft had an emphasis on what they then called “Office Business Applications” back around the Office 2007 launch.  Fast forward to today, and “business connectivity services” are part of SharePoint.

But you can achieve the same sort of thing without SharePoint, using docx4j and data bound content controls.

Once you have your back end data in an XML format (and there are many tools/techniques to help with this), you can use content controls to bind what the user sees in Word to elements in that XML.

The beauty of it is that the binding is bi-directional, so if the user edits the document, the XML is updated (ie it stays in synch).

After the user has made their edit, you can update the back end application.  Typically, you’d do this after they saved & closed the document (ie outside Word, using docx4j), but you could also do it from within Word (a less good approach, but still, an option).

What if there is some data which the user shouldn’t be able to edit?  You simply lock the content control to prevent editing.

To quickly try out this approach, put together some sample XML, then upload it as explained here, to get a docx you can experiment with.

We’d loved to hear about how you might use this approach?