Archive for the ‘docx to PDF’ Category

Office pptx/xlsx/docx to PDF in docx4j 8.2.3

September 5th, 2020 by Jason

docx4j 8.2.3 offers three distinct ways to convert Microsoft Word docx documents to PDF. Two of them can also convert xlsx to PDF, and one of them pptx.

The three approaches:

  • export-fo: the content is converted to XSL FO, and from there, to PDF (or any of the other formats supported by Apache FOP)
  • documents4j: since 8.2.0, use Microsoft Word to do the conversion
  • via-Microsoft-Graph: new in 8.2.3, use java-docx-to-pdf-using-Microsoft-Graph to do the conversion

So which should you choose? The following table covers some of the things you might want to consider:

| | export-FO | Microsoft Graph | documents4j |
|---|---|---|---|
| Overview | Converts docx to XSL FO, then uses Apache FOP to convert to PDF | Uses Microsoft’s cloud | Uses your Microsoft Office installation |
| Fidelity | Suitable for simple documents (text, tables, supported image types, headers/footers) | 100% (Microsoft’s fidelity) | 100% (Microsoft’s fidelity) |
| Suitability | simple docx | docx, pptx, xlsx | docx, xlsx |
| License considerations | ASL v2 | Refer to applicable Microsoft cloud terms | Refer to the Microsoft EULA governing your Office install (increasingly restricted with each release) |
| Cost | Free | Microsoft cloud costs | (Microsoft Office) |
| Confidentiality | documents don’t leave your server | documents go to Microsoft cloud | documents need not leave your servers |
| Other advantages | Fast XSL FO/PDF templating for high volume PDF creation; open source, so can be extended | Microsoft encourages this approach; Microsoft cloud handles scalability | Can update a docx table of contents; can convert RTF and binary .doc |
| Other disadvantages | Two step (docx to XSL FO to PDF) processing is slower (except for XSL FO templating) | Dependency on 3rd party cloud; currently can’t update docx table of contents (vote to fix) | documents4j doesn’t support pptx; not supported by Microsoft |
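
If it helps to see the comparison as a decision rule, here is a rough Python sketch that filters the three approaches by input format, fidelity, and confidentiality. It only encodes what the comparison above says; the `Approach` class, the `candidates` function, and their parameter names are made up for illustration and are not part of docx4j.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    formats: frozenset          # input formats the comparison lists for this approach
    full_fidelity: bool         # "100% (Microsoft's fidelity)" in the comparison
    stays_on_your_servers: bool # documents do not have to leave your servers


# Hypothetical encoding of the comparison; rtf/doc come from the
# "Can convert RTF and binary .doc" advantage listed for documents4j.
APPROACHES = (
    Approach("export-FO", frozenset({"docx"}), False, True),
    Approach("Microsoft Graph", frozenset({"docx", "pptx", "xlsx"}), True, False),
    Approach("documents4j", frozenset({"docx", "xlsx", "rtf", "doc"}), True, True),
)


def candidates(file_format, need_full_fidelity=False, must_stay_on_servers=False):
    """Return the names of the approaches that satisfy the given constraints."""
    fmt = file_format.lower().lstrip(".")
    return [
        a.name
        for a in APPROACHES
        if fmt in a.formats
        and (a.full_fidelity or not need_full_fidelity)
        and (a.stays_on_your_servers or not must_stay_on_servers)
    ]
```

For example, `candidates("pptx")` leaves only Microsoft Graph, while `candidates("docx", must_stay_on_servers=True)` leaves export-FO and documents4j. Licensing and cost still need a human decision, so treat this as a first filter only.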

Scaling the PDF Converter with AWS Fargate

March 12th, 2018 by Jason

This is a walkthrough of deploying the PDF Converter on Amazon’s Fargate.

What is Fargate?  New since November 2017, it’s an easy way of deploying containers on AWS ECS.  You don’t have to manage the underlying EC2 instances, and the wizard takes care of the setup, so you can be up and running in less than 20 minutes!

With Fargate, you make a “cluster” which you can easily size to suit a known conversion volume, or have it auto-scale with load.  Largely thanks to Docker!

This walkthrough assumes you already have an AWS login.

To get things working:

  1. there are 4 steps in Amazon’s firstRun wizard: https://console.aws.amazon.com/ecs/home?region=us-east-1#/firstRun
  2. then you configure the health check path

But first, check things are configured correctly for ECS in your Amazon account.  Since Fargate currently only works in N. Virginia (us-east-1), visit https://console.aws.amazon.com/ecs/home?region=us-east-1#/getStarted

ECS FirstRun Wizard

If you don’t already see the “Getting Started” wizard pictured below, click https://console.aws.amazon.com/ecs/home?region=us-east-1#/firstRun (this is easier than “create new cluster” at https://console.aws.amazon.com/ecs/home?region=us-east-1#/clusters/create/new since it also creates a Service and Task, but more importantly, your load balancer).

fargate-firstrun-step1

In the “Container definition” section, click the “configure” button on the “custom” image.

In the image field, type plutext/plutext-document-services:2.1-0, and set the other values as per the image below:

container-settings-dockerhub

Next, in “Task definition”, edit the task definition name, to say: pds-task-definition

Click next.

Service

On the “service” screen, click “edit” to set the number of tasks to 2, and choose “Application Load Balancer”.

service

Click next.

Cluster

On this screen, just change the cluster name to: plutext-document-services

When you click next, the review screen should show:

review

Click “Create”.  The wizard will perform various tasks; it might take 3 or 4 minutes.

When it is done, you should see:

preparing-service

Click the “view service” button.

Health Check

You need to set the health check path in your load balancer.  (Unfortunately, Fargate currently doesn’t populate this from the HEALTHCHECK statement in your Dockerfile.)

So in your cluster, click your service, where you’ll see the load balancer target group:

cluster-service

Click that.

Now, you’re in your load balancer, where you can click “edit health check” and enter path:  /v1/00000000-0000-0000-0000-000000000000/ping

The result should be:

health-check

Before you go back to your service, click on the load balancer itself, and make a note of its DNS name.   You’ll see the host name there in the basic configuration:

alb-hostname

Now if you go back to your service, on the “tasks” tab, you should see:

tasks-status

i.e. “RUNNING”
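
With the tasks running and the load balancer's DNS name noted, you can also probe the health check path yourself. The following is a minimal Python sketch using only the standard library. The function names are made up for illustration, and treating only HTTP 200 as healthy is an assumption on our part; your target group's success codes may be configured differently.

```python
import urllib.error
import urllib.request

# All-zero identifier taken from the health check path used in this post.
PATH_ID = "00000000-0000-0000-0000-000000000000"


def ping_url(host, path_id=PATH_ID, port=80):
    """Build the health check URL for the load balancer's DNS name."""
    return f"http://{host}:{port}/v1/{path_id}/ping"


def is_healthy(host, timeout=5.0):
    """Return True if a GET on the ping path answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(ping_url(host), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and 4xx/5xx responses.
        return False
```

Call `is_healthy("<your load balancer DNS name>")`; a `False` result right after creation may just mean the tasks are still starting.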

Try it out!

To convert a document, you need the DNS host name of the load balancer you made a note of above.  Now you can test with something like:

curl -v -X POST --data-binary @HelloWorld.docx -o out.pdf http://EC2Co-EcsEl-1N1ULP12K5TGG-2127307716.us-east-1.elb.amazonaws.com:80/v1/00000000-0000-0000-0000-000000000000/convert

Check for “200 OK” and try opening out.pdf.
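
If you'd rather script the test than use curl, the same request can be sketched in Python with the standard library. This is an illustrative sketch only: `convert_url` and `convert_to_pdf` are hypothetical helper names, the all-zero identifier is simply copied from the curl command above, and no Content-Type header is set explicitly, mirroring that command.

```python
import urllib.request

# All-zero identifier copied from the curl example in this post.
PATH_ID = "00000000-0000-0000-0000-000000000000"


def convert_url(host, path_id=PATH_ID, port=80):
    """Build the convert endpoint URL for the load balancer's DNS name."""
    return f"http://{host}:{port}/v1/{path_id}/convert"


def convert_to_pdf(host, docx_path, pdf_path, timeout=120.0):
    """POST the raw docx bytes, save the response body, and return the HTTP status."""
    with open(docx_path, "rb") as source:
        payload = source.read()
    request = urllib.request.Request(convert_url(host), data=payload, method="POST")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # so reaching the next lines means the server answered successfully.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        status = response.status
        body = response.read()
    with open(pdf_path, "wb") as target:
        target.write(body)
    return status
```

For example, `convert_to_pdf("<your load balancer DNS name>", "HelloWorld.docx", "out.pdf")` should return 200 if the conversion worked, after which you can open out.pdf as before.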

Next steps

In our next post, we’ll configure HTTPS, and in the one after that, we’ll add a license key.