Improve performance and memory usage with large templates

Posted: Wed Jul 11, 2012 3:13 am
by olabrosse
Hi,

I've just posted a question on StackOverflow and thought I'd give a heads-up in here. Feel free to answer on StackOverflow for some rep. :)

http://stackoverflow.com/questions/11417390/how-to-improve-docx4j-performance-and-memory-usage-with-large-template-and-numer

We're developing an Eclipse application that allows exporting large EMF models to Word format using docx4j. The data can contain many rendered graphs in PNG format, and the Word template can be pretty large, with multiple repeat and conditional sections that include multiple picture content controls.

The specific problem we have is that the applyBindings() process takes too long (over an hour), partly due to memory limitations.

Is there a proper way to deal with such a situation? Maybe using a different JAXB implementation or chunking the workload?

Thanks in advance for any help with this matter.


-Olivier

Re: Improve performance and memory usage with large template

Posted: Wed Jul 11, 2012 8:51 am
by jason
Hello Olivier

Sounds like an interesting application, but "over an hour" .. oh my! For my use cases, the binding process is a couple of seconds or less.

Can you give us a sense of the number of pages in the Word document, and the number of content controls?

If you could make one of the documents available to me (including CustomXML) that would obviously make it easier to see where the time is going.

Some thoughts in the interim:

- I have docx4j working with the MOXy JAXB implementation, but the code is not committed yet, and I'm not sure how its performance/memory usage compares to the reference implementation. Unless the problem is during marshalling/unmarshalling, it shouldn't make any difference, since it is the generated objects which take up memory (e.g. org.docx4j.wml.*) and these don't change with the JAXB implementation.

- the docx could be unzipped/shredded into an eXist XML database, and then docx4j could work with a bit at a time. The first step towards this is an eXist proc which unzips the docx when it is PUT, and re-zips it on GET. eXist already has zip and unzip, so that shouldn't be hard. If that much is performant, then we could look at a binding handler which works with that.

cheers .. Jason

Re: Improve performance and memory usage with large template

Posted: Fri Jul 13, 2012 8:39 am
by olabrosse
Hi Jason,

In our use cases we can have dozens of repeat content controls containing conditionals and other repeats, easily growing to a few thousand controls. The picture controls alone number over 500. The end result is a 50+ page document.

One thing is for sure, and pretty obvious too: the biggest part of the job is binding the pictures.

Oh, and another thing: it only takes over an hour on slower computers. On my Core i7 laptop it took around 10-15 minutes, but that's still a good 5-10 times slower than we'd like.

I'm working on generating a test document for you to take a look. I will email it to you soon, or tomorrow morning in the worst case.

Thanks!

-Olivier

Re: Improve performance and memory usage with large template

Posted: Fri Jul 13, 2012 9:06 am
by olabrosse
Actually, as I'm generating the test document, I can see that it is not the applyBindings() method that takes time. It is the creation of the RelationshipsPart elements for the pictures. At 500+ pictures and half a second per picture, you can see how this can easily take over 5 minutes.

I'll still email you the test document, in case it has something to do with the structure of things. If that turns out to be the case, maybe you can advise me on how to improve my process.

Cheers!

Re: Improve performance and memory usage with large template

Posted: Fri Jul 13, 2012 9:12 am
by olabrosse
Ok, I feel stupid now. I had made code modifications that prevented hot-swapping, and it ended up doing the applyBindings() call anyways.

It is indeed the applyBindings() process that takes very long. Sorry for the confusion.

Re: Improve performance and memory usage with large template

Posted: Tue Jul 17, 2012 4:03 pm
by jason
On my development machine, with your docx the four steps take the
following time (in ms, indicative figures from one run):

OpenDoPEHandler: 31303
OpenDoPEIntegrity: 3626
BindingHandler.applyBindings: 180726
RemovalHandler: 799

As you have noted, BindingHandler.applyBindings is the slowest (180
seconds), so I've looked at that step only.

When an image is added, we work out what sort of image it is (png
etc), and its dimensions.

org.apache.xmlgraphics.image.loader.ImageManager.getImageInfo is used
to do this, and to invoke that, a temp file is required.

It turns out that this is the bottleneck.

Avoiding this, the applyBindings step takes 12 seconds (15x faster).
I demonstrated this by assuming the images were all PNG, and that
their dimensions were known, then running the process.

I did a proof of concept of a further optimisation today which can do
that step for your sample docx in 6 seconds (~ 30x faster than 180
sec).

To avoid the ImageManager.getImageInfo bottleneck we need a better way
of working out image type and dimensions, either by optimising
ImageManager.getImageInfo to make it faster (including by avoiding the
temp file), or by using some alternative code.
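
As a rough illustration of the "alternative code" option: for PNG input, the width and height sit at fixed offsets in the IHDR chunk, so they can be read straight from the in-memory bytes with no temp file. This is only a sketch under stated assumptions. PngHeaderProbe and pngDimensions are hypothetical names, not part of docx4j or XML Graphics Commons, and the code handles PNG only and ignores resolution (DPI) information.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper (not docx4j API): reads PNG dimensions from a byte array.
public class PngHeaderProbe {

    // The fixed 8-byte signature that starts every PNG file.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'
    };

    /**
     * Returns {width, height} in pixels, or null if the bytes do not
     * start with a PNG signature followed by an IHDR chunk.
     */
    public static int[] pngDimensions(byte[] data) {
        // Layout: signature (8) + chunk length (4) + chunk type (4)
        //         + width (4) + height (4) = 24 bytes.
        if (data == null || data.length < 24) {
            return null;
        }
        if (!Arrays.equals(Arrays.copyOfRange(data, 0, 8), PNG_SIGNATURE)) {
            return null;
        }
        // The first chunk of a valid PNG must be IHDR.
        if (data[12] != 'I' || data[13] != 'H' || data[14] != 'D' || data[15] != 'R') {
            return null;
        }
        // Width and height are big-endian 32-bit integers at offsets 16 and 20;
        // ByteBuffer is big-endian by default.
        ByteBuffer buf = ByteBuffer.wrap(data, 16, 8);
        int width = buf.getInt();
        int height = buf.getInt();
        return new int[] { width, height };
    }
}
```

A caller holding the image bytes from the bound data could try pngDimensions(bytes) first and fall back to the existing ImageManager.getImageInfo path whenever it returns null, so other image formats keep working as before.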