Memory usage optimization

Posted: Thu Oct 03, 2013 9:10 pm
by antoine
Hi,

I'm generating documents using the following process:

- I have a docx template with content control bindings and an empty custom XML part
- I marshal the object graph using XStream into a DomWriter
- I load the docx using LoadFromZipNG
- I inject the serialized object graph into the WordMLDocument using a CustomXMLDataStorage and calling setDocument on it with the generated DOM Document.
- I apply bindings using BindingHandler.applyBindings
- I save the document

I have many image bindings, and when I inject too many images I get out-of-memory exceptions. I'm looking for a strategy to avoid these exceptions and to handle the generation process without having all images loaded into memory at once.
What would you recommend for this? I have seen a post mentioning eXist DB (data-binding-java-f16/improve-performance-and-memory-usage-with-large-templates-t1138.html#p3910) but I don't really understand the suggested approach.
What I was thinking of is configuring XStream to write the serialized XML directly to a file on the file system, so that the marshalling step could work without loading all images into memory. But how can I inject this into the document and apply the bindings without loading the whole XML data at once?

Thanks for your help.
Antoine

Re: Memory usage optimization

Posted: Thu Oct 03, 2013 9:26 pm
by jason
docx4j current dev code contains UnzippedPartStore, which can be used to save a docx unzipped in the file system.

Assuming the image parts are what is consuming the most memory, perhaps BinaryPart could be modified so that images are saved to the file system as they are created, then cleared from memory.

When you save the rest of the document (via modified code), you'd have the whole thing unzipped on the file system.

Then you just need to zip it up, and rename to docx.
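
A minimal sketch of that last zipping step, using only java.util.zip from the JDK (not docx4j). The class name DocxZipper and both path parameters are hypothetical, and it assumes the directory already contains a complete unzipped package:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: zips an unzipped docx directory back into one file. */
final class DocxZipper {

    /** Writes every regular file under unzippedDir into targetDocx as a zip entry. */
    static void zipDirectory(Path unzippedDir, Path targetDocx) throws IOException {
        // Collect the file list first, so the walk is finished before writing starts
        List<Path> files;
        try (Stream<Path> walk = Files.walk(unzippedDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(targetDocx);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Entry names are relative to the package root and use forward slashes
                String name = unzippedDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip); // streams the file, so large images are never held whole
                zip.closeEntry();
            }
        }
    }
}
```

Because each file is streamed straight into the archive, this step should not bring the images back into memory.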

Re: Memory usage optimization

Posted: Fri Oct 04, 2013 12:44 am
by antoine
Thanks for your quick reply.

My images are injected through XML in the custom XML part, so they are encoded in Base64. How could I store them in a BinaryPart? Is it done automatically when applying the bindings?

Does the WordMLPackage load the whole content of all parts into memory?

Regards,
Antoine

Re: Memory usage optimization

Posted: Fri Oct 04, 2013 7:37 am
by jason
antoine wrote: Is it done automatically when applying the bindings?

Does the WordMLPackage loads into memory the whole content of all parts?


Yes and yes.

BinaryPart does contain:

    /**
     * Store buffer thru soft reference so it could be
     * unloaded by the java vm if free memory is low.
     */
    private Reference<ByteBuffer> bbRef = null;


though this hasn't been reviewed for a while.
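
To illustrate the idea behind that field, here is a small sketch of a soft-reference holder backed by a file. It is not docx4j's BinaryPart: the class SoftBinaryHolder and its methods are hypothetical, and it assumes the bytes have already been saved to the backing file so they can be reloaded on demand.

```java
import java.io.IOException;
import java.lang.ref.SoftReference;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical holder (not docx4j's BinaryPart): caches file bytes behind a SoftReference. */
final class SoftBinaryHolder {

    private final Path backingFile;
    // The JVM may clear a soft reference when memory is low
    private SoftReference<ByteBuffer> bbRef = new SoftReference<>(null);

    SoftBinaryHolder(Path backingFile) {
        this.backingFile = backingFile;
    }

    /** Returns the bytes, re-reading the backing file if the cached copy is gone. */
    synchronized ByteBuffer getBuffer() throws IOException {
        ByteBuffer bb = bbRef.get();
        if (bb == null) {
            bb = ByteBuffer.wrap(Files.readAllBytes(backingFile));
            bbRef = new SoftReference<>(bb);
        }
        // Hand out a read-only view with its own position
        return bb.asReadOnlyBuffer();
    }

    /** Drops the cached copy explicitly instead of waiting for memory pressure. */
    synchronized void release() {
        bbRef.clear();
    }

    /** True while a cached copy is still reachable through the soft reference. */
    synchronized boolean isLoaded() {
        return bbRef.get() != null;
    }
}
```

A soft reference only helps if nothing else keeps a strong reference to the same data; any other object that still holds the buffer or its byte array keeps it alive.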

Re: Memory usage optimization

Posted: Sat Oct 05, 2013 12:40 am
by antoine
OK, it's good that BinaryPart only maintains a soft reference to the image data.

So, I have a docx template + an XML file containing the data to inject in the custom xml part. If I understand it right I should:

1) Unzip the docx (using the UnzippedPartStore)
2) Inject the XML data into the custom XML part
3) Modify BinaryPart so that image parts are saved into the unzipped docx as soon as they are created while the bindings are applied, and can then be garbage collected thanks to the soft reference
4) Rezip the docx

Is this right? Point 2 is still unclear to me. How can I inject/apply the XML data without loading it into memory?

Re: Memory usage optimization

Posted: Sun Oct 06, 2013 7:57 pm
by jason
I thought I'd try out the approach I suggested. (Your step 3)

Following https://github.com/plutext/docx4j/commi ... 7ac5e1a5e8 there is a sample ContentControlsApplyBindingsIncrementalSave.java:

https://github.com/plutext/docx4j/blob/ ... lSave.java

I've left out the last step of zipping up.

The other part of making it work is https://github.com/plutext/docx4j/blob/ ... rXSLT.java at line 772:

    // In certain circumstances, save it immediately
    if (wmlPackage.getTargetPartStore() != null
            && wmlPackage.getTargetPartStore() instanceof UnzippedPartStore) {
        log.debug("incrementally saving " + imagePart.getPartName().getName());
        ((UnzippedPartStore) wmlPackage.getTargetPartStore()).saveBinaryPart(imagePart);
        // remove it from memory
        ByteBuffer bb = null;
        imagePart.setBinaryData(bb);//new byte[0]);
        imagePart.setImageInfo(null); // this might help as well
    }


The sample doesn't cover "Inject the XML data into the custom XML part"; normally you'd do that as shown in
https://github.com/plutext/docx4j/blob/ ... geXML.java
The current sample uses the Docx4J facade; for the non-facade version, see https://github.com/plutext/docx4j/blob/ ... geXML.java

I haven't tackled the question of how to avoid loading the whole XML file into memory. If that proves necessary, you could create a custom implementation of interface CustomXmlDataStorage. As to the actual implementation, try Googling 'java xslt huge memory', or maybe this is where eXist comes in?
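
As one possible building block (not a CustomXmlDataStorage implementation, and not something docx4j provides), the JDK's StAX API can pull a Base64-encoded image out of a large XML file and decode it to disk piece by piece, without building a DOM. In this sketch the class name, the method names and the element name passed in are all hypothetical, and the element is assumed to contain only Base64 text:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: streams one Base64 element of an XML file into a binary file. */
final class Base64ElementExtractor {

    /** Decodes the first element named elementName into target; returns bytes written, or -1 if absent. */
    static long extractFirst(Path xmlFile, String elementName, Path target)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // With coalescing off, parsers may report long text as several smaller events
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try (InputStream in = Files.newInputStream(xmlFile)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        return decodeTextTo(reader, target);
                    }
                }
                return -1;
            } finally {
                reader.close();
            }
        }
    }

    private static long decodeTextTo(XMLStreamReader reader, Path target)
            throws IOException, XMLStreamException {
        Base64.Decoder decoder = Base64.getDecoder();
        StringBuilder pending = new StringBuilder(); // holds only the not-yet-decoded characters
        long written = 0;
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.END_ELEMENT) {
                    break; // assumption: the element has no child elements
                }
                if (event != XMLStreamConstants.CHARACTERS && event != XMLStreamConstants.CDATA) {
                    continue;
                }
                char[] chars = reader.getTextCharacters();
                int start = reader.getTextStart();
                int end = start + reader.getTextLength();
                for (int i = start; i < end; i++) {
                    if (!Character.isWhitespace(chars[i])) {
                        pending.append(chars[i]); // line breaks inside the Base64 text are dropped
                    }
                }
                // Base64 decodes in groups of four characters; keep any remainder for the next event
                int usable = pending.length() - pending.length() % 4;
                written += decodeAndWrite(decoder, pending.substring(0, usable), out);
                pending.delete(0, usable);
            }
            written += decodeAndWrite(decoder, pending.toString(), out);
        }
        return written;
    }

    private static int decodeAndWrite(Base64.Decoder decoder, String base64, OutputStream out)
            throws IOException {
        byte[] bytes = decoder.decode(base64);
        out.write(bytes);
        return bytes.length;
    }
}
```

How much memory this saves depends on the parser actually delivering the text in chunks, so it would be worth measuring with a realistic file before building on it.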

How big is your XML file? How many images, and what is their average size?

Re: Memory usage optimization

Posted: Thu Oct 10, 2013 8:19 pm
by antoine
Thank you, I'll try this.