OutOfMemoryError

Posted: Fri Sep 25, 2009 2:51 pm
by holgerschlegel
Hi,

I'm currently trying to load a 190MB docx file using docx4j. Even with 1.5GB of memory given to the VM, loading fails with an OutOfMemoryError.
Does docx4j load all parts when the package is loaded, or does it load the parts on demand?

If all parts are loaded into memory with the package, could that be changed to load on demand, using a memory-sensitive cache for loaded parts?

Regards
Holger

Re: OutOfMemoryError

Posted: Sat Sep 26, 2009 1:24 am
by jason
Hello Holger

holgerschlegel wrote: Does docx4j load all parts when the package is loaded, or does it load the parts on demand?


docx4j loads all parts of the package; see LoadFromZipFile and LoadFromZipFileNG.

For starters, you'll be better off with the old LoadFromZipFile, since it doesn't have the intermediate step of copying all the parts into byte arrays (LoadFromZipFileNG does that as part of supporting loading a docx from an input stream).

holgerschlegel wrote: If all parts are loaded into memory with the package, could that be changed to load on demand, using a memory-sensitive cache for loaded parts?


Yes, with relatively few changes you could set things up so that only, for example, the main document part, its rels, and say the styles and numbering parts were loaded into memory up front.

There is a class Parts (org.docx4j.openpackaging.parts.Parts) which stores the loaded parts. Depending on exactly what you want to do, you could set it up with a reference to the original zip file, so it could find them on demand (SaveToZipFile will need all the parts).

The other thing you could look at is saving (for example just the changed Main Document Part) back to the original file (or a copy of it). I haven't looked to see whether java's Zip implementation supports this; if it doesn't there are several other implementations out there, one of which might.
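On the question of writing a single changed part back into the existing archive: on Java 7 and later, the JDK's zip file system provider can be used for this. The sketch below is not from docx4j; the class and method names (ZipPartUpdater, replaceEntry) are made up for illustration. As far as I know, the provider rewrites the archive when the file system is closed, so this saves memory rather than disk I/O.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of docx4j. */
class ZipPartUpdater {

    /** Overwrites one existing entry of a zip; the other entries are not touched by this code. */
    static void replaceEntry(Path zipPath, String entryName, byte[] newContent)
            throws IOException {
        // "jar:" + file URI selects the JDK's zip file system provider.
        URI uri = URI.create("jar:" + zipPath.toUri());
        // Empty environment: no "create" option, so the zip must already exist.
        Map<String, String> env = new HashMap<>();
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, env)) {
            Path entry = zipFs.getPath(entryName);
            // Default options are CREATE, TRUNCATE_EXISTING, WRITE.
            Files.write(entry, newContent);
        } // closing the file system flushes the change to the archive on disk
    }
}
```

For a docx, the call might look like ZipPartUpdater.replaceEntry(Paths.get("big.docx"), "word/document.xml", bytes), where bytes is the marshalled main document part; working on a copy of the file first seems prudent.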

cheers .. Jason

Re: OutOfMemoryError

Posted: Mon Sep 28, 2009 5:01 pm
by holgerschlegel
I've created a small patch that solved the problem for me.

The class BinaryPart has been modified to load the part data ByteBuffer on demand. To do that, the new method setBinaryDataRef(zipFileName, resolvedPartUri) has to be called on the part instead of setBinaryData(is). The method getBuffer() performs the load if required. The ByteBuffer is held inside a SoftReference (instance variable bbRef) to allow the JVM to discard it if free memory is low.

To use the new feature I've also made small changes to the class LoadFromZipFile so that it sets the reference data instead of loading the data of binary parts directly. Because loading docx from files is the only thing I need, I've not changed the class LoadFromZipFileNG or any other loader. I've not tested saving a docx package to file, but due to the way the changes have been implemented in the BinaryPart class, no change should be required for save.
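The idea can be sketched roughly as follows. This is not the actual patch and not docx4j's BinaryPart: LazyZipPart is a hypothetical stand-in, and only the names getBuffer() and bbRef are taken from the description above. It assumes the part can be addressed by a zip file name plus an entry name, and that the zip file stays in place for as long as the part is in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.lang.ref.SoftReference;
import java.nio.ByteBuffer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical stand-in for a binary part; not the docx4j BinaryPart class. */
class LazyZipPart {
    private final String zipFileName; // path of the package on disk
    private final String entryName;   // name of the part inside the zip

    // Softly reachable, so the JVM may clear it when memory runs low.
    private SoftReference<ByteBuffer> bbRef = new SoftReference<>(null);

    LazyZipPart(String zipFileName, String entryName) {
        this.zipFileName = zipFileName;
        this.entryName = entryName;
    }

    /** Returns the part data, (re)loading it from the zip if it is not cached. */
    synchronized ByteBuffer getBuffer() {
        ByteBuffer bb = bbRef.get();
        if (bb == null) {
            bb = ByteBuffer.wrap(readEntry());
            bbRef = new SoftReference<>(bb);
        }
        // Read-only view with its own position, so callers don't disturb each other.
        return bb.asReadOnlyBuffer();
    }

    private byte[] readEntry() {
        try (ZipFile zip = new ZipFile(zipFileName)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IllegalStateException("No such entry: " + entryName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes(); // Java 9+
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing is read when the object is constructed; the first getBuffer() call opens the zip, and later calls reuse the cached buffer until the garbage collector clears the soft reference. A save routine that goes through getBuffer() would then work unchanged, at the cost of re-reading any parts that had been discarded.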

If suitable, feel free to add the patch on the basis of the document docx4j_IndividualContributor.docx.

Regards
Holger

Edit: Maybe adding a configuration switch for the changed behavior to the LoadFromZipFile class would be a good idea.

Re: OutOfMemoryError

Posted: Thu Oct 01, 2009 2:03 pm
by jason
holgerschlegel wrote: If suitable, feel free to add the patch on the basis of the document docx4j_IndividualContributor.docx.

Regards
Holger

Edit: Maybe adding a configuration switch for the changed behavior to the LoadFromZipFile class would be a good idea.


Thanks Holger, this is applied as r909, including your suggested configuration switch, which I've called setConserveMemory and which defaults to false (r910).