Page 1 of 1

Heap space use on load docx

PostPosted: Wed Dec 02, 2020 6:31 am
by guille
Hi

We are usind docx4j for replacing variable replacement in docx files. Until now has worked fine in production for several months. But today we hit this error:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
at org.docx4j.openpackaging.io3.stores.ZipPartStore.getBytesFromInputStream(ZipPartStore.java:191)
at org.docx4j.openpackaging.io3.stores.ZipPartStore.<init>(ZipPartStore.java:154)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:559)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:410)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:376)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:341)
at org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(WordprocessingMLPackage.java:180)

The file is a 4.0MB word with several images. When loading it the Heap of the JVM sky rockets from 100Mb to more than 300Mb. 200MB heap for loading a 4Mb file?

Getting the heap from my local machine:

Heap Before loading file
Image

Heap after loading
Image

Thats 700Mb of byte[] created when loading a 4Mb file!!
Image

Image

The calling code. We arenusing Spring Resource class. Files are red from a remote server, so they are not local files.:

Code: Select all
private byte[] replaceVars(Resource word, HashMap<String, String> mappings) {
      if (logger.isDebugEnabled()) {
         logger.debug("replaceVars(Resource word={}, HashMap<String,String> mappings={}) - start",
               word.getFilename(), mappings);
      }

      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      InputStream is = null;
      try {
         is = word.getInputStream();
         WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
         VariablePrepare.prepare(wordMLPackage);
         wordMLPackage.getMainDocumentPart().variableReplace(mappings);
         wordMLPackage.save(baos);
         return baos.toByteArray();
      } catch (Exception e) {
         logger.error("replaceVars(Resource, HashMap<String,String>)", e);
      } finally {
         try {
            is.close();
            baos.close();
         } catch (IOException e) {
            logger.error("replaceVars(Resource, HashMap<String,String>)", e);
         }
      }
      return null;
   }



There is clearly something odd here. Any hints? I will be debbing the load code to try to figure out what may be happening.

Edit1:
I downloaded docx4j code. This is log of a simple Junit that runs my code with mockito.

Heap size before MB: 238
01-12-2020 18:19:52.684 DEBUG o.g.documents.DocumentServiceImpl.replaceVars - replaceVars(Resource word=null, HashMap<String,String> mappings={fecha=null, secuencial=1}) - start
01-12-2020 18:19:52.684 DEBUG o.g.documents.DocumentServiceImpl.replaceVars - replaceVars(Resource word=null, HashMap<String,String> mappings={fecha=null, secuencial=1}) - start
01-12-2020 18:19:52.700 INFO o.d.o.io3.stores.ZipPartStore.<init> - Start reading zip...
01-12-2020 18:19:52.707 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size before read MB: 238
01-12-2020 18:19:58.811 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size after read MB: 726
Heap total after test MB: 3390

Changing /docx4j-core/src/main/java/org/docx4j/openpackaging/io3/stores/ZipPartStore.java getBytesFromInputStream to this code:

Size can be read from the entry.getSize().
Code: Select all
   private byte[] getBytesFromInputStream(InputStream is, int size)
         throws Exception {

         if (size == -1) {
            BufferedInputStream bufIn = new BufferedInputStream(is);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            BufferedOutputStream bos = new BufferedOutputStream(baos);
            int c = bufIn.read();
            while (c != -1) {
               bos.write(c);
               c = bufIn.read();
            }
            bos.flush();
            baos.flush();
            //bufIn.close(); //don't do that, since it closes the ZipInputStream after we've read an entry!
            bos.close();
            return baos.toByteArray();
         }else {
            byte[] targetArray = new byte[size];
            int read = is.read(targetArray);
            int offset = read;
            while (read != -1) {
               read = is.read(targetArray,offset, size-offset);
               offset += read;
            }
            return targetArray;
         }
      }


Get some interesting result, but more testing is needed to ensure is not luck.
01-12-2020 20:14:33.849 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size before read MB: 238
01-12-2020 20:14:33.849 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size before read MB: 238
01-12-2020 20:14:34.608 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size after read MB: 398

Will keep looking at this and update here with more info.

Best regards,
Guillermo

Re: Heap space use on load docx

PostPosted: Thu Dec 03, 2020 12:10 am
by guille
More info:

The change on the ZipPartStore looks to improve performance on loading the file and somewhat heap usage improvement. But heap keeps growing on OpcPackage load.

Adding this on the
Code: Select all
   
partLoader.getPartByteArra().entrySet().stream().forEach(p->{
log.info("Part {} size {} KB",p.getKey(), p.getValue().getLength() / (1024));
});

I see that each image weights around 33Mb on memory

02-12-2020 10:05:19.915 INFO o.d.o.packages.OpcPackage.lambda$0 - Part word/media/image5.emf size 32603 KB
02-12-2020 10:05:19.916 INFO o.d.o.packages.OpcPackage.lambda$0 - Part word/media/image9.emf size 32761 KB
02-12-2020 10:05:19.916 INFO o.d.o.packages.OpcPackage.lambda$0 - Part word/media/image4.emf size 32682 KB

How can a 4Mb file contain several uncompresed 32Mb images? I am missing something?

Re: Heap space use on load docx

PostPosted: Wed Dec 09, 2020 6:21 am
by jason
Hi Guillermo,

Could you please share a sample docx which exhibits this behaviour? word/media/image4.emf looks like an EMF containing an image. For the purposes of sharing, you could potentially delete any text in the docx, retaining just the EMF image.

cheers .. Jason

Re: Heap space use on load docx

PostPosted: Thu Dec 10, 2020 7:16 am
by guille
Hi Jason,

You can download it from: https://coding12.cl/test.docx

Re: Heap space use on load docx

PostPosted: Fri Dec 11, 2020 10:51 am
by jason
Hi Guillermo,

In your docx, what look like paragraphs of text are actually EMF objects.

I extracted/saved one of them (using unzip, not docx4j).

In line with your observations, this was 32,604KB. When I zipped it up again, it compressed to 267KB.

So what you are seeing is very high compression by zip.

By the way we do have some protection against zip bomb attacks in policePartSize https://github.com/plutext/docx4j/blob/ ... .java#L129

But your content looks legitimate. What is creating these EMF objects? I'd recommend you fix that to emit ordinary docx paragraph content.

cheers .. Jason

Re: Heap space use on load docx

PostPosted: Sat Dec 12, 2020 5:51 am
by guille
Hi Jason,

The content was created by the client's scanner. Thank you for your support. As it has been only one case I spoke with our client to create lighter docs, as is also unnecessary to have that information on the document it self.

Nevertheless, take a look in my first post to the changes in the ZipPartStore.getBytesFromInputStream function. You can get the entry size from entry.getSize() so you can read the bytes without using the streams and improving performance.

Best regards,
Guillermo

Re: Heap space use on load docx

PostPosted: Sat Dec 12, 2020 11:02 am
by jason