
Importing a WordprocessingMLPackage obj for performance

Posted: Mon Feb 25, 2013 1:43 pm
by andrewk
I am using WordprocessingMLPackage.load(new ByteArrayInputStream(b)) to load the wordMLPackage on each request that requires docx manipulation, but I want to explore alternatives that bring the cost down to tens of milliseconds rather than seconds. The options explored so far are:

1. WordprocessingMLPackage.load(bais): about 500 ms at best with a test docx
2. wordMLPackage.clone(): similar to load(), about 500 ms with a test docx
3. Serializing the result of WordprocessingMLPackage.load(bais): not possible, since WordprocessingMLPackage is not Serializable
4. Creating the wordMLPackage from an org.docx4j.xmlPackage.Package using FlatOpcXmlImporter: about 50 ms at best with a test docx
5. Pooling wordMLPackage objects for re-use: overkill at this stage

Option 4 seems positive; however, I am concerned that this approach is not thread-safe. In this basic test harness, each FlatOpcXmlImporter object is constructed from the same unmarshalled Package object (wmlPackageEl), and I am unsure whether this creates a deep clone or just references the same Part objects, which could cause threading issues. Does anyone have an opinion on this?

Code:
        // Read the whole Flat OPC XML file into memory.
        byte[] b;
        try (RandomAccessFile f = new RandomAccessFile(inputfilepath, "r")) {
            b = new byte[(int) f.length()];
            f.readFully(b); // plain read() may return before the buffer is full
        }

        // u is an Unmarshaller for the org.docx4j.xmlPackage classes, created elsewhere.
        StreamSource source = new StreamSource(new ByteArrayInputStream(b));
        org.docx4j.xmlPackage.Package wmlPackageEl = ((JAXBElement<org.docx4j.xmlPackage.Package>) u
            .unmarshal(source)).getValue();

        // Time 100 package builds, all from the same unmarshalled Package object.
        // wordMLPackage and times (a List<Long>) are declared elsewhere.
        for (int i = 0; i < 100; i++) {
            long start = System.currentTimeMillis();
            FlatOpcXmlImporter xmlPackage = new FlatOpcXmlImporter(wmlPackageEl);
            wordMLPackage = (WordprocessingMLPackage) xmlPackage.get();
            times.add(Long.valueOf(System.currentTimeMillis() - start));
        }
       


Also, are there other options that have been successful in improving the performance of obtaining WordprocessingMLPackage objects and avoiding the constant rebuilding of parts and relationships?

Re: Importing a WordprocessingMLPackage obj for performance

Posted: Mon Feb 25, 2013 4:49 pm
by jason
Please note also that GitHub trunk has a package org.docx4j.openpackaging.io3 which supports storing a docx unzipped, and lazy unmarshalling (of all the XML parts, you might only really need an XML representation of the main document part, document.xml).

If you are using current dev source code, you'll already be using that via method 1.

Once you have a WordprocessingMLPackage object, what do you want to do with it? This might suggest some other approaches.

Re: Importing a WordprocessingMLPackage obj for performance

Posted: Mon Feb 25, 2013 5:05 pm
by andrewk
The main operations include simple token replacement (plain text), checking checkboxes, selecting dropdown values, and adding rows to tables (with some border formatting).

The main aim, I suppose, is that on start-up of JBoss we load each docx from file or database bytes into memory, and so take the up-front cost of the load for all documents. Then, on each client web request, the thread is handed a *cloned* object representing the WordprocessingMLPackage, at which point token replacement etc. can occur using client data.
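The start-up/per-request split described above could be sketched roughly as follows. This is only an illustration in plain Java: `TemplateCache`, `register` and `newInstance` are hypothetical names, not docx4j API, and the `builder` function stands in for whatever turns cached bytes into a per-request object.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: template bytes are stored once, and each request builds its own object. */
final class TemplateCache<T> {

    private final Map<String, byte[]> bytesByName = new ConcurrentHashMap<>();
    private final Function<byte[], T> builder;

    /** builder is an assumption: any function that turns raw bytes into a fresh object. */
    TemplateCache(Function<byte[], T> builder) {
        this.builder = builder;
    }

    /** Call once per template at start-up. The array is copied so later changes cannot leak in. */
    void register(String name, byte[] content) {
        bytesByName.put(name, content.clone());
    }

    /** Call per request. Each caller gets an object built from its own copy of the bytes. */
    T newInstance(String name) {
        byte[] content = bytesByName.get(name);
        if (content == null) {
            throw new IllegalArgumentException("Unknown template: " + name);
        }
        return builder.apply(content.clone());
    }
}
```

With docx4j, the builder would presumably wrap WordprocessingMLPackage.load(new ByteArrayInputStream(bytes)) and rethrow its checked exception. Note that this only removes the file or database read from the request path; the cost of building the package is still paid inside the builder on every request.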

Re: Importing a WordprocessingMLPackage obj for performance

Posted: Mon Feb 25, 2013 5:50 pm
by jason
andrewk wrote: The main operations include simple token replacement (plain text), checking checkboxes, selecting dropdown values, and adding rows to tables (with some border formatting).


Does the client request always end with them getting a docx?

Do you manipulate anything besides the Main Document Part?

Re: Importing a WordprocessingMLPackage obj for performance

Posted: Mon Feb 25, 2013 5:56 pm
by andrewk
Yes, clients always end up receiving a docx (or a docm for macro-enabled documents), and only the main document part is manipulated.

Re: Importing a WordprocessingMLPackage obj for performance

Posted: Mon Feb 25, 2013 6:38 pm
by jason
When you start JBoss, you could create a DOM document or byte[] representing the contents of the main document part.

When a client request comes in, you unmarshal that; maybe that's all you need, or maybe you need to make it a main document part, or perhaps a main document part in a shell package.

Then you do your manipulation of the JAXB objects using docx4j.

When you are done, you marshal it and inject document.xml into a zip file which already contains everything except that part. I assume you can find a zip implementation which supports doing that.
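As a rough illustration of the injection step, here is a sketch that uses only java.util.zip. `DocxPartSwapper` and `replacePart` are hypothetical names, and the sketch rewrites the whole archive rather than appending to an existing zip in place: it copies every entry from the template, substitutes the named part if it is present, and appends it if it is not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: produces a copy of a zipped package with one part replaced or added. */
final class DocxPartSwapper {

    static final String MAIN_PART = "word/document.xml";

    static byte[] replacePart(byte[] templateZip, String partName, byte[] newContent)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(templateZip.length + newContent.length);
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(templateZip));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            byte[] buffer = new byte[8192];
            boolean written = false;
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // A fresh ZipEntry lets ZipOutputStream compute size and CRC itself.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().equals(partName)) {
                    zout.write(newContent); // substitute the marshalled part
                    written = true;
                } else {
                    int n;
                    while ((n = zin.read(buffer)) != -1) { // -1 marks the end of this entry
                        zout.write(buffer, 0, n);
                    }
                }
                zout.closeEntry();
            }
            if (!written) {
                // The template held everything except this part, so append it.
                zout.putNextEntry(new ZipEntry(partName));
                zout.write(newContent);
                zout.closeEntry();
            }
        }
        return out.toByteArray();
    }
}
```

A call might look like DocxPartSwapper.replacePart(templateBytes, DocxPartSwapper.MAIN_PART, marshalledXml), where both byte arrays are assumed to come from the earlier steps. Because every entry is re-compressed on each call, this is not the cheapest possible approach; a zip library that can append to a prepared archive may be faster.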

Critique: A very non-standard usage of docx4j, which is decidedly not recommended/supported (i.e. you'd be on your own), but an interesting thought experiment, and should be quick...

Re your option 4: org.docx4j.xmlPackage stores the content using JAXB, but getXmlData().getAny() returns a DOM Element, which is then unmarshalled. So I guess the question comes down to whether N different Unmarshaller objects created from a single JAXBContext can each operate on a single DOM tree at the same time. Check the JAXB Javadoc and possibly the spec, and if there is no answer there, ask on StackOverflow (tagged JAXB). If this much is OK, the approach may be free of threading issues.
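If the answer turns out to be unclear, one way to sidestep the question is to not share a DOM tree between threads at all. The sketch below is a different technique from option 4, shown only for comparison: it keeps an immutable byte[] and parses a fresh DOM for each request. `PerRequestDom` and `freshRoot` are hypothetical names, and only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

/** Hypothetical helper: each call parses its own DOM, so no tree is shared across threads. */
final class PerRequestDom {

    private final byte[] xml; // private copy, never modified after construction

    PerRequestDom(byte[] xml) {
        this.xml = xml.clone();
    }

    /** Parses the cached bytes and returns the root element of a brand-new document. */
    Element freshRoot() throws ParserConfigurationException, SAXException, IOException {
        // Factory and builder are created per call, so they are not shared between threads either.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // WordprocessingML relies on namespaces
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        return doc.getDocumentElement();
    }
}
```

The trade-off is an extra parse per request, so it would need measuring against the 50 ms figure for option 4 before deciding whether the isolation is worth it.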