
Optimization for very large documents?

Posted: Fri Nov 08, 2019 3:19 am
by mobiusdt
Background: Client is requesting a report in multiple formats, one of which is docx. We decided the easiest route would be to have an HTML template to populate the report data into, convert the resulting string to XHTML, and then convert that to the final format. The report contains details about a given record, and there is one table in the report for a type of association. We have recently integrated with a dataset whose records have over 100k of that type of association. We've been getting timeout issues for docx when the table contains 100k+ rows. We can't drop the table, and we can't extend the timeout.

Question: Is it possible to optimize the XHTML conversion process, or should I look into handling this as a special case and not populating the table until after conversion? I noticed the comment on XHTMLImporterImpl about the FSEntityResolver issue, but it sounds like that is something that needs to be fixed in Flying Saucer; please correct me if I am misinterpreting that. Thank you for your time.
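
For context, the conversion step is roughly the sketch below (simplified; the class name, file name, and XHTML string are placeholders rather than our actual code):

Code:
import java.io.File;
import java.util.List;

import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class ReportToDocx {
    public static void main(String[] args) throws Exception {
        // The populated HTML template, already tidied into well-formed XHTML
        String xhtml = "<html><body><table><!-- ~100k association rows --></table></body></html>";

        WordprocessingMLPackage pkg = WordprocessingMLPackage.createPackage();
        XHTMLImporterImpl importer = new XHTMLImporterImpl(pkg);

        // convert(content, baseUrl) returns WML objects to append to the main document part
        List<Object> converted = importer.convert(xhtml, null);
        pkg.getMainDocumentPart().getContent().addAll(converted);

        pkg.save(new File("report.docx"));
    }
}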

As an aside, we have advised the client that a report of that size isn't ideal for docx, as the format tends to become unstable with files past a certain size.

Re: Optimization for very large documents?

Posted: Tue Nov 12, 2019 12:10 pm
by jason
Interesting question. I would need to examine the code (at both the docx4j-ImportXHTML and Flying Saucer levels) to answer properly. Before introducing XHTML import into the equation, it would be prudent to check that docx4j itself handles creating/saving a document with 100K rows OK. Have you tried this? (How many columns does your table have? What do the cells contain?) You may need to adjust the RAM allocated to the JVM.
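
Something along these lines (an untested sketch; the row/column counts and the cell width in twips are just placeholders) would tell you whether docx4j on its own copes with a table that size:

Code:
import java.io.File;

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.Tbl;
import org.docx4j.wml.TblFactory;

public class LargeTableBaseline {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage pkg = WordprocessingMLPackage.createPackage();

        // Empty 100,000-row x 3-column table; the third argument is the cell width in twips
        Tbl table = TblFactory.createTable(100_000, 3, 3000);
        pkg.getMainDocumentPart().addObject(table);

        pkg.save(new File("baseline-100k-rows.docx"));
    }
}

Run it with a generous heap (for example -Xmx2g) and see how long creating and saving the package takes.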

mobiusdt wrote:As an aside, we have advised the client that a report of that size isn't ideal for docx, as the format is generally unstable dealing with files past a certain size threshold.


I'm not aware of a limit on the number of table rows in Word: https://docs.microsoft.com/en-us/office ... limitation

Re: Optimization for very large documents?

Posted: Wed Nov 13, 2019 5:39 am
by mobiusdt
So, I ended up implementing the latter option (manually populating the table after the XHTML import) to get around the slow import. It generates the table with 100k+ rows correctly and saves relatively quickly; the total time is around 40 seconds, well within the 2-minute timeout. I would still prefer to handle everything in the XHTML import process, as the current implementation is much more convoluted and ends up making the template/code relationship less flexible.
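
For anyone hitting the same problem, the workaround looks roughly like this (a simplified sketch; buildAssociationTable and the row data shape are specific to our report and only shown to illustrate building the table from docx4j WML objects instead of importing it through XHTML):

Code:
import java.util.List;

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.ObjectFactory;
import org.docx4j.wml.P;
import org.docx4j.wml.R;
import org.docx4j.wml.Tbl;
import org.docx4j.wml.Tc;
import org.docx4j.wml.Text;
import org.docx4j.wml.Tr;

public class AssociationTableBuilder {

    private static final ObjectFactory WML = new ObjectFactory();

    // Builds the large association table directly as WML objects,
    // so it never passes through the XHTML importer.
    public static Tbl buildAssociationTable(List<String[]> rows) {
        Tbl table = WML.createTbl();
        for (String[] rowData : rows) {
            Tr row = WML.createTr();
            for (String cellText : rowData) {
                Text text = WML.createText();
                text.setValue(cellText);
                R run = WML.createR();
                run.getContent().add(text);
                P paragraph = WML.createP();
                paragraph.getContent().add(run);
                Tc cell = WML.createTc();
                cell.getContent().add(paragraph);
                row.getContent().add(cell);
            }
            table.getContent().add(row);
        }
        return table;
    }

    // Called after the template (with the big table left out) has been imported from XHTML
    public static void appendTable(WordprocessingMLPackage pkg, List<String[]> rows) {
        pkg.getMainDocumentPart().addObject(buildAssociationTable(rows));
    }
}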

Thanks for the clarification on the file size; the limit seems to be much larger than I was led to believe.