Design for large word documents

Posted: Thu Mar 17, 2016 1:17 am
by bjenkins
I've been using docx4j (v3.2.1, on Windows 7 Pro 64-bit with MS Word 2013 and 64-bit Java 8) off and on for a couple of years now and really appreciate this product. I am currently creating very large documents (~15k pages) with bookmarks all over the place to help people navigate around such a huge tome. The FIRST time I load a docx file created by docx4j, it takes a LONG time to load (around 30-60 minutes on a 4 GB machine with 4 cores). I assume that MS Word is optimizing the document somehow, because later loads of the .docx file are much faster. I'm guessing that the way I am creating the docx file must be inefficient. So, my general questions are:
1. Are there general design patterns for creating efficient documents that you can recommend?
2. Are there any anti-patterns I should stay away from, in general and for large documents specifically?
3. Should I step up to v3.2.2 (or higher)? I read the release notes and did not see anything about performance improvements.
4. Should I throw more cores or RAM at the problem?
5. Would upgrading to a newer version of Word help?

Thanks again for these great libraries!

Bart

Re: Design for large word documents

Posted: Thu Mar 17, 2016 8:05 am
by jason
I assume after your first load in Word 2013, you are saving it? And it is this saved document which is faster to load in Word next time?

You might visually compare, say, the first 100 lines of document.xml from the original and the re-saved docx to see what has changed.
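
One practical note on that comparison: document.xml inside a docx often has few or no line breaks, so "the first 100 lines" may in practice mean "the first few thousand characters". Here is a minimal sketch that uses only the JDK's zip support (no docx4j) to print the start of word/document.xml from each file. The class name DocxHead and the 4000-character limit are illustrative assumptions, not anything from docx4j:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxHead {

    // Usual location of the main document part inside a .docx package.
    static final String MAIN_PART = "word/document.xml";

    /** Returns up to maxChars characters from the start of the reader. */
    static String head(Reader reader, int maxChars) throws IOException {
        char[] buf = new char[maxChars];
        int total = 0;
        while (total < maxChars) {
            int n = reader.read(buf, total, maxChars - total);
            if (n < 0) {
                break; // end of stream reached before maxChars
            }
            total += n;
        }
        return new String(buf, 0, total);
    }

    /** Reads the start of word/document.xml without unpacking the whole file. */
    static String headOfMainPart(String docxPath, int maxChars) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(MAIN_PART);
            if (entry == null) {
                throw new IOException("No " + MAIN_PART + " in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry);
                 Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
                return head(reader, maxChars);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java DocxHead original.docx resaved.docx");
            return;
        }
        for (String path : args) {
            System.out.println("==== " + path + " ====");
            System.out.println(headOfMainPart(path, 4000)); // 4000 is an arbitrary choice
        }
    }
}
```

Run it with the docx4j output as the first argument and the copy Word re-saved as the second, then compare the two dumps by eye or with a diff tool.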

I know Word will put in w:lastRenderedPageBreak elements, and I guess this might be why it is quicker. The first time, while it is slow, do you see Word say "repaginating" in the status bar at the bottom?
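
To check the w:lastRenderedPageBreak guess without reading the XML by eye, you could count those elements in each file. Below is a rough sketch that streams word/document.xml with the JDK's StAX parser, so memory use should stay modest even for a very large document. The class name DocxElementCounter is made up for illustration, and it assumes the main part lives at word/document.xml. It also counts bookmarkStart, simply to show how many bookmarks are involved:

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DocxElementCounter {

    /** Streams the XML and counts start tags whose local name is in localNames. */
    static Map<String, Integer> count(InputStream xml, String... localNames)
            throws XMLStreamException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : localNames) {
            counts.put(name, 0);
        }
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // Only names we were asked about are present in the map.
                    counts.computeIfPresent(reader.getLocalName(), (k, v) -> v + 1);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java DocxElementCounter file.docx [more.docx ...]");
            return;
        }
        for (String path : args) {
            try (ZipFile zip = new ZipFile(path)) {
                ZipEntry entry = zip.getEntry("word/document.xml");
                if (entry == null) {
                    System.out.println(path + ": no word/document.xml found");
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    System.out.println(path + ": "
                            + count(in, "lastRenderedPageBreak", "bookmarkStart"));
                }
            }
        }
    }
}
```

If the docx4j output reports zero lastRenderedPageBreak elements and the re-saved copy reports thousands, that would at least be consistent with the repagination theory, though it would not prove it.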

As another experiment, remove the bookmarks (i.e. don't have docx4j write them). Does this affect the opening speed? (Word doesn't write an index of them to the docx file, and I'm not aware that it saves anything to the local machine, but you might check this by opening your re-saved docx on some other machine to see whether it is fast or slow there.)

Assuming you can't share your docx, a proxy might be the ECMA-376 (1st edition) Part 4 "Markup Language Reference" docx, which is 5220 pages.