Here are a few fixes and enhancements that I'd like to see included in docx4j. I've divided them into two patches, one for src/main/... and the other for src/diffx/... The first contains straightforward fixes; the second contains enhancements to the way text is tokenized when slicing up XML into events. Both apply cleanly to today's trunk, r1390.
----- main.diff -----
This just replaces some characters in one of the comments: '\205' -> "..."
and '\222' -> apostrophe. Without this change, javac complains
"unmappable character for encoding UTF8". I'm using JDK 6 on Ubuntu.
This changes the type of exception raised when we try to load
an invalid docx file (in particular, a plain zip file passed in as a
docx). Since "[Content_Types].xml" is not present, the current code raises
NullPointerException, which feels peculiar to have to catch in calling
code. With the fix, it raises Docx4JException with the message "Couldn't
get [Content_Types].xml from ZipFile", so my calling code can more
cleanly report to a user "hey, that's not a valid docx file".
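
To illustrate the idea (this is a self-contained sketch, not the patch itself): the check below opens a zip with java.util.zip, looks for "[Content_Types].xml", and throws a descriptive checked exception when it is missing. InvalidPackageException, requireContentTypes and plainZip are hypothetical names made up for this example; in docx4j the exception would be the library's own.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class InvalidDocxCheck {

    // Hypothetical stand-in for the library's checked exception.
    static class InvalidPackageException extends Exception {
        InvalidPackageException(String message) {
            super(message);
        }
    }

    // Fails early with a descriptive exception when the content-types part
    // is missing, instead of letting a null entry turn into a
    // NullPointerException further down the line.
    static void requireContentTypes(File file)
            throws IOException, InvalidPackageException {
        try (ZipFile zip = new ZipFile(file)) {
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                throw new InvalidPackageException(
                        "Couldn't get [Content_Types].xml from ZipFile");
            }
        }
    }

    // Builds a valid zip that is not a docx: it has no [Content_Types].xml.
    static File plainZip() throws IOException {
        File f = File.createTempFile("not-a-docx", ".zip");
        f.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(f))) {
            out.putNextEntry(new ZipEntry("readme.txt"));
            out.write("hello".getBytes("UTF-8"));
            out.closeEntry();
        }
        return f;
    }

    public static void main(String[] args) throws IOException {
        try {
            requireContentTypes(plainZip());
            System.out.println("Looks like a docx package");
        } catch (InvalidPackageException e) {
            // The caller can report a clean message instead of catching NPE.
            System.out.println("Hey, that's not a valid docx file: " + e.getMessage());
        }
    }
}
```

The point is only that callers get a checked, self-describing failure to catch, rather than having to guard against NullPointerException.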
Word 2007 doesn't like custom property IDs less than 2. This applies
the same workaround that's already in org.docx4j.docProps.custom.Properties.
----- diffx.diff -----
These proposed changes provide coarser-grained ways to tokenize text when
diffx turns XML into a stream of events. The current diffx code creates
a token for every word, and on large documents the diff algorithms become
unwieldy in terms of memory usage/time. Coarser text splitting makes fewer
TextTokeniserSingleBlock.java - just return 1 token for the whole block
TextTokeniserSentence.java - return 1 token per sentence, splitting on '.', '?' and '!'
DiffxConfig.java, TokeniserFactory.java - calling code needs a way to
specify "I want to split text by sentence/block". The method proposed
here is in addition to the existing DiffxConfig strategy of ignore/preserve
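
A rough sketch of the two splitting strategies, to show the difference in token counts. This is not the code in the patch and does not use the diffx tokeniser interfaces; CoarseTokenisers, singleBlock and sentences are hypothetical names for this example. The sentence splitter here ends a token after a run of '.', '?' or '!' plus any whitespace that follows, so it will also split inside abbreviations like "e.g." and numbers like "3.14".

```java
import java.util.ArrayList;
import java.util.List;

public class CoarseTokenisers {

    // Single-block strategy: the whole text is one token.
    static List<String> singleBlock(String text) {
        List<String> tokens = new ArrayList<String>();
        if (!text.isEmpty()) {
            tokens.add(text);
        }
        return tokens;
    }

    static boolean isTerminator(char c) {
        return c == '.' || c == '?' || c == '!';
    }

    // Sentence strategy: one token per sentence. Each token keeps its
    // terminator(s) and trailing whitespace, so joining the tokens
    // reproduces the input exactly.
    static List<String> sentences(String text) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            if (isTerminator(text.charAt(i))) {
                i++;
                // Absorb repeated terminators ("?!", "...") and whitespace.
                while (i < text.length()
                        && (isTerminator(text.charAt(i))
                            || Character.isWhitespace(text.charAt(i)))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
                start = i;
            } else {
                i++;
            }
        }
        // Trailing text with no terminator becomes the last token.
        if (start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "One. Two? Three! Four";
        System.out.println("block tokens:    " + singleBlock(text).size());
        System.out.println("sentence tokens: " + sentences(text).size());
        System.out.println("word tokens:     " + text.split("\\s+").length);
    }
}
```

Keeping the whitespace inside each token is a choice made for this sketch so that nothing is lost when the tokens are joined back together; the real tokenisers may treat whitespace differently depending on the DiffxConfig setting.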
My submission here seems like the most straightforward way to accomplish
coarser-grained text splitting, but of course I'm open to other ways of