A few patch submissions
Posted: Thu Jan 20, 2011 8:03 am
Hello all,
Here are a few fixes/enhancements that I'd like considered for inclusion in docx4j. I've divided them into two patches, one for src/main/... and the other for src/diffx/... The first contains straightforward fixes; the second contains enhancements to the way text is tokenized when slicing up XML into events. Both apply cleanly to today's trunk, r1390.
----- main.diff -----
src/main/java/org/docx4j/convert/out/pdf/viaXSLFO/SymbolWriter.java:
This just replaces some non-ASCII characters in one of the comments:
'\205' -> "..." and '\222' -> apostrophe. Without this change, javac
complains for me: "unmappable character for encoding UTF8". I'm using
JDK 6 on Ubuntu.
src/main/java/org/docx4j/openpackaging/io/LoadFromZipNG.java:
This changes the type of exception raised when we try to load an
invalid docx file (in particular, an ordinary zip file presented as a
docx). Since "[Content_Types].xml" is not present, the current code
raises NullPointerException, which feels peculiar to have to catch in
calling code. With the fix, it raises Docx4jException with the message
"Couldn't get [Content_Types].xml from ZipFile", so my calling code can
more cleanly report to the user "hey, that's not a valid docx file".
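
To illustrate the idea outside of docx4j, here is a minimal sketch using only java.util.zip. It is not the patch itself: InvalidPackageException is a hypothetical stand-in for Docx4jException, and ContentTypesCheck / requireContentTypes are names I made up for the example. The point is just that a missing entry is turned into a descriptive checked exception instead of a later NullPointerException.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical stand-in for Docx4jException so the sketch compiles on its own.
class InvalidPackageException extends Exception {
    InvalidPackageException(String message) {
        super(message);
    }
}

// Hypothetical helper, not part of docx4j.
class ContentTypesCheck {
    // Returns the [Content_Types].xml entry, or fails with a descriptive
    // checked exception when the zip does not contain one.
    static ZipEntry requireContentTypes(File file)
            throws IOException, InvalidPackageException {
        ZipFile zip = new ZipFile(file);
        try {
            // ZipFile.getEntry returns null when the entry is absent.
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                throw new InvalidPackageException(
                        "Couldn't get [Content_Types].xml from ZipFile");
            }
            return entry;
        } finally {
            zip.close();
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            requireContentTypes(new File(args[0]));
            System.out.println("Looks like an OPC package");
        } catch (InvalidPackageException e) {
            System.out.println("Not a valid docx file: " + e.getMessage());
        }
    }
}
```

Calling code then only needs a single catch for the "not a docx" case, rather than guarding against a NullPointerException.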
src/main/java/org/docx4j/openpackaging/parts/DocPropsCustomPart.java:
Word 2007 doesn't like custom property IDs less than 2. This applies
the same workaround that's already in org.docx4j.docProps.custom.Properties.
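
As a rough illustration of the constraint (not the actual code in either class), the sketch below picks the next property ID so that it is never below 2. CustomPropertyIds, FIRST_USABLE_PID and nextPid are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not part of docx4j.
class CustomPropertyIds {
    // Word 2007 rejects custom property IDs below this value.
    static final int FIRST_USABLE_PID = 2;

    // Returns an ID greater than every existing ID, and never less than 2.
    static int nextPid(Collection<Integer> existingPids) {
        int next = FIRST_USABLE_PID;
        for (int pid : existingPids) {
            if (pid >= next) {
                next = pid + 1;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(nextPid(Arrays.asList(2, 3)));
    }
}
```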
----- diffx.diff -----
These proposed changes provide coarser-grained ways to tokenize text when
diffx turns XML into a stream of events. The current diffx code creates
a token for every word, and on large documents the diff algorithms become
unwieldy in terms of memory usage and time. Coarser text splitting
produces fewer events.
TextTokeniserSingleBlock.java - returns a single token for the whole block
TextTokeniserSentence.java - tokenizes at the end of each sentence ('.', '?', '!')
DiffxConfig.java, TokeniserFactory.java - Calling code needs a way to
specify "I want to split text by sentence/block". The method proposed
here is in addition to the existing DiffxConfig strategy of ignore/preserve
whitespace.
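
For anyone who wants the gist without reading the diff, here is a simplified, standalone sketch of the two splitting strategies. It does not use the diffx event or tokeniser interfaces; CoarseTokenisers, Granularity and the method names are hypothetical, and keeping trailing whitespace attached to each sentence is my own simplification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical standalone sketch, not the classes in the patch.
class CoarseTokenisers {
    // Hypothetical setting that calling code would choose.
    enum Granularity { SENTENCE, BLOCK }

    // Single-block strategy: the whole text run becomes one token.
    static List<String> singleBlock(String text) {
        return Collections.singletonList(text);
    }

    // Sentence strategy: cut after each '.', '?' or '!'. The terminator and
    // any whitespace after it stay with the sentence, so concatenating the
    // tokens reproduces the input exactly.
    static List<String> bySentence(String text) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i++);
            if (c == '.' || c == '?' || c == '!') {
                while (i < text.length()
                        && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
                start = i;
            }
        }
        // Text after the last terminator forms a final token.
        if (start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    // Dispatches on the granularity the caller asked for.
    static List<String> tokenise(String text, Granularity granularity) {
        return granularity == Granularity.BLOCK
                ? singleBlock(text)
                : bySentence(text);
    }

    public static void main(String[] args) {
        String text = "Hi there. How are you? Fine";
        System.out.println(tokenise(text, Granularity.SENTENCE).size());
        System.out.println(tokenise(text, Granularity.BLOCK).size());
    }
}
```

Note that this naive sentence split also cuts at abbreviations and at each dot of an ellipsis. That is acceptable for diffing purposes, since the goal is only to produce fewer events than word-level splitting.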
My submission here seems like the most straightforward way to accomplish
coarser-grained text splitting, but of course I'm open to other ways of
doing it.
Dave