Plutext

Posted: **Thu Jan 20, 2011 8:03 am**

Hello all,

Here are a few fixes/enhancements to docx4j seeking inclusion in docx4j. I've divided them into two patches, one for src/main/... and the other for src/diffx/... The first are really fixes, and the second are enhancements to the way text is tokenized when slicing up XML into events. These apply cleanly to today's trunk, r1390.

----- main.diff -----

src/main/java/org/docx4j/convert/out/pdf/viaXSLFO/SymbolWriter.java:

This just replaces some characters in one of the comments. '\205' -> "..."
and '\222' -> apostrophe. Without this change, javac complains for me
"unmappable character for encoding UTF8" I'm using JDK 6 on Ubuntu.

src/main/java/org/docx4j/openpackaging/io/LoadFromZipNG.java:

This is a change to the type of exception raised when we try to load
an invalid docx file (in particular, a zip file as a docx). Since
"[Content_Types].xml" is not present, the current code raises
NullPointerException, which feels peculiar to have to catch in calling
code. The fix will raise Docx4jException with the message "Couldn't
get [Content_Types].xml from ZipFile", so my calling code can more
cleanly report to a user "hey, that's not a valid docx file"

src/main/java/org/docx4j/openpackaging/parts/DocPropsCustomPart.java:

Word 2007 doesn't like custom property ID's less than 2. This applies
the same workaround that's already in org.docx4j.docProps.custom.Properties

----- diffx.diff -----

These proposed changes provide coarser grained ways to tokenize text when
diffx turns XML into a stream of events. The current diffx stuff creates
a token for every word, and on large documents, the diff algorithms become
unwieldy in terms of memory usage/time. Coarser text splitting makes fewer
events.

TextTokeniserSingleBlock.java - just return 1 token for the whole block

TextTokeniserSentence.java - tokenize on each sentence '.' '?' '!'

DiffxConfig.java, TokeniserFactory.java - Calling code needs a way to
specify "I want to split text by sentence/block". The method proposed
here is in addition to the existing DiffxConfig strategy of ignore/preserve
whitespace.

My submission here seems like the most straightforward way to accomplish
coarser grained text splitting, but of course I'm open to other ways of
doing it.

Dave

Posted: **Thu Feb 03, 2011 12:11 pm**

Hi Dave

Thanks for these.

davebrown wrote:src/main/java/org/docx4j/openpackaging/io/LoadFromZipNG.java:

This is a change to the type of exception raised when we try to load
an invalid docx file (in particular, a zip file as a docx). Since
"[Content_Types].xml" is not present, the current code raises
NullPointerException, which feels peculiar to have to catch in calling
code. The fix will raise Docx4jException with the message "Couldn't
get [Content_Types].xml from ZipFile", so my calling code can more
cleanly report to a user "hey, that's not a valid docx file"

Can we not further simplify to:

Index: src/main/java/org/docx4j/openpackaging/io/LoadFromZipNG.java
===================================================================
--- src/main/java/org/docx4j/openpackaging/io/LoadFromZipNG.java (revision 1410)
+++ src/main/java/org/docx4j/openpackaging/io/LoadFromZipNG.java (working copy)
@@ -296,19 +296,11 @@
}

private static InputStream getInputStreamFromZippedPart(HashMap<String, ByteArray> partByteArrays,
- String partName)
- //private static InputStream getInputStreamFromZippedPart(ZipFile zf, String partName)
- throws IOException, NullPointerException {
+ String partName) throws IOException {

- InputStream in = null;
- //in = zf.getInputStream( zf.getEntry(partName ) );
- try {
- in = partByteArrays.get(partName).getInputStream();
- } catch (NullPointerException npe) {
- log.warn("Part " + partName + " was null");
- throw npe;
- }
- return in;
+ ByteArray bytes = partByteArrays.get(partName);
+ if (bytes == null) throw new IOException("part '" + partName + "' not found");
+ return bytes.getInputStream();
}

I'm committing that as r1414.

The others I applied without change.

cheers .. Jason

Posted: **Fri Feb 04, 2011 9:16 am**

Thanks Jason. It helps to get these committed. I have diffs accumulate in my working copy, and then I worry I could lose them. At the very least, I have to ship them between machines.

I suppose I'd been trying to keep the size of the diffs small, rather than making the whole result clean and readable. I'll focus on the latter for subsequent patches.

Thanks again,

Dave

Plutext

A few patch submissions

A few patch submissions

Re: A few patch submissions

Re: A few patch submissions