Is it possible to extract all text also Tab and Hyphen?

Posted: Fri Sep 26, 2014 12:19 am
by m4tt3
Hi,
I would like to know if it is possible to extract all the text elements of a document, including tabs and hyphens, and put them in a string.

I've tried TextUtils.java, but it doesn't work: it doesn't catch the tab and hyphen elements.
I've implemented my own method that traverses all the children of the document, but I can't generalize it well, because every document is different and there is always a case that I don't cover. The latest one is ProofErr. :)

Thanks a lot

Re: Is it possible to extract all text also Tab and Hyphen?

Posted: Sat Oct 04, 2014 1:56 pm
by jason
The problem with tab and hyphen, as I guess you know, is that they aren't represented in the docx as normal characters.

Tab is w:tab

A hyphen might be an ordinary hyphen character, or it might be displayed by Word without actually being in the docx, or it might be:

http://webapp.docx4java.org/OnlineDemo/ ... yphen.html

or http://webapp.docx4java.org/OnlineDemo/ ... yphen.html

Replicating Word's hyphenation behaviour would be a challenge.

But for the others, there are three approaches which occur to me:

1. generalising your traversal approach (are you using TraversalUtil.getChildrenImpl?)

2. doing it in XSLT (you can do this in docx4j, but XSLT is probably slower, and it means a mix of technologies)

3. marshal the main document part to a string, do suitable string replacements, then unmarshal, then use TextUtils
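
To make approach 3 a bit more concrete, here is a minimal sketch of just the string-replacement step, using only the JDK; the marshal and unmarshal steps around it are left out. The class and method names are made up for illustration. It assumes the marshalled XML uses the usual w: prefix, and that the hyphen you are after is stored as w:noBreakHyphen (that is my assumption; your documents may use something else). Each such element is rewritten as an ordinary w:t, so that a text extractor which only looks at w:t content should then pick it up.

```java
import java.util.regex.Pattern;

/** Sketch of the string-replacement step in approach 3 (JDK only, no docx4j calls). */
final class SpecialRunContent {

    // Matches only the attribute-less run-level element. Tab stop definitions
    // inside w:tabs carry attributes (w:val, w:pos) and are left untouched.
    private static final Pattern TAB = Pattern.compile("<w:tab\\s*/>");

    // Assumption: the hyphen is stored as a w:noBreakHyphen element.
    private static final Pattern NO_BREAK_HYPHEN =
            Pattern.compile("<w:noBreakHyphen\\s*/>");

    private SpecialRunContent() {
    }

    /** Rewrites tab and non-breaking hyphen elements as plain w:t text elements. */
    static String inlineAsText(String documentXml) {
        // xml:space="preserve" marks the lone tab character as significant.
        String withTabs = TAB.matcher(documentXml)
                .replaceAll("<w:t xml:space=\"preserve\">\t</w:t>");
        return NO_BREAK_HYPHEN.matcher(withTabs).replaceAll("<w:t>-</w:t>");
    }
}
```

You would call SpecialRunContent.inlineAsText(xml) on the marshalled main document part before unmarshalling it again. Plain string replacement like this is fragile (it depends on the namespace prefix and on the elements being self-closed), which is one reason the traversal approach may still be the more robust one.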