Reading a Word docx to output text to Excel

Posted: Wed Jun 07, 2017 8:26 am
by cveridis
I was doing this with POI, but it could not read font type and size reliably, and my research suggested docx4j is better at that.
My issues so far:
1. I can't get the text from a line(paragraph) consistently; the text is broken into multiple org.docx4j.wml.P/R/Text objects even though the 'line' is unbroken in the text
2. I am not getting the Font type and Size consistently; it is often null and so doesn't allow an easy way to read the font information
3. As an intermediate Java programmer, I have not found a simple example in my internet research comparable to POI's functions for reading the text; POI is also limited at reading fonts (especially for docx), but it is simpler in structure.

I have adapted a couple of versions of the traverse/dump code, and they read the file without problems as designed, but I need to output the data to a tab-delimited or Excel file. The inputs are what are called Copy Documents, which specify website and email content. I'd like to output that to an Excel/delimited text file in 2 columns, basically:
Designator/linkname/alias
Copy Text.
The files are put together by Program Managers, so they have differing formats, where the first part is sometimes contained within [] brackets, sometimes Bold text at a certain size, and even italics (which is why I need to read the font type & size) - is there an example out there that has this easily implemented?
-OR- an easier question might be: am I picking the most difficult path by using the Traverse method (wordMLPackage) rather than the JAXB/OpenXML Parts method? I have noticed the two use different code.
Thanks for any help you can offer! Sorry for the newbie approach.

Michael (cveridis)

Re: Reading a Word docx to output text to Excel

Posted: Wed Jun 07, 2017 9:44 am
by jason
Hi Michael

cveridis wrote: I can't get the text from a line(paragraph) consistently; the text is broken into multiple org.docx4j.wml.P/R/Text objects even though the 'line' is unbroken in the text

That'll be because that's how it is represented in the XML emitted by Word: a single paragraph (w:p) is often split into several runs (w:r), each with its own text (w:t). Use the docx4j webapp or the Docx4j Helper Word AddIn to see the XML, or just unzip the docx. Or in docx4j, marshal to String.
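
To make the run splitting concrete, here is a minimal sketch that joins the w:t text of every run in a paragraph. It deliberately uses only the JDK's DOM parser on the raw document.xml content rather than docx4j, so the class and method names (RunTextJoiner, paragraphTexts) are made up for illustration and are not part of any library. It assumes you have already extracted word/document.xml as a String, and it ignores tabs, line breaks and nested paragraphs (for example in text boxes).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not docx4j API: plain JDK XML parsing only.
final class RunTextJoiner {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per w:p, concatenating the w:t text of all its runs. */
    static List<String> paragraphTexts(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
            List<String> result = new ArrayList<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                // Descendant search: finds the w:t inside every w:r of this paragraph.
                NodeList texts = p.getElementsByTagNameNS(W, "t");
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    sb.append(texts.item(j).getTextContent());
                }
                result.add(sb.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse WordprocessingML", e);
        }
    }

    public static void main(String[] args) {
        // One visually unbroken line that Word stored as three runs.
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
                + "<w:r><w:rPr><w:b/></w:rPr><w:t>[Head</w:t></w:r>"
                + "<w:r><w:rPr><w:b/></w:rPr><w:t>line]</w:t></w:r>"
                + "<w:r><w:t xml:space=\"preserve\"> Welcome back</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(paragraphTexts(xml));
    }
}
```

The same idea applies to the docx4j object model: collect the Text children of each R inside a P and append them in order.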

If you want just the text, see https://github.com/plutext/docx4j/blob/ ... s.java#L53

There is a tension between wanting "the text" and getting the font information.

Since font info has document-specific defaults which can be overridden in the actual w:p/w:r, I'd suggest you look at the XHTML output or the PDF output via XSL FO, and adapt that. There are 2 ways of doing each output: via XSLT, and via traversal. Pick whichever you are most comfortable with.

The XSL FO output is probably more useful for you, because it makes all the info explicit in each paragraph, as opposed to CSS style inheritance.
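
The reason the font is often null on a run is the inheritance just described: a value missing on the run has to be looked up further up the chain. The sketch below shows that lookup in plain Java under simplifying assumptions: it only considers direct run formatting, then the paragraph style and its basedOn parents, then the document defaults, and skips character, table and numbering styles. FontResolver and Style are hypothetical names, not docx4j classes; you would fill them from styles.xml yourself. Sizes follow the w:sz convention of half-points.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the style cascade; not part of docx4j.
final class FontResolver {
    /** A style's own run properties; null means "not set at this level". */
    static final class Style {
        final String basedOn;     // parent style id, or null
        final String font;        // e.g. taken from w:rFonts
        final Integer halfPoints; // taken from w:sz, in half-points
        Style(String basedOn, String font, Integer halfPoints) {
            this.basedOn = basedOn;
            this.font = font;
            this.halfPoints = halfPoints;
        }
    }

    private final Map<String, Style> styles = new HashMap<>();
    private final String defaultFont;
    private final int defaultHalfPoints;

    FontResolver(String defaultFont, int defaultHalfPoints) {
        this.defaultFont = defaultFont;
        this.defaultHalfPoints = defaultHalfPoints;
    }

    void addStyle(String id, Style style) {
        styles.put(id, style);
    }

    /** Direct run value wins, then the style chain, then the document default. */
    String font(String runFont, String styleId) {
        if (runFont != null) return runFont;
        Style s = styles.get(styleId);
        // The hop limit guards against a cyclic basedOn chain.
        for (int hops = 0; s != null && hops <= styles.size(); hops++) {
            if (s.font != null) return s.font;
            s = styles.get(s.basedOn);
        }
        return defaultFont;
    }

    /** Same lookup for the size, converted from half-points to points. */
    double sizeInPoints(Integer runHalfPoints, String styleId) {
        if (runHalfPoints != null) return runHalfPoints / 2.0;
        Style s = styles.get(styleId);
        for (int hops = 0; s != null && hops <= styles.size(); hops++) {
            if (s.halfPoints != null) return s.halfPoints / 2.0;
            s = styles.get(s.basedOn);
        }
        return defaultHalfPoints / 2.0;
    }

    public static void main(String[] args) {
        FontResolver r = new FontResolver("Calibri", 22);
        r.addStyle("Normal", new Style(null, "Arial", null));
        r.addStyle("Heading1", new Style("Normal", null, 32));
        // The run sets nothing itself, so both values come from the style chain.
        System.out.println(r.font(null, "Heading1") + " "
                + r.sizeInPoints(null, "Heading1"));
    }
}
```

This is the work the XSL FO output already does for you, which is why adapting it is likely less effort than resolving the properties by hand.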

I suggest you start with tab-delimited output, and when you get that working, maybe consider Excel output.
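
For the tab-delimited step, something along these lines may be enough to start with. It is only a sketch: CopyDocTsv and toRow are made-up names, it handles just the case where the designator is in [] brackets at the start of the paragraph text, and it assumes you already have each paragraph as one joined string. Designators marked by bold or italic text would need the font information instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the two-column output; plain JDK only.
final class CopyDocTsv {
    // Leading "[designator]" followed by the copy text.
    private static final Pattern BRACKETED =
            Pattern.compile("^\\s*\\[([^\\]]+)\\]\\s*(.*)$", Pattern.DOTALL);

    /** Replaces tabs and line breaks so a field cannot break the row layout. */
    static String clean(String field) {
        return field.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** Builds one "designator<TAB>copy text" row; designator is empty if absent. */
    static String toRow(String paragraphText) {
        Matcher m = BRACKETED.matcher(paragraphText);
        if (m.matches()) {
            return clean(m.group(1)) + "\t" + clean(m.group(2));
        }
        return "\t" + clean(paragraphText);
    }

    public static void main(String[] args) {
        String[] paragraphs = {
            "[Hero CTA] Shop the sale now",
            "Body copy without a designator"
        };
        for (String p : paragraphs) {
            System.out.println(toRow(p));
        }
    }
}
```

Excel opens a file of such rows directly, so a separate Excel writer may not be needed at first.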