Parsing a docx

Posted: Wed Aug 08, 2012 1:44 am
by stoan
Hello everyone

I'm new to Java and docx4j, and I have been assigned the task of parsing a docx. An example of the document is below.

In each document, there can be any number of Tabs and Fields. Each Tab name and description must be extracted and saved to the database, as well as each Field name and description. My question is: how do I parse a document like this?

The structure of the docx file:

Form name

Form description

Tab name

Tab description

Tab name

Tab description

Tab name

Tab description

Field name

Field description

Field name

Field description

Field name

Field description

Field name

Field description

Re: Parsing a docx

Posted: Wed Aug 08, 2012 9:29 am
by jason
To start, you need to know how things are represented in WordML, so unzip the docx, and post the contents of word/document.xml.

To format it as XML, surround what you post with square brackets containing the word 'xml'.
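
For anyone following along: a docx is an ordinary zip archive, so the unzip step can also be done from Java using only the standard library. The following is a minimal sketch, not docx4j's own API. The class name `DocxXmlDump`, the method name `readDocumentXml`, and the file name `form.docx` are made up for illustration, and it assumes Java 9 or later and that the part is encoded as UTF-8 (the usual case).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, named for this example only.
class DocxXmlDump {

    // Returns the raw XML of the word/document.xml part inside a .docx file.
    // Assumes the part is UTF-8 encoded.
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // "form.docx" is a placeholder; pass the real path as the first argument.
        String path = args.length > 0 ? args[0] : "form.docx";
        String xml = readDocumentXml(path);
        // Print only the beginning, since the whole part can be very large.
        System.out.println(xml.substring(0, Math.min(xml.length(), 2000)));
    }
}
```

The cut at 2000 characters is arbitrary and may land in the middle of an element, so it is only good for a first look at the markup.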

Re: Parsing a docx

Posted: Wed Aug 08, 2012 10:57 pm
by stoan
Your message contains 883470 characters. The maximum number of allowed characters is 60000.

I have attached the xml file.

Re: Parsing a docx

Posted: Thu Aug 09, 2012 9:07 am
by jason
No attachment I could see ...

Please just post an excerpt of the XML; we don't need to see it all. Just the first instances of:

Form name

Form description

Tab name

Tab description

:

Field name

Field description
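
One way to find those first instances in a very large document.xml is to list the text of each paragraph in order, then search the raw XML for the text of interest. The sketch below relies only on the fact that WordML stores paragraphs as `w:p` elements and their text in `w:t` elements. It uses the JDK's DOM parser rather than docx4j, the class and method names (`ParagraphLister`, `paragraphTexts`) are hypothetical, and it makes no assumption about how the Tabs and Fields in this particular document are marked up, since that XML has not been posted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class ParagraphLister {

    // Namespace of the main WordprocessingML elements (the "w:" prefix).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated w:t text of every w:p element, in document order.
    // Caveat: a paragraph nested inside another one (e.g. in a text box) is
    // listed separately AND its text is also included in the outer paragraph.
    static List<String> paragraphTexts(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the *NS lookups below
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        List<String> result = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getTextContent());
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written sample, not taken from the poster's document.
        String sample = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Form name</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Form description</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        List<String> texts = paragraphTexts(sample);
        for (int i = 0; i < texts.size(); i++) {
            System.out.println(i + ": " + texts.get(i));
        }
    }
}
```

The listing shows only the text, not the styles or tables around it, so it helps locate the relevant paragraphs but does not replace posting the XML excerpt itself.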