Can docx4j search some heading content and read tables

Posted: Thu Oct 31, 2013 2:30 am
by david.zhaowl
Hi everyone,

I'm working on a project that needs to read and write the tables that follow a certain heading in a docx file. From a bit of searching, I gather that docx4j is the most powerful Java library for processing docx files. Could you please tell me whether docx4j can search for a certain heading, and then read and write the tables after that heading?

Thanks in advance for your help.

Re: Can docx4j search some heading content and read tables

Posted: Thu Oct 31, 2013 8:05 am
by jason
Sure.

You can search for the heading via TraversalUtil (see https://github.com/plutext/docx4j/tree/ ... 4j/finders for some examples) or using XPath.

See Getting Started for more info.

You can search for headings via their text content, or you could ID them.

With docx4j nightlies there is get|setParaId in https://github.com/plutext/docx4j/blob/ ... wml/P.java

Or you could put a content control around the heading and table, and use a w:tag on the content control (or just its w:id). This would be a neat approach.

If your intention is just to add rows of data to the tables, I'd encourage you to explore content control data binding (OpenDoPE extensions - www.opendope.org), since this way, all the hard work is done for you.

Re: Can docx4j search some heading content and read tables

Posted: Sat Nov 02, 2013 1:28 am
by david.zhaowl
Thanks for your reply, Jason. I'll start reading the documentation and try some demos. :D

Re: Can docx4j search some heading content and read tables

Posted: Sat Dec 07, 2013 3:50 am
by david.zhaowl
I'm now trying the finder approach. Is there a Java class that maps to headings in a Word file? Thanks.

David Z.

Re: Can docx4j search some heading content and read tables

Posted: Sat Dec 07, 2013 5:53 am
by jason
No, a heading is just a paragraph (w:p) with a particular style. Two examples:

    <w:p >
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>H1</w:t>
      </w:r>
    </w:p>


    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2"/>
      </w:pPr>
      <w:r>
        <w:t>H2</w:t>
      </w:r>
    </w:p>
 
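
To make the "heading is just a styled paragraph" point concrete, here is a minimal sketch of the style check. It is not docx4j code: it uses only the JDK's DOM parser on the raw markup so that it runs standalone, and the class and method names (HeadingStyleCheck, styleOf, parse) are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HeadingStyleCheck {
    // WordprocessingML main namespace (the "w:" prefix).
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the w:pStyle value of a w:p element, or null if it has none. */
    static String styleOf(Element p) {
        NodeList styles = p.getElementsByTagNameNS(W, "pStyle");
        if (styles.getLength() == 0) {
            return null;
        }
        return ((Element) styles.item(0)).getAttributeNS(W, "val");
    }

    /** Parses an XML fragment (namespace-aware) and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = parse("<w:p xmlns:w=\"" + W + "\">"
                + "<w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr>"
                + "<w:r><w:t>H1</w:t></w:r></w:p>");
        System.out.println(styleOf(p) + ": " + p.getTextContent()); // prints "Heading1: H1"
    }
}
```

With docx4j objects, the same value should be reachable from a P through getPPr().getPStyle().getVal(), with null checks at each step, since a paragraph need not have properties or a style.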


More technically, a heading style normally has an outline level (w:outlineLvl) set:

  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:next w:val="Normal"/>
    <w:link w:val="Heading1Char"/>
    <w:uiPriority w:val="9"/>
    <w:qFormat/>
    <w:rsid w:val="00DC7557"/>
    <w:pPr>
      <w:keepNext/>
      <w:keepLines/>
      <w:spacing w:before="480" w:after="0"/>
      <w:outlineLvl w:val="0"/>
    </w:pPr>
    <w:rPr>
      <w:rFonts w:asciiTheme="majorHAnsi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi" w:cstheme="majorBidi"/>
      <w:b/>
      <w:bCs/>
      <w:color w:val="365F91" w:themeColor="accent1" w:themeShade="BF"/>
      <w:sz w:val="28"/>
      <w:szCs w:val="28"/>
    </w:rPr>
  </w:style>
 
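
If you would rather detect headings by outline level than by style name, one option is to build a lookup from style ID to w:outlineLvl. The sketch below is an illustration only: it again uses plain JDK DOM parsing rather than docx4j, the names (OutlineLevels, outlineLevels, SAMPLE) are hypothetical, and it ignores levels inherited through w:basedOn.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OutlineLevels {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Tiny stand-in for a styles part: one heading style, one ordinary style.
    static final String SAMPLE = "<w:styles xmlns:w=\"" + W + "\">"
            + "<w:style w:type=\"paragraph\" w:styleId=\"Heading1\">"
            + "<w:pPr><w:outlineLvl w:val=\"0\"/></w:pPr></w:style>"
            + "<w:style w:type=\"paragraph\" w:styleId=\"Normal\"/>"
            + "</w:styles>";

    /** Maps each style ID that directly declares w:outlineLvl to that level. */
    static Map<String, Integer> outlineLevels(String stylesXml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stylesXml)));
            Map<String, Integer> levels = new LinkedHashMap<>();
            NodeList styles = doc.getElementsByTagNameNS(W, "style");
            for (int i = 0; i < styles.getLength(); i++) {
                Element style = (Element) styles.item(i);
                NodeList lvl = style.getElementsByTagNameNS(W, "outlineLvl");
                if (lvl.getLength() > 0) {
                    String val = ((Element) lvl.item(0)).getAttributeNS(W, "val");
                    levels.put(style.getAttributeNS(W, "styleId"), Integer.parseInt(val));
                }
            }
            return levels;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(outlineLevels(SAMPLE)); // prints "{Heading1=0}"
    }
}
```

A paragraph whose style ID appears in this map with level 0 would then count as a top-level heading, whatever the style happens to be called.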

Re: Can docx4j search some heading content and read tables

Posted: Sat Dec 07, 2013 6:26 am
by david.zhaowl
So, for example, if I want to find "Test Cases" in Heading1 style, I have to go through all the paragraphs in the document and check whether each one uses the Heading1 style? Is there a more efficient way to do this? Thanks.

David Z.

Re: Can docx4j search some heading content and read tables

Posted: Sat Dec 07, 2013 11:55 am
by jason
If you can assume none of your headings are inside tables, it is dead easy. You just iterate over the content:

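    // Visits top-level body content only; paragraphs nested in tables are not seen.
    // Needs: import org.docx4j.wml.P;
    for (Object o : wordMLPackage.getMainDocumentPart().getContent()) {
        if (o instanceof P) {
            // etc
        }
    }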
 


If the headings could be inside tables, then you need to traverse, or use XPath. Both are quite straightforward; your XPath expression could require w:pStyle/@w:val to be "Heading1".
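
As a rough sketch of the XPath route, the expression below matches every paragraph styled Heading1 at any depth, including inside table cells. It is run here with the JDK's javax.xml.xpath on raw markup so the example is self-contained; the class, method, and SAMPLE names are invented for the illustration. In docx4j, a similar expression would normally be passed to MainDocumentPart.getJAXBNodesViaXPath instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HeadingXPath {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Stand-in document: one top-level heading, one heading nested in a table cell.
    static final String SAMPLE = "<w:document xmlns:w=\"" + W + "\"><w:body>"
            + "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr><w:r><w:t>Test Cases</w:t></w:r></w:p>"
            + "<w:tbl><w:tr><w:tc>"
            + "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr><w:r><w:t>Nested</w:t></w:r></w:p>"
            + "</w:tc></w:tr></w:tbl>"
            + "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
            + "</w:body></w:document>";

    /** Returns the text of every paragraph styled Heading1, in document order. */
    static List<String> heading1Texts(String documentXml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "w" prefix used in the expression to the WordprocessingML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList hits = (NodeList) xpath.evaluate(
                    "//w:p[w:pPr/w:pStyle/@w:val='Heading1']", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                texts.add(hits.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(heading1Texts(SAMPLE)); // prints "[Test Cases, Nested]"
    }
}
```

Adding a text test to the predicate would narrow the match to one specific heading, though heading text split across several runs makes that less reliable than comparing the paragraph's combined text in code.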

Re: Can docx4j search some heading content and read tables

Posted: Tue Dec 10, 2013 5:50 am
by david.zhaowl
Hi Jason,

Thanks for your reply. I have found the heading successfully. What I actually want is to read all the tables after this heading. At first I thought the heading and its tables were held together in some data structure, so I planned to find the heading and then call heading.getParent().getContent() to get all the tables after it. I now see I was wrong: the tables' parent object is actually the MainDocumentPart. Is there a way to get the tables after this heading? I can't ID these tables because they will be entered by other users, and there are other tables in the document (before the heading), so I can't use "for (Object o : documentPart.getContent()) { if (o instanceof Tbl) { ... } }" to get all the tables in the document.

Looking forward to your reply. Thanks.

David Z.

Re: Can docx4j search some heading content and read tables

Posted: Tue Dec 10, 2013 6:53 am
by jason
You could track your state with a boolean foundHeading as you iterate over the content... if you are sure your table will immediately follow your heading.
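
A minimal sketch of that foundHeading idea follows. It makes assumptions beyond what is stated above: the heading is a Heading1 paragraph directly under the body, and the section of interest ends at the next Heading1. To stay self-contained it walks the raw w:body children with the JDK DOM instead of docx4j objects, and all names in it are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TablesAfterHeading {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Stand-in body: a table before the heading, two in its section, one after.
    static final String SAMPLE = "<w:document xmlns:w=\"" + W + "\"><w:body>"
            + tbl("before") + p("Heading1", "Test Cases")
            + p("Heading2", "Case A") + tbl("A")
            + p("Heading2", "Case B") + tbl("B")
            + p("Heading1", "Appendix") + tbl("C")
            + "</w:body></w:document>";

    static String p(String style, String text) {
        return "<w:p><w:pPr><w:pStyle w:val=\"" + style + "\"/></w:pPr>"
                + "<w:r><w:t>" + text + "</w:t></w:r></w:p>";
    }

    static String tbl(String text) {
        return "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>" + text
                + "</w:t></w:r></w:p></w:tc></w:tr></w:tbl>";
    }

    static boolean isHeading1(Element p) {
        NodeList s = p.getElementsByTagNameNS(W, "pStyle");
        return s.getLength() > 0
                && "Heading1".equals(((Element) s.item(0)).getAttributeNS(W, "val"));
    }

    /** Collects top-level tables between the named Heading1 and the next Heading1. */
    static List<Element> tablesAfter(Element body, String headingText) {
        List<Element> tables = new ArrayList<>();
        boolean foundHeading = false;
        for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !W.equals(n.getNamespaceURI())) {
                continue;
            }
            Element e = (Element) n;
            if ("p".equals(e.getLocalName()) && isHeading1(e)) {
                if (foundHeading) {
                    break; // the next Heading1 closes the section
                }
                foundHeading = headingText.equals(e.getTextContent().trim());
            } else if (foundHeading && "tbl".equals(e.getLocalName())) {
                tables.add(e);
            }
        }
        return tables;
    }

    static Element parseBody(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagNameNS(W, "body").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Element t : tablesAfter(parseBody(SAMPLE), "Test Cases")) {
            System.out.println(t.getTextContent());
        }
    }
}
```

The same loop shape carries over to the list returned by getMainDocumentPart().getContent(). One thing to watch there is that items may arrive wrapped in a JAXBElement, so it is usually worth passing each one through XmlUtils.unwrap before the instanceof checks.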

What do you want to do with the table once you've found it?

You could also consider putting the table inside a content control (with or without the heading). Could you put a content control in the input docx, and train users to enter their table inside that (perhaps with an instruction PLACE TABLE HERE)?

Re: Can docx4j search some heading content and read tables

Posted: Wed Dec 11, 2013 2:25 am
by david.zhaowl
I want to implement two functions:
1. Read all the data from the tables after the specific heading and import it into a certain tool.
2. Retrieve data from the tool and export it to all the tables after the heading.

All the tables I want follow the Heading1, and each table has its own Heading2.

The docx file is a template to which users can add as many tables as they want; each table has many columns.

I don't really know how content controls work. Could you give me an example? I could try to push that to users if there's a proper solution.

Many thanks for your effort here.

David Zhao.