
Parsing of form fields in Word document

Posted: Thu Feb 12, 2015 12:52 am
by chrisspiel
Dear all,

I am using docx4j to parse Word documents that contain a number of form fields, each with a unique field name. The documents are questionnaires, and I want to extract the entered values into a database using an automatic import tool. The field names act as unique keys, so each value can be matched to the corresponding database field.

Generally speaking, I have been able to parse these fields (i.e. extract the field name and its value, if there is one) using the following Java code:

                WordprocessingMLPackage wordMLPackage = Docx4J.load(word_file);
                MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

                // find fields
                ComplexFieldLocator fl = new ComplexFieldLocator();
                new TraversalUtil(documentPart.getContent(), fl);
                {
                    // canonicalise and setup fieldRefs
                    List<FieldRef> fieldRefs = new ArrayList<>();
                    for (P p : fl.getStarts()) {
                        FieldsPreprocessor.canonicalise(p, fieldRefs);
                    }
                   
                    for (FieldRef fr : fieldRefs) {
                        String[] l = new String[2];
                        //initialize with empty string
                        l[1] = "";
                        //get "begin" area of field -- it contains fldchar definition and checkbox values
                        R beg = fr.getBeginRun();
                        ClassFinder fldchar_finder = new ClassFinder(FldChar.class);
                        new TraversalUtil(beg.getContent(), fldchar_finder);
                        for (Object fld_o : fldchar_finder.results) {
                            FldChar fld = (FldChar) fld_o;
                            if (fld.getFldCharType() == STFldCharType.BEGIN) {
                                ClassFinder ctff_finder = new ClassFinder(CTFFData.class);
                                new TraversalUtil(fld, ctff_finder);
                                for (Object ctff_obj : ctff_finder.results) {
                                    if (ctff_obj instanceof CTFFData) {
                                        CTFFData c_d = (CTFFData) ctff_obj;
                                        List<JAXBElement<?>> el_list = c_d.getNameOrEnabledOrCalcOnExit();
                                        for (JAXBElement<?> j_el : el_list) {
                                            if (j_el.getValue() instanceof CTFFName) {
                                                CTFFName n = (CTFFName) j_el.getValue();
                                                l[0] = n.getVal();
                                            }
                                            if (j_el.getValue() instanceof CTFFCheckBox) {
                                                CTFFCheckBox c = (CTFFCheckBox) j_el.getValue();
                                                if (c != null) {
                                                    if (c.getChecked() != null) {
                                                        l[1] = c.getChecked().isVal() ? "true" : "false";
                                                    } else if (c.getDefault() != null) {
                                                        l[1] = c.getDefault().isVal() ? "true" : "false";
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                        //get "result" area of field -- it contains value in case of text fields
                        R res = fr.getResultsSlot();
                        for (Object o : res.getContent()) {
                            //run content is not always wrapped in a JAXBElement (e.g. Br), so check before casting
                            if (o instanceof JAXBElement
                                    && ((JAXBElement<?>) o).getValue() instanceof Text) {
                                //add all values to l[1] as they may be distributed over several text elements
                                l[1] += ((Text) ((JAXBElement<?>) o).getValue()).getValue();
                            }
                        }

                        //remove all unnecessary whitespace
                        l[1] = l[1].trim();
                        System.out.println(l[0] + ": " + l[1]);
                        lines.add(l);
                    }
                }
 


The code above always extracts the field names. The field values, however, are only extracted when the result area of the field contains no <p>, that is, when the user has not added a paragraph while filling out the form field. Otherwise fr.getResultsSlot() yields no content and no value is extracted.

My question is: is this a bug or intended behavior? What is the recommended way to parse this kind of form field so that the value is always extracted?

Thanks in advance!

Best regards

Christian

Re: Parsing of form fields in Word document

Posted: Fri Feb 13, 2015 9:55 pm
by jason
sample docx?

Re: Parsing of form fields in Word document

Posted: Sun Feb 15, 2015 5:16 am
by chrisspiel
I have added two sample docx files - "working" is without <p>, "not working" contains <p> tags.

Thanks and best regards

Christian

Re: Parsing of form fields in Word document

Posted: Mon Feb 16, 2015 6:31 pm
by cyruswong
Hi Christian and Jason

I am stuck on the same problem, and I have opened another thread about it:
docx-java-f6/form-field-extraction-t2078.html

I think it is related to the following XML structure:

<w:p w:rsidR="0065667C" w:rsidRDefault="00B02AD0" w:rsidP="0065667C">
  <w:r>
    <w:t xml:space="preserve">Q1.</w:t>
  </w:r>
  <w:r>
    <w:fldChar w:fldCharType="begin">
      <w:ffData>
        <w:name w:val="Q1"/>
        <w:enabled/>
        <w:calcOnExit w:val="0"/>
        <w:textInput/>
      </w:ffData>
    </w:fldChar>
  </w:r>
  <w:bookmarkStart w:id="4" w:name="Q1"/>
  <w:r>
    <w:instrText xml:space="preserve">FORMTEXT</w:instrText>
  </w:r>
  <w:r>
    <w:fldChar w:fldCharType="separate"/>
  </w:r>
  <w:r w:rsidR="0065667C">
    <w:t>line 1</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="0065667C" w:rsidRDefault="0065667C" w:rsidP="0065667C">
  <w:r>
    <w:t>line 2</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="0065667C" w:rsidRDefault="0065667C" w:rsidP="0065667C">
  <w:r>
    <w:t>line 3</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="0065667C" w:rsidRDefault="0065667C" w:rsidP="0065667C"/>
<w:p w:rsidR="00D87B64" w:rsidRDefault="0065667C" w:rsidP="0065667C">
  <w:r>
    <w:t>dadsad</w:t>
  </w:r>
  <w:bookmarkStart w:id="5" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="5"/>
  <w:r w:rsidR="00B02AD0">
    <w:fldChar w:fldCharType="end"/>
  </w:r>
  <w:bookmarkEnd w:id="4"/>
 


The text "line 1" to "line 3" is the data in the field. Because it is spread over several paragraphs, your code will not work for this multi-line case.
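
To check whether a given document is affected, one option is to look at the raw markup of word/document.xml and test whether any paragraph contains an unbalanced set of "begin" and "end" fldChar elements. The following is only a rough sketch under some assumptions: it uses the JDK's DOM parser rather than docx4j, the class and method names (FieldSpanCheck, hasFieldAcrossParagraphs) are made up for illustration, and it does not try to handle paragraphs nested inside other paragraphs (e.g. text boxes).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical diagnostic helper; it is not part of docx4j. */
final class FieldSpanCheck {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns true if some paragraph opens or closes a field it does not fully contain. */
    static boolean hasFieldAcrossParagraphs(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match the w: namespace
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList fldChars =
                    ((Element) paragraphs.item(i)).getElementsByTagNameNS(W, "fldChar");
            int open = 0; // fields begun but not yet ended within this paragraph
            for (int j = 0; j < fldChars.getLength(); j++) {
                String type = ((Element) fldChars.item(j)).getAttributeNS(W, "fldCharType");
                if ("begin".equals(type)) {
                    open++;
                } else if ("end".equals(type)) {
                    open--;
                }
                if (open < 0) {
                    return true; // "end" of a field that began in an earlier paragraph
                }
            }
            if (open != 0) {
                return true; // "begin" without a matching "end" in the same paragraph
            }
        }
        return false;
    }
}
```

If this returns true, at least one field runs across a paragraph boundary, which is the situation in the markup above.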

Re: Parsing of form fields in Word document

Posted: Tue Feb 17, 2015 9:14 am
by jason
As per the Javadoc in FieldsPreprocessor:

* Currently the canonicalisation is done at the paragraph level,
* so it is not suitable for fields (such as TOC) which extend across paragraphs.


The method:

public static P canonicalise(P p, List<FieldRef> fieldRefs)
 


should not be used where a field in the P extends into a subsequent P (as is the case with your respective examples, Christian and Cyrus, both of which are FORMTEXT fields).

FieldsPreprocessor as it stands is intended primarily for MERGEFIELD and DOCPROPERTY fields.

A modified design would be required to handle fields which extend across paragraphs.
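
To give an idea of what such a design could look like, here is a minimal sketch that ignores paragraph boundaries altogether: it walks the markup in document order and tracks the begin / separate / end state of the current field, collecting every w:t found between "separate" and "end". This is not how FieldsPreprocessor works and it is not docx4j API. It operates on the raw XML of word/document.xml using the JDK's DOM parser, the class name CrossParagraphFieldReader is hypothetical, and it assumes fields are not nested and that only text form fields (FORMTEXT) are of interest.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of docx4j: reads form field results across paragraphs. */
final class CrossParagraphFieldReader {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String name;       // name of the field currently open, or null
    private boolean inResult;  // true between the "separate" and "end" fldChar

    /** Maps each named form field to its result text; paragraphs are joined with '\n'. */
    static Map<String, String> read(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match the w: namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            CrossParagraphFieldReader reader = new CrossParagraphFieldReader();
            reader.walk(doc.getDocumentElement());
            return reader.values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk in document order, so the field state survives paragraph boundaries.
    private void walk(Node node) {
        if (node instanceof Element && W.equals(node.getNamespaceURI())) {
            Element e = (Element) node;
            String tag = e.getLocalName();
            if ("fldChar".equals(tag)) {
                onFldChar(e);
            } else if ("p".equals(tag) && inResult && text.length() > 0) {
                text.append('\n'); // a new paragraph started inside the field result
            } else if ("t".equals(tag) && inResult) {
                text.append(e.getTextContent());
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child);
        }
    }

    private void onFldChar(Element fldChar) {
        String type = fldChar.getAttributeNS(W, "fldCharType");
        if ("begin".equals(type)) {
            // w:ffData/w:name carries the form field name, if there is one
            NodeList names = fldChar.getElementsByTagNameNS(W, "name");
            name = names.getLength() > 0
                    ? ((Element) names.item(0)).getAttributeNS(W, "val") : null;
            text.setLength(0);
            inResult = false;
        } else if ("separate".equals(type)) {
            inResult = true;
        } else if ("end".equals(type)) {
            if (name != null) {
                values.put(name, text.toString());
            }
            name = null;
            inResult = false;
        }
    }
}
```

Calling CrossParagraphFieldReader.read(xml) with the document XML returns a map from field name to entered text, with a newline wherever the user started a new paragraph inside the field. Checkbox state still has to be read from ffData as in Christian's code.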

Re: Parsing of form fields in Word document

Posted: Wed Feb 25, 2015 8:47 pm
by cyruswong
Oh! I see!
Thank you, Jason, for your detailed explanation!
I think I will have to move the extraction task out of Java, as it would be too complex and would need hard-coded XPath with docx4j or POI, and fall back to triggering a VBA script on Windows instead.