using docx4j w/ word 2003 merge fields

Posted: Tue Feb 22, 2011 7:27 am
by ward_f
Has anyone used docx4j to deal with merge fields from Word 2003? If so, I'd welcome any tips or best practices.

I created a simple merge template using Word 2003 and saved it as a docx. Here's an excerpt from document.xml:

<w:p w:rsidR="000808B5" w:rsidRDefault="000808B5">
  <w:r>
    <w:t xml:space="preserve">But this third and last paragraph has a merge field named Biz right here: </w:t>
  </w:r>
  <w:fldSimple w:instr=" MERGEFIELD Biz \* MERGEFORMAT ">
    <w:r>
      <w:rPr>
        <w:noProof/>
      </w:rPr>
      <w:t>«Biz»</w:t>
    </w:r>
  </w:fldSimple>
</w:p>

It looks like it could be "as simple as" text substitution, but since I'm still learning all this, I'm not sure whether there are gotchas I haven't run across yet.
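To make the "text substitution" idea concrete, here is a minimal sketch that works on the raw document.xml string with the JDK's DOM classes rather than docx4j. The class and method names (`SimpleMergeSketch`, `merge`, `mergeFieldName`) are made up for illustration. It only handles `fldSimple` fields whose name contains no white space, and it leaves the field in place, so Word may recompute the result if the fields are updated later.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class SimpleMergeSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the name in " MERGEFIELD Biz \* MERGEFORMAT ", or null for other fields. */
    static String mergeFieldName(String instr) {
        String[] tokens = instr.trim().split("\\s+");
        if (tokens.length >= 2 && tokens[0].equalsIgnoreCase("MERGEFIELD")) {
            return tokens[1].replace("\"", "");
        }
        return null;
    }

    /** Replaces the displayed result of each matching fldSimple with the mapped value. */
    static String merge(String xml, Map<String, String> data) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            NodeList fields = doc.getElementsByTagNameNS(W, "fldSimple");
            for (int i = 0; i < fields.getLength(); i++) {
                Element fld = (Element) fields.item(i);
                String name = mergeFieldName(fld.getAttributeNS(W, "instr"));
                if (name == null || !data.containsKey(name)) {
                    continue; // not a merge field, or no data supplied for it
                }
                // Put the value in the first w:t and blank any further ones.
                NodeList texts = fld.getElementsByTagNameNS(W, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    texts.item(j).setTextContent(j == 0 ? data.get(name) : "");
                }
            }

            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not process document XML", e);
        }
    }
}
```

The obvious gaps are fields stored in the other ("complex") form and names that are quoted because they contain spaces.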

Thanks in advance,
Ward

Oh, and nice work on docx4j!

Re: using docx4j w/ word 2003 merge fields

Posted: Tue Feb 22, 2011 11:43 am
by jason
From http://www.documentinteropinitiative.or ... ec913.aspx

Fields shall be implemented in XML using either of two approaches:

• As a simple field implementation, using the fldSimple element, or

• As a complex field implementation, using a set of runs involving the fldChar and instrText elements.

For a simple field implementation, only one element, fldSimple, shall be used, in which case, its instr attribute shall contain a field, and the body of the element shall contain the most recently updated field result. [Example: Here is the corresponding XML for a simple field implementation of DATE:

<w:r>
  <w:fldSimple w:instr="DATE"> 12/31/2005 </w:fldSimple>
</w:r>

end example]

For a complex field implementation, a set of runs shall be used with each run containing, in sequence, the following elements:

• fldChar with attribute fldCharType value begin,

• One or more instrText elements, which, collectively, contain a complete field,

• Optionally,

• fldChar with attribute fldCharType value separate, which separates the field from its field result,

• Any number of runs and paragraphs that contains the most recently updated field result, and

• fldChar with attribute fldCharType value end.

[Note: Fields that are for display purposes only have no need to, and do not, store a field result. end note][Example: Here is the corresponding XML for a complex field implementation of DATE:

<w:r>
  <w:fldChar w:fldCharType="begin"/>
</w:r>

<w:r>
  <w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>

<w:r>
  <w:fldChar w:fldCharType="separate"/>
</w:r>

<w:r>
  <w:t>12/31/2005</w:t>
</w:r>

<w:r>
  <w:fldChar w:fldCharType="end"/>
</w:r>

end example]

[Note: Every simple field implementation for a given field has a corresponding complex field implementation. However, not every complex field implementation has a corresponding simple field implementation. If some characters in a field have different run properties than others, that field must be implemented using multiple runs, and that requires that complex field implementation be used. For an example, see §2.16.4.3, where the first letter of a DATE field is made bold, underlined, and red, while the other letters have none of these properties. end note]

As shown in §2.16.1, the instruction of one field can be another field, allowing fields to nest. In such cases, the XML run sequence for the inner field is defined at the point of reference for that inner field, inside the outer field's XML run sequence. [Example: Consider the following sentence:

It's IF DATE \@ "M-d"<>"1-1" "not " new year's day.

The IF field contains the nested field DATE \@ "M-d". When updated, on January 1 of any year, the result sentence is "It's new year's day." On all other days of the year, the resulting sentence is "It's not new year's day."


Note that fields can be nested.

The section before contains an overview of the syntax:
The general syntax of a field is as follows:

field:
    field-type [ instruction ]

field-type:
    date-and-time
    document-automation
    document-information
    equations-and-formulas
    index-and-tables
    links-and-references
    mail-merge
    numbering
    user-information
    form-field

date-and-time:
    CREATEDATE | DATE | EDITTIME | PRINTDATE | SAVEDATE | TIME

document-automation:
    COMPARE | DOCVARIABLE | GOTOBUTTON | IF | MACROBUTTON | PRINT

document-information:
    AUTHOR | COMMENTS | DOCPROPERTY | FILENAME | FILESIZE | INFO
    | KEYWORDS | LASTSAVEDBY | NUMCHARS | NUMPAGES | NUMWORDS | SUBJECT
    | TEMPLATE | TITLE

equations-and-formulas:
    = formula | ADVANCE | EQ | SYMBOL

index-and-tables:
    INDEX | RD | TA | TC | TOA | TOC | XE

links-and-references:
    AUTOTEXT | AUTOTEXTLIST | BIBLIOGRAPHY | CITATION | HYPERLINK | INCLUDEPICTURE | INCLUDETEXT
    | LINK | NOTEREF | PAGEREF | QUOTE | REF | STYLEREF

mail-merge:
    ADDRESSBLOCK | ASK | COMPARE | DATABASE | FILLIN | GREETINGLINE | IF
    | MERGEFIELD | MERGEREC | MERGESEQ | NEXT | NEXTIF | SET | SKIPIF

numbering:
    AUTONUM | AUTONUMLGL | AUTONUMOUT | BARCODE | LISTNUM | PAGE | REVNUM
    | SECTION | SECTIONPAGES | SEQ

user-information:
    USERADDRESS | USERINITIALS | USERNAME

form-field:
    FORMCHECKBOX | FORMDROPDOWN | FORMTEXT

instruction:
    field
    field-argument
    switches
    field-argument switches
    switches field-argument

field-argument:
    [ " ] text [ " ]

switches:
    switch
    switch switches

switch:
    formatting-switch
    field-specific-switch

formatting-switch:
    date-and-time-formatting-switch
    numeric-formatting-switch
    general-formatting-switch

field-specific-switch:
    \field-switch-character [ field-argument ]

field-switch-character:
    !
    one or two Latin letters

formula is discussed in §2.16.3, and formatting-switches are discussed in §2.16.4.

If the text in a field-argument contains white space, the delimiting double-quote characters shall be present; otherwise, they are optional. To include a double-quote character in text, it shall be preceded with a backslash (\). [Example: The field argument "\"name\"" results in the argument's actually being "name". end example] To include a backslash character in text, it shall be preceded with another backslash (\). [Example: File system pathnames on some systems use a backslash as a directory separator, as in the field

INCLUDETEXT "E:\\ReadMe.txt"

in which case, each such separator needs to be preceded with a backslash, as shown above. end example]

Arbitrary amount of white space can occur before the first token, after the last token, and between successive tokens, including no white space at all.
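As a rough illustration of the quoting rules just quoted, here is a small tokenizer sketch for a field instruction string. `FieldInstructionTokenizer` is a made-up name, and the handling is a simplification: backslash escapes are only interpreted inside double quotes, so an unquoted switch such as `\*` comes through verbatim, and a quote always starts a new token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of docx4j.
class FieldInstructionTokenizer {

    /** Splits an instruction such as  MERGEFIELD "First Name" \* MERGEFORMAT  into tokens. */
    static List<String> tokenize(String instr) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < instr.length(); i++) {
            char c = instr.charAt(i);
            if (inQuotes) {
                if (c == '\\' && i + 1 < instr.length()) {
                    cur.append(instr.charAt(++i));   // \" becomes ", \\ becomes \
                } else if (c == '"') {
                    tokens.add(cur.toString());      // closing quote ends the argument
                    cur.setLength(0);
                    inQuotes = false;
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                flush(cur, tokens);                  // no white space is needed before a quote
                inQuotes = true;
            } else if (Character.isWhitespace(c)) {
                flush(cur, tokens);
            } else {
                cur.append(c);                       // unquoted text and switches are kept verbatim
            }
        }
        flush(cur, tokens);                          // also keeps an unterminated quoted argument
        return tokens;
    }

    private static void flush(StringBuilder cur, List<String> tokens) {
        if (cur.length() > 0) {
            tokens.add(cur.toString());
            cur.setLength(0);
        }
    }
}
```

With this sketch, the instruction from the first post tokenizes to `MERGEFIELD`, `Biz`, `\*`, `MERGEFORMAT`, and the quoted INCLUDETEXT argument above comes out as `E:\ReadMe.txt`.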


The first step to adding higher-level support for fields to docx4j would be to create a field parser - preferably one supporting nested fields. This requires parsing either the flat JAXB element or XML representations...
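To give a feel for what such a parser involves, here is a sketch that turns a flat sequence of run contents into a tree of possibly nested fields using a stack. It deliberately works on a made-up `Event` type (BEGIN, INSTR, SEPARATE, TEXT, END) rather than on docx4j's JAXB objects, so mapping `fldChar`, `instrText` and `t` elements onto these events is left as an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a complex field; none of these types exist in docx4j.
class ComplexFieldParser {
    enum Kind { BEGIN, INSTR, SEPARATE, TEXT, END }

    static final class Event {
        final Kind kind;
        final String text; // only used for INSTR and TEXT
        Event(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static final class Field {
        final StringBuilder instruction = new StringBuilder();
        final StringBuilder result = new StringBuilder();
        final List<Field> children = new ArrayList<>(); // fields nested inside this one
        boolean seenSeparate;
    }

    /** Builds the field tree; text outside any field is ignored. */
    static List<Field> parse(List<Event> events) {
        List<Field> top = new ArrayList<>();
        Deque<Field> open = new ArrayDeque<>();
        for (Event e : events) {
            switch (e.kind) {
                case BEGIN: {
                    Field f = new Field();
                    if (open.isEmpty()) top.add(f); else open.peek().children.add(f);
                    open.push(f);
                    break;
                }
                case INSTR:
                    if (!open.isEmpty()) open.peek().instruction.append(e.text);
                    break;
                case SEPARATE:
                    if (!open.isEmpty()) open.peek().seenSeparate = true;
                    break;
                case TEXT:
                    // Only text after the separator counts as the field result.
                    if (!open.isEmpty() && open.peek().seenSeparate) open.peek().result.append(e.text);
                    break;
                case END:
                    if (open.isEmpty()) throw new IllegalStateException("end without begin");
                    open.pop();
                    break;
            }
        }
        if (!open.isEmpty()) throw new IllegalStateException("unterminated field");
        return top;
    }
}
```

Fed the events for the IF example quoted above, this yields one top-level field whose `children` list holds the nested DATE field.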

It would also be worth articulating the use cases we are trying to support.

Re: using docx4j w/ word 2003 merge fields

Posted: Wed Feb 23, 2011 3:53 am
by ward_f
Thank you for the input, Jason.

>> ... articulate the use cases

Our current needs are rather simple. We must be able to produce merge documents from within our application (supporting both Word 2003 and 2007). The merge fields in the document correspond to "hooks" in our application, and each hook supplies the data for its field. For example, a LastName merge field corresponds to the LastName "hook" in our app. But this is just a domain-specific detail.

I may be oversimplifying here, but from an API perspective, for our needs, I could envision something along the lines of the following:

List<String> listOfMergeFieldNames = documentPart.getMergeFieldNames();

For each name in the list, I would do what I need to do to get the associated data for that merge field from our application, building a map along the way:

Map<String, String> map = new HashMap<String, String>();
map.put("FirstName", "Ferdinand");
map.put("LastName", "Porsche");
...

And then call something along the lines of:
documentPart.doMergeFields(map);

which replaces each token with the actual data. Somewhere along the way the merge document would have to be "cloned", so that the replacement happens in a new document.
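To show what I mean by the first call, here is a rough stand-alone sketch of a `getMergeFieldNames` written against the raw document.xml string with the JDK's DOM and regex classes. Nothing here is existing docx4j API: `MergeFieldNamesSketch` is a made-up class, and it assumes that a complex field keeps its whole instruction in a single `instrText` element, which may not always hold.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class MergeFieldNamesSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Matches  MERGEFIELD Biz  as well as  MERGEFIELD "Last Name".
    private static final Pattern MERGEFIELD = Pattern.compile(
            "^\\s*MERGEFIELD\\s+(?:\"([^\"]+)\"|(\\S+))", Pattern.CASE_INSENSITIVE);

    /** Returns distinct merge field names: simple fields first, then complex ones. */
    static List<String> getMergeFieldNames(String documentXml) {
        Set<String> names = new LinkedHashSet<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            NodeList simple = doc.getElementsByTagNameNS(W, "fldSimple");
            for (int i = 0; i < simple.getLength(); i++) {
                addName(((Element) simple.item(i)).getAttributeNS(W, "instr"), names);
            }
            NodeList complex = doc.getElementsByTagNameNS(W, "instrText");
            for (int i = 0; i < complex.getLength(); i++) {
                addName(complex.item(i).getTextContent(), names);
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read document XML", e);
        }
        return new ArrayList<>(names);
    }

    private static void addName(String instr, Set<String> names) {
        Matcher m = MERGEFIELD.matcher(instr);
        if (m.find()) {
            names.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
    }
}
```

The returned list is what I would loop over to build the map of values.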

>> The first step to adding ... create a field parser

Okay, good. That is pretty much the direction I figured. I'm climbing the API learning curve now. If you have a recommendation on which sample best illustrates good practice or provides a platform for expansion, I'm all ears. I'm going through them now. The OpenMainDocumentAndTraverse sample might hold promise.

Thanks again,
Ward

Re: using docx4j w/ word 2003 merge fields

Posted: Wed Feb 23, 2011 8:42 pm
by jason
Brainstorming here: one approach would be to wrap the fields in content controls:

    <w:p>
      <w:r>
        <w:t xml:space="preserve"> ordinary text before the field</w:t>
      </w:r>
      <!-- Outer sdt houses an sdt containing the field code, and optionally, the field result.-->
      <w:sdt>
        <w:sdtPr>
          <w:tag w:val="fieldname=field1"/>
        </w:sdtPr>
        <w:sdtContent>
          <!-- Inner sdt houses the field code.-->
          <w:sdt>
            <w:sdtPr>
              <w:tag w:val="fieldcode"/>
            </w:sdtPr>
            <w:sdtContent>
              <w:r>
                <w:fldChar w:fldCharType="begin"/>
              </w:r>
              <w:r>
                <w:instrText xml:space="preserve"> DATE </w:instrText>
              </w:r>
              <w:r>
                <w:fldChar w:fldCharType="end"/>
              </w:r>
            </w:sdtContent>
          </w:sdt>
          <!-- Optional sdt houses the field result.-->
          <w:sdt>
            <w:sdtPr>
              <w:tag w:val="fieldresult"/>
            </w:sdtPr>
            <w:sdtContent>
              <w:r>
                <w:t>12/31/2005</w:t>
              </w:r>
            </w:sdtContent>
          </w:sdt>
        </w:sdtContent>
      </w:sdt>
      <w:r>
        <w:t xml:space="preserve"> ordinary text following the field</w:t>
      </w:r>
    </w:p>
 

or possibly

    <w:p>
      <w:r>
        <w:t xml:space="preserve"> ordinary text before the field</w:t>
      </w:r>
      <!-- Outer sdt houses an sdt containing the field code, and optionally, the field result.-->
      <w:sdt>
        <w:sdtPr>
          <w:tag w:val="fieldname=field1"/>
        </w:sdtPr>
        <w:sdtContent>
          <w:r>
            <w:fldChar w:fldCharType="begin"/>
          </w:r>
          <!-- Inner sdt houses the field code.-->
          <w:sdt>
            <w:sdtPr>
              <w:tag w:val="fieldcode"/>
            </w:sdtPr>
            <w:sdtContent>
              <w:r>
                <w:instrText xml:space="preserve"> DATE </w:instrText>
              </w:r>
            </w:sdtContent>
          </w:sdt>
          <w:r>
            <w:fldChar w:fldCharType="end"/>
          </w:r>
          <!-- Optional sdt houses the field result.-->
          <w:sdt>
            <w:sdtPr>
              <w:tag w:val="fieldresult"/>
            </w:sdtPr>
            <w:sdtContent>
              <w:r>
                <w:t>12/31/2005</w:t>
              </w:r>
            </w:sdtContent>
          </w:sdt>
        </w:sdtContent>
      </w:sdt>
      <w:r>
        <w:t xml:space="preserve"> ordinary text following the field</w:t>
      </w:r>
    </w:p>
 

This has the benefit of giving you a representation in the document that is easier to work with, in the sense that there is a single object you can reference (the outer content control), and it lends itself to processing via JAXB traversal or via XSLT.

It would support nested fields (by nesting them within the field code sdt).

Of course, if you are going to introduce content controls, it raises the question of why you'd use field codes at all (search for content control data binding). The controls would only be temporary here, since there is no reason to save them in any final docx. Field codes still make sense if you are stuck with them in a corpus of input docx you don't control, or if they are mandated in your output docx.

This approach might allow some fusion with docx4j's existing support for content control databinding.

Of course, you still need code which does the content control wrapping. A proof of concept would be pretty easy if you ignored the possibility of nested fields (even though these are a major motivation for this representation).
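As a very rough proof of concept of the wrapping step, here is a sketch using the JDK's DOM classes on the raw XML rather than docx4j's JAXB objects. It is simpler than either layout above: it only creates the outer `w:sdt` around each begin...end run sequence, with no inner fieldcode/fieldresult controls. `SdtWrapSketch` and the `fieldname=fieldN` numbering are assumptions for illustration. A nested field is kept inside its outer field's control rather than wrapped separately, and fields spanning more than one paragraph are not handled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class SdtWrapSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns begin/separate/end if the node is a w:r holding a w:fldChar, else null. */
    private static String fldCharType(Node n) {
        if (!(n instanceof Element)) return null;
        Element e = (Element) n;
        if (!W.equals(e.getNamespaceURI()) || !"r".equals(e.getLocalName())) return null;
        NodeList fc = e.getElementsByTagNameNS(W, "fldChar");
        return fc.getLength() == 0 ? null : ((Element) fc.item(0)).getAttributeNS(W, "fldCharType");
    }

    static void wrapFields(Document doc) {
        NodeList paras = doc.getElementsByTagNameNS(W, "p");
        int counter = 0;
        for (int i = 0; i < paras.getLength(); i++) {
            Element p = (Element) paras.item(i);
            List<Node> children = new ArrayList<>(); // snapshot, since we move nodes below
            for (Node n = p.getFirstChild(); n != null; n = n.getNextSibling()) children.add(n);

            List<Node> pending = null;
            int depth = 0;
            for (Node n : children) {
                String type = fldCharType(n);
                if ("begin".equals(type) && depth++ == 0) pending = new ArrayList<>();
                if (pending != null) pending.add(n);
                if ("end".equals(type) && depth > 0 && --depth == 0) {
                    Element sdt = doc.createElementNS(W, "w:sdt");
                    Element pr = doc.createElementNS(W, "w:sdtPr");
                    Element tag = doc.createElementNS(W, "w:tag");
                    tag.setAttributeNS(W, "w:val", "fieldname=field" + (++counter));
                    Element content = doc.createElementNS(W, "w:sdtContent");
                    pr.appendChild(tag);
                    sdt.appendChild(pr);
                    sdt.appendChild(content);
                    p.insertBefore(sdt, pending.get(0));
                    for (Node m : pending) content.appendChild(m); // appendChild moves the run
                    pending = null;
                }
            }
        }
    }

    /** Convenience wrapper: parse, wrap, and serialize again. */
    static String wrap(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            wrapFields(doc);
            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not process document XML", e);
        }
    }
}
```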

My XML is a representation for a complex field. But per the spec:
Every simple field implementation for a given field has a corresponding complex field implementation.

Re: using docx4j w/ word 2003 merge fields

Posted: Sat Apr 06, 2013 10:13 am
by jason
For the info of anyone who stumbles on this thread via Google, docx4j has supported MERGEFIELD for some time now (over a year at the time of writing).

v3.0.0 will include better support for formatting strings.