
Document merging with Docx4j

Posted: Fri Dec 14, 2012 7:52 pm
by gkurady
I have been investigating docx4j for its capabilities for document merging (document assembly could be a more apt word).

Primarily I was looking at the library for the following aspects:

1. Support for document assembly by copy (equivalent to the MS Word Insert > Object > Text from File > Insert behavior): When we do this in Word, the result is a target document containing all the source contents (including styles, comments, etc.). So it's essentially copy-and-paste behavior, and all the copied source contents in the target document can be manipulated (processed). Is this feature supported by docx4j?

I am looking for the capability to manipulate all the contents in the target document, which this feature provides. My finding is that it's supported through the MergeDocx extension but not directly. Is that right?

2. AltChunks are processed by the word processor itself when the document is opened in MS Word. Does docx4j have the capability to process AltChunks?
My finding here too is that it's supported through the MergeDocx extension.

3. Assembly by reference: Similar to the MS Word Insert > Object > Text from File > Insert as Link behavior. Using docx4j, can we create a target docx which references the source documents? Changes in the source docx can be brought into the target docx with a refresh. This feature also allows manipulation (processing) of all the source contents in the target document.

Does docx4j support this directly? If not, is it supported through the extension?

4. Embedded objects
All three formats, Word, Excel and PowerPoint, can be inserted as embedded objects using OLE. In Word 2007 this is accomplished through Insert > Object > Object... > Create from File.
This leaves each file in its own separate stream inside the .docx.

Does docx4j support this?

Can someone validate my findings and share some thoughts on the questions raised?

Thank you.

Re: Document merging with Docx4j

Posted: Fri Dec 14, 2012 8:48 pm
by jason
gkurady wrote:
1. Support for document assembly by copy (equivalent to the MS Word Insert > Object > Text from File > Insert behavior): When we do this in Word, the result is a target document containing all the source contents (including styles, comments, etc.). So it's essentially copy-and-paste behavior, and all the copied source contents in the target document can be manipulated (processed). Is this feature supported by docx4j?

I am looking for the capability to manipulate all the contents in the target document, which this feature provides. My finding is that it's supported through the MergeDocx extension but not directly. Is that right?


Yes, you can do this using MergeDocx. Or you could write your own code building upon docx4j (which is what MergeDocx does).

gkurady wrote:
2. AltChunks are processed by the word processor itself when the document is opened in MS Word. Does docx4j have the capability to process AltChunks?
My finding here too is that it's supported through the MergeDocx extension.


Docx4j can easily add an AltChunk of type docx, but unless you have MergeDocx, it won't be processed until you open the docx in Word.

Docx4j can process AltChunks of type XHTML without needing MergeDocx.

gkurady wrote:
3. Assembly by reference: Similar to the MS Word Insert > Object > Text from File > Insert as Link behavior. Using docx4j, can we create a target docx which references the source documents? Changes in the source docx can be brought into the target docx with a refresh. This feature also allows manipulation (processing) of all the source contents in the target document.

Does docx4j support this directly? If not, is it supported through the extension?


Off the top of my head, I don't recall how this is implemented in the file format. Perhaps you could post the XML which represents it (or just a docx which does it), and I'll take a quick look.

gkurady wrote:
4. Embedded objects
All three formats, Word, Excel and PowerPoint, can be inserted as embedded objects using OLE. In Word 2007 this is accomplished through Insert > Object > Object... > Create from File.
This leaves each file in its own separate stream inside the .docx.

Does docx4j support this?


The old binary formats (doc, ppt, xls) can be inserted this way (represented using OleObjectBinaryPart, underpinned by POI's POIFSFileSystem).

The Open XML formats (docx, pptx, xlsx) are represented using AlternativeFormatInputPart.

Re: Document merging with Docx4j

Posted: Mon Dec 17, 2012 9:49 pm
by gkurady
Hi Jason,

Thanks for the reply and the valuable input. I really appreciate it.

With this post, I have attached a target 'merge.docx' file that refers to two source docx files, namely Test1.docx and Test2.docx. I couldn't post the XML as the file is oversized. If you look at the document.xml for the file, you can see:

<w:p w:rsidR="001208AA" w:rsidRDefault="001208AA" w:rsidP="002F5980">
<w:r>
<w:fldChar w:fldCharType="begin" />
</w:r>
<w:r>
<w:instrText xml:space="preserve"> INCLUDETEXT "D:\\Temp\\Test1.docx" </w:instrText>
</w:r>
<w:r>
contents of source file Test1.docx
</w:r>
</w:p>

And

<w:p w:rsidR="001208AA" w:rsidRDefault="001208AA" w:rsidP="00292F4A">
<w:pPr>
<w:rPr>
<w:b />
<w:color w:val="FF0000" />
</w:rPr>
</w:pPr>
<w:r>
<w:fldChar w:fldCharType="end" />
</w:r>
<w:r>
<w:fldChar w:fldCharType="begin" />
</w:r>
<w:r>
<w:instrText xml:space="preserve"> INCLUDETEXT "D:\\Temp\\Test2.docx" </w:instrText>
</w:r>
<w:r>
contents of source file Test2.docx
</w:r>
</w:p>


So the referenced files are referred to by path with INCLUDETEXT. The target document XML itself contains all the contents of the source files, which makes me believe that when the user selects the contents to refresh and presses F9 (the refresh key in MS Word), the corresponding contents are retrieved from the source files and rewritten in the target document. This applies not only to the contents but also to anything associated with them, like comments.

So, I was wondering whether this can be achieved with docx4j. I am not sure whether I have provided the information you asked for. Please let me know if not.

Thanks for the help.

Re: Document merging with Docx4j

Posted: Wed Dec 19, 2012 6:38 pm
by jason
Hi, yes INCLUDETEXT "D:\\Temp\\Test1.docx" works in Word 2010 as you describe it.

With docx4j, you have three scenarios.

1. Create an INCLUDETEXT structure, which Word will populate properly on refresh (e.g. via an auto-open macro). You could do this easily enough without MergeDocx.

2. Create an INCLUDETEXT structure, properly populated with current data from the docx. MergeDocx will do this for you; you just need to create the field structure that MergeDocx is to populate.

    <w:p >
      <w:r>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
        <w:instrText xml:space="preserve"> INCLUDETEXT  "C:\\Users\\blagh\\some.docx"  \* MERGEFORMAT </w:instrText>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="separate"/>
      </w:r>
    </w:p>

     <!-- use MergeDocx to write content here -->

    <w:p >
      <w:r>
        <w:fldChar w:fldCharType="end"/>
      </w:r>
    </w:p>

 


3. Read an existing docx file, and update any INCLUDETEXT encountered. If the INCLUDETEXT is of type docx, MergeDocx can help. See 2 above. You'd need to write some code to parse an INCLUDETEXT field. docx4j can help you find them easily.
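For the parsing step in scenario 3, here is a minimal sketch in plain Java (JDK only, no docx4j calls) of how the instruction text of an INCLUDETEXT field might be split into the file path and whatever follows it. The class name IncludeTextInstruction and its methods are hypothetical, made up for illustration. It assumes you have already concatenated the text of the w:instrText runs belonging to one field (Word may split an instruction across several runs), and it does not treat an optional bookmark argument specially: a bookmark simply ends up in the remainder along with the switches.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits an INCLUDETEXT instruction into path and remainder. */
final class IncludeTextInstruction {

    // Keyword, then a quoted argument (group 1) or a bare token (group 2),
    // then everything that follows, trimmed (group 3).
    private static final Pattern FIELD = Pattern.compile(
            "^\\s*INCLUDETEXT\\s+(?:\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+))\\s*(.*?)\\s*$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final String path;
    private final String remainder;

    private IncludeTextInstruction(String path, String remainder) {
        this.path = path;
        this.remainder = remainder;
    }

    /** File path with the field-code escaping (doubled backslashes) undone. */
    String path() {
        return path;
    }

    /** Whatever follows the path: bookmark and/or switches such as \* MERGEFORMAT. */
    String remainder() {
        return remainder;
    }

    /** Returns empty if the instruction is not an INCLUDETEXT field. */
    static Optional<IncludeTextInstruction> parse(String instrText) {
        Matcher m = FIELD.matcher(instrText);
        if (!m.matches()) {
            return Optional.empty();
        }
        String raw = m.group(1) != null ? m.group(1) : m.group(2);
        // In field codes a backslash escapes the next character, so "\\" means "\".
        String unescaped = raw.replaceAll("\\\\(.)", "$1");
        return Optional.of(new IncludeTextInstruction(unescaped, m.group(3)));
    }
}
```

Given the instruction from the XML above, parse(...) would yield the path C:\Users\blagh\some.docx and the remainder \* MERGEFORMAT. Checking that the path ends in .docx before handing it to a merge step is left to the caller.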

Note that it is one thing to populate the INCLUDETEXT structure properly. Downstream applications also need to be able to read it. For example, LibreOffice Writer 3.5.4.2 opens the document fine, but strips out the field codes, so you won't be able to refresh again.

I also expect that docx4j's HTML and PDF output may need to be improved to render the contents of the INCLUDETEXT.

---------

Another option - in addition to the ones you outlined at the start of this thread - is to use a content control to say "insert the contents of this file". That's what the OpenDoPE component model does, as implemented by docx4j (using MergeDocx). This approach makes a lot of sense if you also want document assembly features such as variable insertion, repeats, conditional content, all of which you get out of the box with docx4j.

Re: Document merging with Docx4j

Posted: Thu Dec 20, 2012 1:23 am
by gkurady
Hi Jason,

Thanks again for your input. It gave me much more clarity on the 'merge by reference' aspect of docx4j.

I am primarily looking at the document commentary aspect, where I need the capability to manipulate the comments inside the target document. Both the AltChunk and INCLUDETEXT techniques require the end user to use Word to populate the contents, including commentary (without using MergeDocx, of course :) ).

I will do more research on the content control mechanism and its support for document commentary. Thanks for the idea. If you already know how well commentary is supported with the content control mechanism, your tips would be very useful.

Thanks.

Re: Document merging with Docx4j

Posted: Thu Dec 27, 2012 11:28 pm
by gkurady
Another follow-up question:
Does docx4j have enough support at the low level if we need to process the AltChunks and INCLUDETEXT ourselves?

Thanks.

Re: Document merging with Docx4j

Posted: Fri Dec 28, 2012 6:26 am
by jason
gkurady wrote:Does docx4j have enough support at the low level if we need to process the AltChunks and INCLUDETEXT ourselves?


Yes. docx4j gives you access to and the ability to manipulate everything in the docx file, the only exception which comes to mind being mc:AlternateContent preprocessing, which happens as the docx is being opened.
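As a rough illustration of the first step of processing AltChunks yourself, the sketch below uses only the JDK's DOM parser (not docx4j's own API) to list the relationship ids of the w:altChunk elements in the text of a document.xml. AltChunkFinder and altChunkRelIds are hypothetical names chosen for this example. Resolving each id to its part through the package relationships, and actually merging the content, is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: finds w:altChunk elements in a document.xml string. */
final class AltChunkFinder {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String R_NS =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns the r:id of every w:altChunk, in document order. */
    static List<String> altChunkRelIds(String documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed to match on namespace, not prefix
            dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            NodeList chunks = doc.getElementsByTagNameNS(W_NS, "altChunk");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < chunks.getLength(); i++) {
                Element chunk = (Element) chunks.item(i);
                ids.add(chunk.getAttributeNS(R_NS, "id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }
}
```

Each returned id names a relationship in the main document part, and the relationship's target is the embedded chunk that a processor would have to read and merge in.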

Re: Document merging with Docx4j

Posted: Tue Jan 08, 2013 8:55 pm
by gkurady
jason wrote:Another option - in addition to the ones you outlined at the start of this thread - is to use a content control to say "insert the contents of this file". That's what the OpenDoPE component model does, as implemented by docx4j (using MergeDocx). This approach makes a lot of sense if you also want document assembly features such as variable insertion, repeats, conditional content, all of which you get out of the box with docx4j.

Hi Jason, I am looking at the possibility of using content controls for document assembly. We are not currently seeking variable insertion, conditional content or repeats. We want to achieve assembly and the ability to process all the assembled content, such as adding comments, without having to open the document in Word.

I was going through the OpenDoPE post: http://www.opendope.org/opendope_conventions_v2.3.html. I understand that component document inclusion, using the od:component tag in the sdt, will have the document included in the target document. Doesn't it use altChunks internally? I get this understanding from looking at the fetchComponents method in [ http://www.docx4java.org/svn/docx4j/tru ... ndler.java ], where the sdt is replaced by an altChunk.
If so, content controls probably wouldn't suffice for what I am seeking, i.e. the ability to process the merged content (like adding comments to it) in the target document without opening it in Word.

Can docx4j process (populate) an sdt with a component document without having to use MergeDocx?

Can you let me know whether my findings about content controls are right from a document assembly standpoint? Please let me know if I am missing something.

Thanks.

Re: Document merging with Docx4j

Posted: Wed Jan 09, 2013 10:28 am
by jason
Yes, docx4j processing of od:component uses altChunks of type docx internally.

So, yes, for these to be replaced in the resulting docx, you need something capable of processing them. You can write your own processor, or as you know, you can use Word or MergeDocx. Either of these processors should handle comments correctly as well.

Maybe LibreOffice/OpenOffice can also process an altChunk of type docx? You could try that to see...

Re: Document merging with Docx4j

Posted: Tue Jan 22, 2013 9:28 am
by jason
Whether you use the OpenDoPE component model or not, I think content controls are the best foundation for what I understand you are trying to achieve.