Get position of CTShape/Pict/CTImageData

PostPosted: Tue Apr 12, 2011 12:10 am
by robertw
Hi,

I'm new to docx4j and I'm currently using it to load and evaluate DOCX documents. The documents I need to read contain a couple of images that may contain labels above (z-level-above) the images.

Now, parsing the files, I realized that the order in which elements appear in MS Word is not the same as the order I get from TraversalUtil. E.g., I constructed an example document with two images next to each other (image "A" left, image "B" right). Both images have labels: image "A" is labeled "L1" and image "B" is labeled "R2". Traversing the document, I get the images in the order they appear (first "A", then "B"); after that, however, I get the label "R2" (which belongs to image "B") and then the label "L1" (which belongs to image "A"). With text paragraphs it gets even more complicated, as the paragraphs may be returned in any order.

So, my idea is to use the location of the images (CTShape/Pict/CTImageData) and the texts (P) to associate the label texts with the images. But I don't know how to get the location of a P, Pict, CTShape, or CTImageData element. While debugging I saw that CTImageData has a private parent field, which is a JAXBElement whose style at least contains width and height. But I don't know how to get this information, as there's no getter for the parent on CTImageData.

Can anyone help me, please?

Regards
Robert

Re: Get position of CTShape/Pict/CTImageData

PostPosted: Tue Apr 12, 2011 12:44 am
by jason
robertw wrote: With text paragraphs it gets even more complicated, as the paragraphs may even be returned in any order.


TraversalUtil traverses the docx in document order (i.e. in the order the elements appear in the OpenXML); paragraphs should be returned in order.

If you could post the XML for one of your images and its associated label, including the w:p they are contained in, that would help make this discussion more concrete. (By this I mean unzipping the docx, opening the document.xml part in a text editor, and copying the relevant XML.)

cheers .. Jason

Re: Get position of CTShape/Pict/CTImageData

PostPosted: Wed Apr 13, 2011 7:34 pm
by robertw
Hi Jason,

I've prepared the example docx file and attached it to this post (the attached document.xml is from that docx file). It contains (left2right, top2bottom):
- 1 company logo
- 2 images ("A" and "B")
- 1 red background image
- 2 image labels ("L1" and "R2")
- 1 line "Some Title"
- 1 line "A Subtitle"
- 1 "Company" name
- 1 Table containing
  - 1 line with a date range
  - 1 line with a slogan

Since my own code uses an extended version of docx4j's TraversalUtil class, I tested the attached file on the shell with the OpenMainDocumentAndTraverse example (which uses the original TraversalUtil) to make sure none of my changes altered the order in which the elements are traversed.
I got this order from OpenMainDocumentAndTraverse:
- 1 Table (with the two lines for date range and slogan in order)
- a paragraph with
  - 2 CTShapes with CTImageData (probably the company logo and image "A")
  - 1 Pict with no CTImageData (might be the red background image)
  - 1 CTShape with CTImageData (probably image "B")
- a paragraph with 2 R/Pict/CTShape/CTTextbox elements each representing 1 image label in reverse order: first "R2", second "L1"
- a paragraph with the title
- a paragraph with the subtitle
- a paragraph with 2 org.docx4j.wml.R elements
  - "Company" line
  - 1 Pict with no CTImageData (might be the red background image)

As you can see, the table is traversed first, although it appears on the lower right in the document. Also, the image labels, although in the same paragraph, are traversed in reverse order.

I hope the attached files help you to help me with my challenge: How could I determine the location of the images in the document as well as the location of the image labels to do some magical distance calculation for associating them with each other? Is it even possible?

cheers
Rob

Re: Get position of CTShape/Pict/CTImageData

PostPosted: Thu Apr 14, 2011 1:57 am
by jason
As I said, the traversal returns the elements in the order they appear in document.xml

I've looked at your document.xml, and here is what I found:

- the Table, which includes

Code: Select all
<w:framePr w:w="3402" w:h="1701" w:hSpace="142" w:wrap="around" w:vAnchor="page" w:hAnchor="page" w:x="7939" w:y="12702"/>


- a paragraph with 4 picts;

the first is "image A":

Code: Select all
        <w:pict>
          <v:shape id="_x0000_s1217" type="#_x0000_t75" style="position:absolute;margin-left:4.05pt;margin-top:59.5pt;width:198.55pt;height:301.9pt;z-index:-251661312;mso-position-horizontal-relative:page;mso-position-vertical-relative:page">
            <v:imagedata r:id="rId9" o:title="06_04_Fis_E193_06_21871_jpg_preview_jpeg_preview" cropbottom="3914f" cropright="4460f"/>
            <w10:wrap anchorx="page" anchory="page"/>
          </v:shape>
        </w:pict>


the second is "image B":

Code: Select all
        <w:pict>
          <v:shape id="_x0000_s1218" type="#_x0000_t75" style="position:absolute;margin-left:154.65pt;margin-top:59.5pt;width:476.25pt;height:301.8pt;z-index:-251662336;mso-position-horizontal-relative:page;mso-position-vertical-relative:page">
            <v:imagedata r:id="rId10" o:title="03_01_TMo_E327_05_2_M_jpg_preview_jpeg_preview" cropbottom="3193f"/>
            <w10:wrap anchorx="page" anchory="page"/>
          </v:shape>
        </w:pict>


the third is the background rectangle (a v:rect with no v:imagedata):

Code: Select all
        <w:pict>
          <v:rect id="_x0000_s1026" style="position:absolute;margin-left:0;margin-top:369.2pt;width:595.3pt;height:441pt;z-index:-251660288;mso-position-horizontal-relative:page;mso-position-vertical-relative:page" fillcolor="#878c96" stroked="f">
            <w10:wrap anchorx="page" anchory="page"/>
          </v:rect>
        </w:pict>


the fourth is the company logo:

Code: Select all
        <w:pict>
          <v:shape id="Logo_Konzept" o:spid="_x0000_s1028" type="#_x0000_t75" style="position:absolute;margin-left:481.1pt;margin-top:19.85pt;width:94.5pt;height:30.2pt;z-index:251657216;mso-position-horizontal-relative:page;mso-position-vertical-relative:page">
            <v:imagedata r:id="rId11" o:title=""/>
            <w10:wrap anchorx="page" anchory="page"/>
          </v:shape>
        </w:pict>


Note that all those picts are absolutely positioned.

- a paragraph with 2 R/Pict/CTShape/CTTextbox elements each absolutely positioned
first "R2":

Code: Select all
        <w:pict>
          :
          <v:shape id="_x0000_s1213" type="#_x0000_t202" style="position:absolute;margin-left:430.5pt;margin-top:52.05pt;width:133.5pt;height:19.1pt;z-index:251661312" filled="f" stroked="f">
            <v:textbox style="mso-next-textbox:#_x0000_s1213">
              <w:txbxContent>


second "L1":

Code: Select all
        <w:pict>
          <v:shape id="_x0000_s1219" type="#_x0000_t202" style="position:absolute;margin-left:-37.5pt;margin-top:52.05pt;width:133.5pt;height:19.1pt;z-index:251662336" filled="f" stroked="f">
            <v:textbox style="mso-next-textbox:#_x0000_s1219">
              <w:txbxContent>


- the block at the lower left, including the red text box, a paragraph with the title, and a paragraph with the subtitle

- the bottom right company text box etc

So the reason things appear on the printed page and in Word's page layout mode in a different order to document.xml (and how docx4j sees things) is largely the absolute positioning.
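
To illustrate the point, here is a rough sketch (plain Java, no docx4j involved) that takes the four picts in document order and sorts them by their margin-top and then margin-left values from the XML above. The class and method names (VisualOrder, Shape, sortedNames) are hypothetical, and the sketch assumes all offsets are in points and relative to the same origin, which holds for these four picts because they are all positioned relative to the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of docx4j.
final class VisualOrder {

    /** A shape reduced to a name and its margin-left / margin-top offsets in points. */
    static final class Shape {
        final String name;
        final double left;
        final double top;

        Shape(String name, double left, double top) {
            this.name = name;
            this.left = left;
            this.top = top;
        }
    }

    /** Returns the shape names ordered top to bottom, then left to right. */
    static List<String> sortedNames(List<Shape> documentOrder) {
        List<Shape> copy = new ArrayList<>(documentOrder); // leave the input untouched
        copy.sort(Comparator.comparingDouble((Shape s) -> s.top)
                .thenComparingDouble(s -> s.left));
        List<String> names = new ArrayList<>();
        for (Shape s : copy) {
            names.add(s.name);
        }
        return names;
    }
}
```

Fed with image A (4.05, 59.5), image B (154.65, 59.5), the rectangle (0, 369.2) and the logo (481.1, 19.85) in that document order, the sort puts the logo first, then A, B and the rectangle, which is the order they appear on the page.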

You can read those values from org.docx4j.vml.CTShape, using getStyle().
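
A minimal sketch of what to do with that string: the style value is a semicolon-separated list of name:value pairs, so it can be split into a map and the offsets read from it. The names below (VmlStyle, parse, points) are made up for illustration and are not docx4j API; the input is assumed to be whatever getStyle() returned. The sketch only handles lengths given in pt or as bare numbers, as in the XML above, and does not convert other units.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of docx4j.
final class VmlStyle {

    /** Splits a VML style value ("name:value;name:value;...") into a map. */
    static Map<String, String> parse(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        if (style == null) {
            return props;
        }
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':'); // split at the first colon only
            if (colon <= 0) {
                continue; // skip empty or malformed declarations
            }
            props.put(declaration.substring(0, colon).trim(),
                      declaration.substring(colon + 1).trim());
        }
        return props;
    }

    /** Reads a length such as "59.5pt" or "0" as a number of points. */
    static double points(Map<String, String> props, String name) {
        String value = props.get(name);
        if (value == null) {
            throw new IllegalArgumentException("style has no property " + name);
        }
        if (value.endsWith("pt")) {
            value = value.substring(0, value.length() - 2);
        }
        return Double.parseDouble(value); // throws for units this sketch does not handle
    }
}
```

For image "A" above, points(parse(style), "margin-left") would give 4.05 and points(parse(style), "margin-top") would give 59.5; width and height can be read the same way.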

So you can work out the location of the images easily enough. What is missing at the XML level is any explicit association between an image and its label. I suppose you can guess which belongs to which, though, from their positions.
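
One way such a guess could be sketched: for each label, pick the image whose top-left corner is nearest. This is only a heuristic under stated assumptions, not something docx4j provides, and the names (LabelMatcher, Box, match) are hypothetical. Two caveats: the label shapes above lack the mso-position-*-relative:page properties that the images carry, so their offsets may be measured from a different origin and might need normalizing first; and the candidate list has to be limited to the shapes that can actually carry a label, since with these numbers the logo would otherwise be the nearest shape to "R2".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of docx4j.
final class LabelMatcher {

    /** A positioned item: name plus margin-left / margin-top in points. */
    static final class Box {
        final String name;
        final double left;
        final double top;

        Box(String name, double left, double top) {
            this.name = name;
            this.left = left;
            this.top = top;
        }
    }

    /** Maps each label name to the name of the image with the nearest top-left corner. */
    static Map<String, String> match(List<Box> images, List<Box> labels) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Box label : labels) {
            Box best = null;
            double bestDistance = Double.MAX_VALUE;
            for (Box image : images) {
                // Euclidean distance between the two top-left corners
                double d = Math.hypot(label.left - image.left, label.top - image.top);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = image;
                }
            }
            if (best != null) {
                result.put(label.name, best.name);
            }
        }
        return result;
    }
}
```

With images A (4.05, 59.5) and B (154.65, 59.5) and labels R2 (430.5, 52.05) and L1 (-37.5, 52.05) taken from the XML above, this pairs R2 with B and L1 with A. Comparing against image centres or bounding boxes (using width and height) may be more robust than comparing corners.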

Re: Get position of CTShape/Pict/CTImageData

PostPosted: Thu Apr 14, 2011 3:22 am
by robertw
Hi Jason,

thanks for your reply. Am I right that margin-left and margin-top define the x/y location of the v:shape? That would explain things to me.

Also, thanks for the hint on how to get the style. I was desperately looking for that method but at the wrong places.

Cheers
Rob