How to skip table of Contents during parsing

Posted: Wed Dec 05, 2012 3:03 am
by zkoppanylist
Hi Everybody,

I parse Word documents and transfer them into a wiki. It would be nice to recognize the table of contents of a docx document, so that its content can be skipped and a wiki plugin can generate the table of contents instead.

How can I implement that?

Thanks in advance.

Zsolt

Re: How to skip table of Contents during parsing

Posted: Wed Dec 05, 2012 7:11 am
by jason
What approach are you using to parse/transfer into wiki? (XSLT, or something else?)

Of course, you need to detect the start of the ToC and the end of it. You can unzip a docx and have a look at the XML.

To continue this thread, please paste the XML of a short ToC here.

Re: How to skip table of Contents during parsing

Posted: Wed Dec 05, 2012 6:54 pm
by zkoppanylist
Hi Jason,

I use code similar to the following to traverse the document (source code attached).

Body body = wmlDocumentEl.getBody();

new TraversalUtil(body, this);

Unfortunately, I cannot provide the entire document, and I also have difficulty editing it with vi or gedit, but here are some lines:

<w:tab w:val="clear" w:pos="9072"/></w:tabs><w:spacing w:line="260" w:lineRule="exact"/><w:rPr><w:rFonts w:cs="Arial"/></w:rPr>
<w:sectPr w:rsidR="007A688C" w:rsidRPr="009D3F87" w:rsidSect="00A52790">
<w:headerReference w:type="default" r:id="rId9"/><w:footerReference w:type="default" r:id="rId10"/>
<w:headerReference w:type="first" r:id="rId11"/><w:footerReference w:type="first" r:id="rId12"/>
<w:pgSz w:w="11906" w:h="16838" w:code="9"/>
<w:pgMar w:top="652" w:right="624" w:bottom="652" w:left="1418" w:header="652" w:footer="652" w:gutter="0"/>
<w:cols w:space="720"/><w:titlePg/><w:docGrid w:linePitch="299"/>
</w:sectPr></w:pPr></w:p>
<w:p w14:paraId="680332F4" w14:textId="77777777" w:rsidR="007A688C" w:rsidRPr="008F21AA" w:rsidRDefault="007A688C" w:rsidP="007A688C">
<w:pPr><w:rPr><w:rFonts w:cs="Arial"/><w:b/><w:bCs/></w:rPr></w:pPr>
<w:r w:rsidRPr="008F21AA"><w:rPr><w:rFonts w:cs="Arial"/><w:b/><w:bCs/></w:rPr><w:lastRenderedPageBreak/><w:t xml:space="preserve">Inhaltsverzeichnis </w:t></w:r>
</w:p>
<w:p w14:paraId="5DC6A76F" w14:textId="77777777" w:rsidR="0075566C" w:rsidRDefault="007A688C">
<w:pPr><w:pStyle w:val="Verzeichnis1"/><w:tabs><w:tab w:val="left" w:pos="480"/><w:tab w:val="right" w:leader="dot" w:pos="9855"/></w:tabs><w:rPr><w:rFonts w:asciiTheme="minorHAnsi" w:eastAsiaTheme="minorEastAsia" w:hAnsiTheme="minorHAnsi" w:cstheme="minorBidi"/><w:noProof/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:eastAsia="de-DE"/></w:rPr></w:pPr>
<w:r w:rsidRPr="008F21AA"><w:rPr><w:rFonts w:cs="Arial"/></w:rPr><w:fldChar w:fldCharType="begin"/></w:r>
<w:r w:rsidRPr="008F21AA"><w:rPr><w:rFonts w:cs="Arial"/></w:rPr><w:instrText xml:space="preserve"> TOC \o "1-5" \h \z </w:instrText></w:r>
<w:r w:rsidRPr="008F21AA"><w:rPr><w:rFonts w:cs=

When processing each paragraph, I find content like this:

Inhaltsverzeichnis
TOC \o "1-5" \h \z 1Allgemeines PAGEREF _Toc335729390 \h 8
1.1Projektorganisation PAGEREF _Toc335729391 \h 8
1.1.1Projektverantwortliche/ Projektteam MAHLE PAGEREF _Toc335729392 \h 8

and a lot of other lines from the table of contents.

The parsing itself is correct; however, I would like to drop those lines and insert my wiki plugin to show that information instead.
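
For reference, here is a minimal sketch of one way the start and end of the ToC might be detected. It uses only the JDK's DOM parser on the raw document XML, not docx4j. In the XML above, the ToC opens with a w:fldChar of type "begin" followed by a w:instrText that starts with TOC, and the PAGEREF text in the paragraph output suggests each entry carries its own nested field. The sketch therefore counts open fields and drops every paragraph during which the TOC field is open. The class name TocSkipper and the method paragraphsOutsideToc are hypothetical names chosen for illustration. The sketch assumes the ToC is a complex field (w:fldChar runs), not a w:fldSimple.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of docx4j.
class TocSkipper {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:p that does not overlap an open TOC field. */
    static List<String> paragraphsOutsideToc(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }

        List<String> kept = new ArrayList<>();
        int depth = 0;    // number of fields currently open
        int tocDepth = 0; // depth at which the TOC field was opened; 0 = not in a ToC

        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            boolean overlapsToc = tocDepth > 0; // ToC still open from an earlier paragraph
            StringBuilder text = new StringBuilder();

            // Descendants are returned in document order.
            NodeList nodes = p.getElementsByTagNameNS(W_NS, "*");
            for (int j = 0; j < nodes.getLength(); j++) {
                Element e = (Element) nodes.item(j);
                String name = e.getLocalName();
                if ("fldChar".equals(name)) {
                    String type = e.getAttributeNS(W_NS, "fldCharType");
                    if ("begin".equals(type)) {
                        depth++;
                    } else if ("end".equals(type)) {
                        if (depth == tocDepth) {
                            tocDepth = 0; // this "end" closes the TOC field itself
                        }
                        if (depth > 0) {
                            depth--;
                        }
                    }
                } else if ("instrText".equals(name)) {
                    // Nested PAGEREF instructions are ignored while a ToC is open.
                    if (tocDepth == 0 && depth > 0
                            && e.getTextContent().trim().startsWith("TOC")) {
                        tocDepth = depth;
                        overlapsToc = true;
                    }
                } else if ("t".equals(name)) {
                    text.append(e.getTextContent());
                }
            }
            if (!overlapsToc) {
                kept.add(text.toString());
            }
        }
        return kept;
    }
}
```

Note that the "Inhaltsverzeichnis" heading sits in its own paragraph before the field begins, so this sketch keeps it; it would have to be filtered separately. The same begin/end counting could presumably be done inside the TraversalUtil callback instead of on the raw XML, but that variant is not shown here.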

Thanks!

Regards,

Zsolt