
Docx4j - Extract to XHTML - Wrong TOC page numbering

Posted: Wed Apr 05, 2017 3:07 am
by bef
Hi,

I'm using docx4j in my project to convert DOCX content to XHTML.
It works as expected, except for the TOC.
My DOCX file contains a TOC:
[Attachment: word_toc.png]


Table des matières
Approbation du procès-verbal du Bureau du 00 mois 0000 (N°000) 4
Situation générale et financière 5
Exemple de Chapitre 6
Exemple de Section 6
Questions diverses 7



When I call Docx4J.toHTML, I get this result:
[Attachment: xhtml_docx4j_toc.png]


Code:
  <p class="En-ttedetabledesmatires Titre1 Normal DocDefaults "><span class="" style="font-family: 'Cambria';">Table des matières</span></p>
 
  <p class="TM2 Normal DocDefaults "><a href="#_Toc472500769"><span class="Lienhypertexte Policepardfaut " style="font-family: 'Times New Roman';">Approbation du procès-verbal du Bureau du 00 mois 0000 (N°000)</span><span class="Policepardfaut ">   </span><a href="#_Toc472500769"><span>1</span></a></a></p>
 
  <p class="TM2 Normal DocDefaults "><a href="#_Toc472500770"><span class="Lienhypertexte Policepardfaut " style="font-family: 'Times New Roman';">Situation générale et financière</span><span class="Policepardfaut ">   </span><a href="#_Toc472500770"><span>1</span></a></a></p>
 
  <p class="TM2 Normal DocDefaults "><a href="#_Toc472500771"><span class="Lienhypertexte Policepardfaut " style="font-family: 'Times New Roman';">Exemple de Chapitre</span><span class="Policepardfaut ">   </span><a href="#_Toc472500771"><span>1</span></a></a></p>
 
  <p class="TM3 Normal DocDefaults "><a href="#_Toc472500772"><span class="Lienhypertexte Policepardfaut " style="font-family: 'Times New Roman';">Exemple de Section</span><span class="Policepardfaut ">   </span><a href="#_Toc472500772"><span>1</span></a></a></p>
 
  <p class="TM2 Normal DocDefaults "><a href="#_Toc472500773"><span class="Lienhypertexte Policepardfaut " style="font-family: 'Times New Roman';">Questions diverses</span><span class="Policepardfaut ">   </span><a href="#_Toc472500773"><span>1</span></a></a></p>



The page number is always 1.

Is there a way to preserve the original page numbering from the DOCX TOC?

Thanks in advance.
Regards

Re: Docx4j - Extract to XHTML - Wrong TOC page numbering

Posted: Thu Apr 13, 2017 1:57 pm
by jason
We currently don't implement CSS Paged Media, so the HTML is treated as a single page.

See https://github.com/plutext/docx4j/blob/ ... r.java#L41

A page number value is given in the field result XML:

Code:
            <w:r>
              <w:fldChar w:fldCharType="begin"/>
              <w:instrText xml:space="preserve">PAGEREF _Toc6429102 \h</w:instrText>
              <w:fldChar w:fldCharType="separate"/>
              <w:t>2</w:t>
              <w:fldChar w:fldCharType="end"/>
            </w:r>
 


so you could use that, but you'll need to dig into the code to do so.
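
As a rough illustration of reading that cached value, here is a minimal sketch that does not touch docx4j's HTML exporter at all. It assumes you already have the WordprocessingML as a string (for example, the word/document.xml entry of the .docx zip) and uses only the JDK's XML parser. The class name PageRefResults and its extract method are hypothetical, made up for this example. The value it returns is whatever Word stored the last time the fields were updated, so it can be stale.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of docx4j.
final class PageRefResults {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // One open complex field: its instruction text and its cached result text.
    private static final class Field {
        final StringBuilder instr = new StringBuilder();
        final StringBuilder result = new StringBuilder();
        boolean inResult;
    }

    /** Maps each PAGEREF bookmark name to the page number cached in its field result. */
    static Map<String, String> extract(String wordMl) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList elements = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wordMl)))
                .getElementsByTagNameNS(W_NS, "*"); // every w:* element, in document order

        Map<String, String> pages = new LinkedHashMap<>();
        Deque<Field> open = new ArrayDeque<>(); // fields can nest (PAGEREF inside TOC)
        for (int i = 0; i < elements.getLength(); i++) {
            Element el = (Element) elements.item(i);
            String name = el.getLocalName();
            if ("fldChar".equals(name)) {
                String type = el.getAttributeNS(W_NS, "fldCharType");
                if ("begin".equals(type)) {
                    open.push(new Field());
                } else if ("separate".equals(type) && !open.isEmpty()) {
                    open.peek().inResult = true;
                } else if ("end".equals(type) && !open.isEmpty()) {
                    Field done = open.pop();
                    // Instruction looks like: PAGEREF _Toc6429102 \h
                    String[] tokens = done.instr.toString().trim().split("\\s+");
                    if (tokens.length >= 2 && "PAGEREF".equalsIgnoreCase(tokens[0])) {
                        pages.put(tokens[1], done.result.toString().trim());
                    }
                }
            } else if ("instrText".equals(name) && !open.isEmpty() && !open.peek().inResult) {
                open.peek().instr.append(el.getTextContent());
            } else if ("t".equals(name) && !open.isEmpty() && open.peek().inResult) {
                open.peek().result.append(el.getTextContent());
            }
        }
        return pages;
    }
}
```

The keys of the returned map are bookmark names such as _Toc6429102, which is the same form as the href="#_Toc..." anchors in the generated XHTML, so one possible approach is to post-process the HTML and swap each "1" for the cached value of the matching bookmark.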

Re: Docx4j - Extract to XHTML - Wrong TOC page numbering

Posted: Thu Apr 13, 2017 6:18 pm
by bef
Thanks for the update.

Regards.