I've looked at your PDF problems:
- re fonts, see detail below
- re header/footer - these don't seem to be handled in DocX2HTML.xslt; if they were, it looks like xhtmlrenderer could deal with them
- re <w:br w:type="page"/>, this should and does translate into <br style="page-break-after:always"> (or -before?) at line 3439, and there are xhtmlrenderer posts which say that this is honoured. So I'm not sure where things are going wrong.
As I think I've said before, I don't like using DocX2HTML.xslt to create PDF. I didn't write that xslt - someone at Microsoft did - and I find it difficult to follow. That said, the font problems are mine ...
So, tomorrow morning I'll look at creating a second PDF output method using iText. I'll see how far I get in a couple of hours; it won't try to do anything smart with fonts. It may be enough for you or someone else to think it worth expanding.
Back to the fonts ... There are two things to note up front:
1. PDF output is via HTML, so it is useful to look at the intermediate HTML output
(one way to do this is to open the document in docx4all, then export as HTML)
2. there is a font substitution mechanism which tries to use the closest font available on the local system, where closeness is measured by Panose. Sometimes this yields an imperfect result.
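To make "closeness measured by Panose" concrete, here is a minimal sketch of nearest-font selection. It is not the actual docx4j matching code: the class name, the plain sum-of-differences metric and the sample Panose digits are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT docx4j's real substitution code.
class PanoseMatchSketch {

    // Assumed metric: sum of absolute per-digit differences
    // between two 10-digit Panose numbers.
    static int distance(int[] a, int[] b) {
        if (a.length != 10 || b.length != 10) {
            throw new IllegalArgumentException("Panose numbers have 10 digits");
        }
        int total = 0;
        for (int i = 0; i < 10; i++) {
            total += Math.abs(a[i] - b[i]);
        }
        return total;
    }

    // Returns the name of the installed font with the smallest distance,
    // or null when no fonts are installed.
    static String closest(int[] wanted, Map<String, int[]> installed) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, int[]> e : installed.entrySet()) {
            int d = distance(wanted, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative digits only; they were not read from real font files.
        int[] wanted = {2, 11, 6, 4, 2, 2, 2, 2, 2, 4};
        Map<String, int[]> installed = new LinkedHashMap<>();
        installed.put("SansCandidate", new int[] {2, 11, 6, 3, 3, 8, 4, 2, 2, 4});
        installed.put("SerifCandidate", new int[] {2, 2, 6, 3, 5, 4, 5, 2, 3, 4});
        System.out.println(closest(wanted, installed));
    }
}
```

Note that a match of this kind only compares the visual classification of two fonts. It says nothing about which characters the chosen font contains, which is one way the result can be imperfect.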
The font is supposed to be substituted at the HTML stage, and then embedded at the PDF stage.
WordprocessingMLPackage.pdf() does font embedding following https://xhtmlrenderer.dev.java.net/guid ... tml#xil_32
In my example document in Word, I used the fonts Arial Unicode MS, Calibri and Times New Roman.
I explicitly applied the font to the text run, so the hard coded default at line 6299 of DocX2Html.xsl isn't used.
On both Windows and Linux, I saw only š (U+0161) in the PDF output.
Although all three letters appear in the intermediate HTML output, I think this only tells you that the UTF-8 character made it into HTML unscathed.
In my case (Linux), Lucida Sans Typewriter replaced arialunicodems in the HTML, and was embedded in the PDF.
Similarly DejaVuSerif replaced timesnewroman.
No substitute was available for Calibri.
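If you want to check which fonts ended up in your own intermediate HTML, a rough sketch like the following lists the font-family values it finds. It assumes the exported HTML carries fonts as CSS font-family declarations (in style attributes or a style block); the class name and the regular expression are mine, not part of docx4j.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting exported HTML; not part of docx4j.
class FontFamilyScanSketch {

    // Captures the value of a CSS font-family declaration up to the next
    // ';', '"', '}' or '<'. The value is returned verbatim (trimmed).
    private static final Pattern FONT_FAMILY =
            Pattern.compile("font-family\\s*:\\s*([^;\"}<]+)", Pattern.CASE_INSENSITIVE);

    // Returns the distinct font-family values in order of first appearance.
    static Set<String> fontFamilies(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = FONT_FAMILY.matcher(html);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample markup; replace with the contents of your exported file.
        String html = "<span style=\"font-family:DejaVuSerif;font-size:12pt\">a</span>"
                + "<span style=\"font-family: Lucida Sans Typewriter\">b</span>";
        System.out.println(fontFamilies(html));
    }
}
```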
Do you know whether your letters are available in these fonts? If not, that would explain the behaviour.
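One way to test this is with the standard java.awt.Font.canDisplay(int) method. The sketch below is only an approximation: it checks the font as Java's AWT resolves it, which may not be the same font file that ends up embedded in the PDF. The class name and the command-line arguments are hypothetical.

```java
import java.awt.Font;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper; not part of docx4j or xhtmlrenderer.
class GlyphCoverageSketch {

    // Returns the code points of text that the predicate rejects,
    // formatted as "U+XXXX".
    static List<String> missing(String text, IntPredicate canDisplay) {
        List<String> result = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !canDisplay.test(cp))
            .forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    public static void main(String[] args) {
        // Usage: java GlyphCoverageSketch "<font name>" "<text to check>"
        String fontName = args.length > 0 ? args[0] : "Serif";
        String text = args.length > 1 ? args[1] : "\u0161"; // š by default
        Font font = new Font(fontName, Font.PLAIN, 12);
        // If the name is unknown, Java falls back to a logical font,
        // so print the family that was actually resolved.
        System.out.println("Resolved family: " + font.getFamily());
        System.out.println("Missing: " + missing(text, cp -> font.canDisplay(cp)));
    }
}
```

Run it once per substituted font name with your letters as the second argument; an empty "Missing" list means AWT believes the font has a glyph for every character.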
Anyway, I've made a couple of minor fixes, so you could try the latest SVN (on Windows) and see whether it helps.