
Docx Xhtml Importer and URLs/Hyperlinks

Posted: Wed Nov 25, 2015 4:24 am
by solid
I have an HTML document with
Code:
<a href="http://example.com/resource/RESOURCEID">RESOURCEID</a>

in it. I'm not sure why, but when I import the XHTML document using the import tool, it converts these URLs to lowercase (which my server does not recognize). I need to preserve the original URL after docx conversion. Can someone please tell me what I am doing wrong? Here is my code:
Code:
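   import java.io.ByteArrayOutputStream;
   import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
   import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

   // XHTML input containing the mixed-case hyperlink
   String xhtml = "<html><body>Click your url here <a href=\"http://example.com/resource/RESOURCEID\">RESOURCEID</a></body></html>";

   ByteArrayOutputStream baos = new ByteArrayOutputStream();
   WordprocessingMLPackage doc = WordprocessingMLPackage.createPackage();

   XHTMLImporterImpl xhtmlImporter = new XHTMLImporterImpl(doc);
   xhtmlImporter.setHyperlinkStyle("Hyperlink");

   // RestUtils.getBaseUrl() is my own helper; it supplies the base URL for the import
   doc.getMainDocumentPart().getContent().addAll(
         xhtmlImporter.convert(xhtml, RestUtils.getBaseUrl()));

   // Write the resulting docx to the in-memory stream
   doc.save(baos);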



Re: Docx Xhtml Importer and URLs/Hyperlinks

Posted: Wed Nov 25, 2015 8:15 am
by solid
Also, I just identified another issue with docx4j's import of HTML containing anchor tags (a). If the URL contains a URL-encoded value such as %252F, the importer converts it back to the original (unencoded) character, which in this case is a UTF-8 slash.
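For reference, %252F is a doubly encoded slash: one decoding pass yields %2F, and only a second pass yields "/". The following is a minimal sketch of that behaviour using only the JDK's java.net.URLDecoder; the class name PercentDecodeDemo and the method decodeOnce are made up for illustration, and nothing here claims to be what docx4j or Word does internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, for illustration only.
final class PercentDecodeDemo {

    // Applies exactly one pass of percent-decoding using UTF-8.
    // Note: URLDecoder also turns '+' into a space (form encoding),
    // which does not matter for the sample values used here.
    static String decodeOnce(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%252F";
        String once = decodeOnce(encoded);   // one pass
        String twice = decodeOnce(once);     // two passes
        System.out.println(encoded + " -> " + once + " -> " + twice);
    }
}
```

Comparing the href in the source XHTML against the value shown in Word after each stage indicates how many decoding passes have been applied along the way.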

Re: Docx Xhtml Importer and URLs/Hyperlinks

Posted: Thu Nov 26, 2015 7:02 am
by jason
See the code at https://github.com/plutext/docx4j-Impor ... java#L2113

Typically the input to that method is s.getElement().getAttribute("href"), so the behaviour you describe must be happening either before or after it.

Are you running a tidy program?
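
One way to narrow down where the change happens is to look at the hyperlink target actually stored in the saved .docx, independently of how Word displays it. A .docx is a zip archive, and external hyperlink targets for the main document are normally kept in word/_rels/document.xml.rels. The sketch below uses only JDK classes; the class name HyperlinkTargets and its method are hypothetical, and it assumes the relationships file uses unprefixed Relationship elements.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper, for illustration only.
final class HyperlinkTargets {

    // Relationships part of the main document inside the docx zip.
    static final String RELS_ENTRY = "word/_rels/document.xml.rels";

    // Returns the Target of every hyperlink relationship found in the docx bytes.
    static List<String> hyperlinkTargets(byte[] docxBytes) {
        List<String> targets = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!RELS_ENTRY.equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot close the zip stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    xml.write(buffer, 0, read);
                }
                Document dom = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList rels = dom.getElementsByTagName("Relationship");
                for (int i = 0; i < rels.getLength(); i++) {
                    Element rel = (Element) rels.item(i);
                    // Hyperlink relationships have a Type URI ending in "/hyperlink".
                    if (rel.getAttribute("Type").endsWith("/hyperlink")) {
                        targets.add(rel.getAttribute("Target"));
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not read relationships from docx", e);
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java HyperlinkTargets.java path/to/file.docx
        for (String target : hyperlinkTargets(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(target);
        }
    }
}
```

If the target printed here still has the original case and encoding, the docx itself is intact and any difference seen in a tooltip comes from the application displaying it. With the code from the first post, baos.toByteArray() could be passed to hyperlinkTargets directly.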

Re: Docx Xhtml Importer and URLs/Hyperlinks

Posted: Sat Nov 28, 2015 2:52 am
by solid
I am not running any tidy program. After getting your response (thank you so much, by the way) I went back and tested this again. When I mouse over the link, the tooltip in Word definitely displays the URL in all lower case with the unencoded UTF-8 slash. But if I click on the link (Ctrl+click), it does attempt to go to the correct URL in Chrome (which is my default web browser). However, I now get a SAML validation error, which appears to affect only Word. If I open the same document in LibreOffice and Ctrl+click the link, it opens the correct URL in Chrome with no SAML validation errors.

This does not appear to be a docx4j problem. By the way, thank you for your work on this project. It is an outstanding API and vastly superior to that other API.