Decoding Masked Characters

Posted: Fri Nov 23, 2018 11:45 pm
by dmekonnen
Greetings,

I'm a new docx4j user, on version 6.0.1. I'm working on a text conversion project for a library of 20+ year old documents in .doc format that use fonts of that era. In Word 2016 I open each document and save it in .docx format. The documents appear fine in Word, but I've encountered odd encoding issues.

After unzipping the .docx file and reviewing word/document.xml, I found that within w:t tags the letter "A" is encoded as 0xF041. I think it's something to do with the older document font: I can create an entirely new document with the font, type in "A", and get this result.

Scanning the text string and applying the bit operation 0x00FF & myChar turns 0xF041 back into 0x41, as expected. How is this situation normally handled? Does docx4j have utilities to decode text like this?
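
For reference, the masking described above could look something like the following in plain Java. This is only a sketch: UnmaskDemo and lowByte are made-up names, not docx4j API. It masks every character it is given, so it is only safe on text known to sit entirely in the 0xF0xx range.

```java
// Illustrative only: UnmaskDemo and lowByte are hypothetical names,
// not part of docx4j.
final class UnmaskDemo {

    /** Keeps only the low byte of each UTF-16 code unit in s. */
    static String lowByte(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            // 0xF041 & 0x00FF == 0x0041, i.e. 'A'
            out.append((char) (s.charAt(i) & 0x00FF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String masked = "\uF041\uF042\uF043";
        System.out.println(lowByte(masked)); // prints "ABC"
    }
}
```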

thank you,

-Daniel

Re: Decoding Masked Characters

Posted: Sat Nov 24, 2018 6:49 am
by jason
That sounds interesting. Are you able to attach one of the original binary .doc files?

Re: Decoding Masked Characters

Posted: Sat Nov 24, 2018 9:02 pm
by dmekonnen
Hi,

Thanks for the response. I have attached two files: one is a single DOC file from the archive, the other is a newly created Word 2016 document containing just the single letter "A", which also demonstrates the problem. I can attach the relevant font as well if it would be helpful.

thanks again,

-Daniel

Re: Decoding Masked Characters

Posted: Tue Nov 27, 2018 3:40 pm
by jason
When I open your folktale doc in Word 2016, I see:

[Image: tmp_folktale.PNG]

Looks strange to me; maybe the GeezNewA font would help?

Does the same sort of masking happen with "normal" fonts like Times New Roman, Calibri or Arial Unicode?

Re: Decoding Masked Characters

Posted: Wed Nov 28, 2018 10:12 pm
by dmekonnen
I've attached the font, though it may not make the document any more intelligible unless you read Amharic :-)

I believe it is encoded as a "Symbol Font". I've attached it zipped, since .TTF was not permitted and the file size was too large.

I did find some old regular English .doc files and saved them to .docx via Word 2016. The issue did occur.

thanks,

-Daniel

Re: Decoding Masked Characters

Posted: Mon Dec 03, 2018 12:46 pm
by jason
I installed the font, and now the docx looks ok.

I think the Amharic characters are encoded correctly in UTF-8.

The font does contain a glyph for "A", but it also contains the same glyph (I presume) at 0xF041:

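        import org.docx4j.fonts.GlyphCheck;
        import org.docx4j.fonts.PhysicalFont;
        import org.docx4j.fonts.PhysicalFonts;

        public class GlyphProbe {

            public static void main(String[] args) throws Exception {

                // Scan the fonts installed on this machine
                PhysicalFonts.discoverPhysicalFonts();
                PhysicalFont physicalFont = PhysicalFonts.get("GeezNewA");

                if (physicalFont == null) {
                    System.out.println("missing font");
                } else {
                    // Is there a glyph at the shifted code point?
                    System.out.println(GlyphCheck.hasChar(physicalFont, '\uf041'));
                    // Is there a glyph for the ordinary letter?
                    System.out.println(GlyphCheck.hasChar(physicalFont, 'A'));
                }
            }
        }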
 

So I think your doc to docx conversion is using '\uf041' instead of 'A'.

(I guess the original characters in the binary doc are also '\uf041' not 'A', but maybe not)

In any case, I think Word 2016 is happily displaying '\uf041', rather than performing the bit operation.

If you did want to use docx4j to convert characters in the range \uf041 to \uf05a, say, to the normal A-Z (\u0041 to \u005a), whether to make the underlying XML more readable or so you can use other fonts, this would be straightforward enough:

- you traverse the content tree applying your bit operation to characters in that range: see for example https://github.com/plutext/docx4j/blob/ ... .java#L275

- you probably want to process the main document part, header/footer parts, and footnote/endnote parts
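
As a rough sketch of the per-character step, here is a plain-Java helper that applies the bit operation only inside the \uf041 to \uf05a range and leaves everything else (including correctly encoded Amharic) untouched. SymbolRangeUnmasker and unmask are hypothetical names, not docx4j API; hooking the helper up to the text nodes found while traversing each part is not shown here.

```java
// Illustrative only: SymbolRangeUnmasker is a hypothetical helper,
// not part of docx4j. It works on plain strings.
final class SymbolRangeUnmasker {

    // Range from the discussion above: U+F041..U+F05A corresponds to 'A'..'Z'.
    static final char FIRST = '\uF041';
    static final char LAST = '\uF05A';

    /** Unmasks characters in [FIRST, LAST]; copies all others unchanged. */
    static String unmask(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c >= FIRST && c <= LAST ? (char) (c & 0x00FF) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unmask("\uF041\uF05A!")); // prints "AZ!"
    }
}
```

Guarding on the range matters here because a blanket 0x00FF mask would also truncate legitimate characters outside it, such as those in the Ethiopic block.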

Hope this helps .. Jason

Re: Decoding Masked Characters

Posted: Tue Jan 22, 2019 11:00 pm
by dmekonnen
Jason, a belated thanks for investigating the font issue here. Older versions of Windows (up to XP, I believe) treated the font differently: the addresses really were in the lower range. I've now encountered this shifting in one other font; I had totally missed it until you pointed it out.

In case you are tracking projects that use docx4j, my project is here: https://github.com/geezorg/DocxConverter (it wouldn't have been possible without docx4j). Thanks for the great support!