application/octet-stream MIME Type

Posted: Thu Aug 31, 2017 4:04 am
by scltul
Hi,

If I create a .docx file using MS Word and run "file -i myfile.docx" on Linux, I'll get the correct MIME type returned.

If I look at its header, it shows the expected: 50 4B 03 04 14 00 06 00 08.

However, once I use io3.Save and ZipPartStore to save the file, running "file -i mynewfile.docx" on Linux will return "application/octet-stream".

If I look at its header, it shows this instead: 50 4B 03 04 14 00 08 08 08.

Is this expected? My problem is that I'm uploading the generated .docx file to a web-based API that verifies the MIME type. Their check (like my Linux test) returns the "application/octet-stream" which then rejects the .docx file as invalid/corrupt.

Has anyone else run into something similar with the generated .docx files from Docx4j?

Thanks,

-Colin

Re: application/octet-stream MIME Type

Posted: Thu Aug 31, 2017 11:33 am
by jason
Hi Colin

If you have a look at https://en.wikipedia.org/wiki/Zip_(file_format) you'll see that the two bytes that differ are the "General purpose bit flag" field. https://users.cs.jmu.edu/buchhofp/foren ... pkzip.html may explain the meaning of the flags.

But you shouldn't be using those bytes to guess MIME type.
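For reference, the two flag values quoted in this thread can be decoded with a few lines of plain Java. This is only a sketch: the class and method names are made up for illustration, and the bit meanings are taken from the ZIP format description (bits 1-2 are compression options, bit 3 means a data descriptor follows the entry, bit 11 means UTF-8 names).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Docx4j.
final class ZipHeaderFlags {

    /** Returns the 16-bit little-endian general purpose bit flag (offsets 6-7). */
    static int flags(byte[] header) {
        if (header.length < 8 || header[0] != 0x50 || header[1] != 0x4B
                || header[2] != 0x03 || header[3] != 0x04) {
            throw new IllegalArgumentException("not a ZIP local file header");
        }
        return (header[6] & 0xFF) | ((header[7] & 0xFF) << 8);
    }

    /** Names the flag bits that are set; only a handful of bits are covered. */
    static List<String> describe(int flags) {
        List<String> out = new ArrayList<>();
        if ((flags & 0x0001) != 0) out.add("encrypted");
        if ((flags & 0x0006) != 0) out.add("compression option bits");
        if ((flags & 0x0008) != 0) out.add("data descriptor follows entry");
        if ((flags & 0x0800) != 0) out.add("UTF-8 names");
        return out;
    }

    public static void main(String[] args) {
        // The two headers quoted in the first post.
        byte[] fromWord = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00, 0x08};
        byte[] fromJava = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x08, 0x08, 0x08};
        System.out.println(describe(flags(fromWord)));
        System.out.println(describe(flags(fromJava)));
    }
}
```

Both headers are valid ZIP local file headers; the flags only describe how the first entry was written.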

In your upload to the web-based API, have you tried explicitly setting the content type correctly?

See for example https://developer.mozilla.org/en-US/doc ... ntent-Type

application/vnd.openxmlformats-officedocument.wordprocessingml.document
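As a rough illustration, this is one way to set that header from Java 11+ using java.net.http. It is a sketch under assumptions: the class name and the endpoint are placeholders, the request is only built and not sent, and your API may want a multipart upload instead of a raw body.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; the endpoint passed in is whatever your API documents.
final class DocxUpload {

    static final String DOCX_MIME =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    /** Builds (but does not send) a POST carrying the .docx bytes as the body. */
    static HttpRequest build(URI endpoint, byte[] docxBytes) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", DOCX_MIME)
                .POST(HttpRequest.BodyPublishers.ofByteArray(docxBytes))
                .build();
    }
}
```

The returned request can then be passed to HttpClient.send(...) with whatever response handler suits the API.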

Re: application/octet-stream MIME Type

Posted: Thu Aug 31, 2017 11:36 am
by scltul
Just a quick follow-up to my post.

The header that's being recognized as an application/octet-stream is a Zip PK header. It's slightly different (as I showed) because Docx4j uses the Java ZipOutputStream to create the .docx file.

Re: application/octet-stream MIME Type

Posted: Thu Aug 31, 2017 4:18 pm
by scltul
Hi Jason,

Thanks for the reply.

The MIME type is being explicitly set, but the API vendor takes the uploaded .docx file and runs an OS-level check on it (something like "file -i") to see whether the specified MIME type matches what the OS reports. If it doesn't, the file gets rejected.

Because of the use of ZipOutputStream the header will always be the PK Zip header and not the slightly different MS Word .docx header, as I outlined.

All I can think to do is ask the vendor to adjust their MIME detection to include this type of .docx file.

-Colin

Re: application/octet-stream MIME Type

Posted: Sat Sep 02, 2017 3:31 am
by scltul
Just a quick follow-up.

I discovered it wasn't the header in the generated .docx file but the order of one of the parts inside it that kept the entry in libmagic's default magic file from matching the correct MIME type.

I made a small change to Load3.get() to ensure that the "relationships/officeDocument" type is stored first in the Relationship list.

This satisfied libmagic's checker (and is the order that MS Word stores it in as well).
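The reordering idea, sketched in plain Java rather than against Docx4j's own classes (the Rel record and the method name are stand-ins for illustration, not the actual change to Load3.get(); records need Java 16+):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-in types for illustration; Docx4j has its own Relationship class.
final class RelationshipOrder {

    static final String OFFICE_DOCUMENT =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    record Rel(String id, String type, String target) {}

    /** Returns a copy with the officeDocument relationship(s) moved to the front. */
    static List<Rel> officeDocumentFirst(List<Rel> rels) {
        List<Rel> sorted = new ArrayList<>(rels);
        // false sorts before true, and List.sort is stable, so every other
        // relationship keeps its original relative position.
        sorted.sort(Comparator.comparing((Rel r) -> !OFFICE_DOCUMENT.equals(r.type())));
        return sorted;
    }
}
```

A stable sort is used so that nothing else in the list moves, which keeps the change as small as possible.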

Hopefully this will help someone else in the future.

-Colin

Re: application/octet-stream MIME Type

Posted: Sat Sep 02, 2017 7:28 am
by jason
Hi Colin, thanks for posting. Nice work! So the difference in bytes 7 and 8 is ok?

Re: application/octet-stream MIME Type

Posted: Sat Sep 02, 2017 11:05 am
by scltul
Yes, they're fine. It was purely the order for the parts in the file. The magic parser expects the word section to come right after the relationship section.
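If anyone wants to compare the part order of a Word-saved file against a generated one, something like this lists the entries in the order they are physically stored. It is a sketch using only java.util.zip; the class and method names are made up, and it does not reproduce libmagic's actual rules.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for inspecting the stored order of parts in a .docx.
final class ZipEntryOrder {

    /** Entry names in the order their local headers appear in the archive. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    /** Builds a small in-memory archive with the given entry names, for trying it out. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write('x'); // placeholder content
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }
}
```

For a real file, pass java.nio.file.Files.readAllBytes(path) to entryNames and check where the word/ part lands relative to _rels/.rels.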