
Bandwidth-efficient document translation

Open roybaer opened this issue 10 years ago • 3 comments

Hi! I just wanted to share the following idea:

Basics:

  • Odt and docx files are ZIP archives
  • Only the textual content is needed for translation
  • Files within ZIP archives are compressed individually

The following process could therefore (perhaps dramatically) reduce the network bandwidth required for document translation:

  1. Read the entire document file into RAM using JavaScript
  2. Copy the (compressed) file chunks corresponding to the textual content to a new data structure and attach a new ZIP header
  3. Submit the stripped-down (but still correct) ZIP file for translation
  4. Reintegrate the response into the original ZIP and update the header
  5. Put everything in a data URL and let the user save it to disk

No API changes would be required, because the client-side script basically just strips unneeded content (e.g. pictures) from the ZIP (i.e. odt or docx) file, and Apertium does not care about those parts anyway.

roybaer avatar Feb 19 '16 16:02 roybaer

Isn't everything gzipped automagically when the server says that it supports compression?

unhammer avatar Feb 21 '16 15:02 unhammer

That is true, but misses the point.

The actual point is that docx and odt files are already ZIP files, and that client-side JavaScript could leave unneeded parts of them out of the transmission with relative ease, i.e. without any decompression and recompression, since the files within a ZIP are independently compressed bitstreams.

Translating an odt file with, say, 20 KB of (compressed) textual content and 5 MB of JPEGs would therefore send only those 20 KB to the server in the first place. The client-side JavaScript would keep the 5 MB of translation-irrelevant, deflate-compressed file bitstreams locally and then reintegrate them into the final odt (internally a ZIP), once again exploiting the internal structure of the ZIP file format to avoid any decompression and recompression.

And thanks to those formats' internal ZIP compression, I strongly hope that the current configuration does not try to gzip odt and docx files, because that would obviously be an utter waste of CPU time.

roybaer avatar Feb 21 '16 21:02 roybaer

OK, I see the point about leaving images out – patches are welcome :-) as long as they're well tested; this seems like it could easily have unintended consequences (Apertium is already quite brittle when it comes to preserving document structure).

But I don't agree that it's worthwhile trying to avoid gzipping zip files just to save some CPU time – compression is part of the HTTP spec (https://en.wikipedia.org/wiki/HTTP_compression) and best left to the browsers to deal with. The algorithms for HTTP compression were chosen because they are known to be CPU-light, so we as developers don't have to make per-file tradeoffs. Also, when I put all the odt files on my system into a folder and make a tar.gz out of it, it goes from 7.9M to 7.3M (and spends only 0m0.350s of user (CPU) time), so even if it didn't add to the complexity of the code, I would still keep gzipping zips.

unhammer avatar Feb 22 '16 08:02 unhammer