
[Suggestion] Upload / Download Compressed Databases

Open chrisjlocke opened this issue 8 years ago • 19 comments

Just a thought, but if I wanted a 68 MB database, that could be a long transfer. Uploading a 68 MB database is even slower - uploads are (normally) slower than downloads. SQLite databases, by their nature, are relatively text-heavy. Is there any way a database can be zipped (or a Linux equivalent?) prior to uploading/downloading? Uploading an 18 MB .zip would be easier than a bloaty 68 MB database... but there is a barrow-load of work to do 'server side' to accept ZIPs... plus the potential for trouble... Anyway. Thought I'd throw it out there.

chrisjlocke avatar Aug 30 '17 10:08 chrisjlocke

Ahhh, yeah. That's a good idea too.

With DB4S, it shouldn't be too hard to get that happening. I just tried compressing the DB4S download stats database (68MB), using bzip2 -9, and that compressed down to 12MB.

With browser uploads though, that might take some more investigation. It'll likely have something to do with negotiation between the browser and server, such that the browser knows it can compress uploads on the fly. Do you have any interest in investigating that? :smile:
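
For what it's worth, the server-side half of that shouldn't be much code. Here's a minimal Go sketch of accepting a gzip-compressed request body when the client flags it with `Content-Encoding: gzip` (the handler name and the assumption that the client sets that header are just for illustration, not anything in DBHub.io yet):

```go
package main

import (
	"compress/gzip"
	"io"
	"net/http"
)

// uploadHandler is a hypothetical handler sketching how the server could
// transparently accept gzip-compressed upload bodies, assuming the client
// sets "Content-Encoding: gzip" on the request.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
	var body io.Reader = r.Body
	if r.Header.Get("Content-Encoding") == "gzip" {
		gz, err := gzip.NewReader(r.Body)
		if err != nil {
			http.Error(w, "invalid gzip data", http.StatusBadRequest)
			return
		}
		defer gz.Close()
		body = gz
	}

	// ... hand "body" off to the usual upload processing path ...
	_ = body
	w.WriteHeader(http.StatusOK)
}
```

As far as I know, browsers don't compress request bodies on their own, so the trickier part is probably the client side of that negotiation.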

justinclift avatar Aug 30 '17 10:08 justinclift

such that the browser knows it can compress uploads on the fly.

I didn't know if it would be easier allowing users to upload .ZIP files, rather than adding an extra layer of browser incompatibility... someone might be using an older browser, causing weird freaky errors. But if they just threw up a .zip, dbhub.io could uncompress that.

I'll have an investigation though. Will rummage around...

chrisjlocke avatar Aug 30 '17 10:08 chrisjlocke

Oops. Oh yeah, uploading .zips should be easy to implement. We'd want to be careful so that the unzip code checks for zip bombs first, but apart from that it should be workable. :smile:

justinclift avatar Aug 30 '17 10:08 justinclift

From a quick look, protecting against zip bombs shouldn't be too hard.

  • https://stackoverflow.com/questions/1459080/how-can-i-protect-myself-from-a-zip-bomb

Something along the lines of the following (there's a rough sketch in Go after the list):

  1. Read the entries in the .zip file
  2. Ensure there's only one database file in there
    • Hmmm, if there's a README or similar, that'd make for an easy way to get the meta-data filled out
  3. Ensure the sizes of the entries aren't "over size" according to whatever limitations are set for the server at the time
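
A rough Go sketch of those checks, using the stdlib archive/zip package. The function name and the 100 MB limit are made up for illustration:

```go
package main

import (
	"archive/zip"
	"fmt"
	"path"
	"strings"
)

// maxUploadSize is an assumed per-upload limit; the real value would come
// from the server configuration.
const maxUploadSize = 100 * 1024 * 1024

// checkZipEntries walks the archive entries, makes sure there's exactly one
// database file (plus maybe a README), and rejects anything whose claimed
// uncompressed size is over the limit. Note the claimed sizes can't be
// fully trusted (see the later comments).
func checkZipEntries(r *zip.Reader) error {
	dbCount := 0
	var total uint64
	for _, f := range r.File {
		name := path.Base(f.Name)
		switch {
		case strings.HasSuffix(name, ".db"), strings.HasSuffix(name, ".sqlite"):
			dbCount++
		case strings.HasPrefix(strings.ToUpper(name), "README"):
			// Could be used to pre-fill the database meta-data.
		default:
			return fmt.Errorf("unexpected entry in archive: %s", f.Name)
		}
		total += f.UncompressedSize64
		if total > maxUploadSize {
			return fmt.Errorf("archive claims to be larger than the %d byte limit", maxUploadSize)
		}
	}
	if dbCount != 1 {
		return fmt.Errorf("expected exactly one database file, found %d", dbCount)
	}
	return nil
}
```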

justinclift avatar Aug 30 '17 12:08 justinclift

Some variation of this looks like it could be useful for generating test data to make sure things are ok:

    https://github.com/abdulfatir/ZipBomb

justinclift avatar Aug 30 '17 12:08 justinclift

Hmmm, apparently the size headers for entries can be mucked around with to give incorrect sizes. 😦

  • https://github.com/thejoshwolfe/yauzl/issues/13#issuecomment-96789940

justinclift avatar Aug 30 '17 12:08 justinclift

You have to admire people who 'break things' so well. The makers of GTA were always impressed people had done things with the game they'd never thought of. Creating a 30 KB file that uncompresses to such a huge size... argh! You have to code everything so tightly to ensure the cheeky gits can't worm their way in.

chrisjlocke avatar Aug 30 '17 13:08 chrisjlocke

Yep. :wink:

For a naive first implementation, streaming the uncompressed data to nowhere on a first pass - just counting the output size - would probably work, with a limiter that stops somewhere useful (100 MB is the current max per upload on the server). If it finishes ok without hitting the limit, then do the uncompress for real.
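
Something like this, as a minimal Go sketch (gzip is used for illustration; for a .zip you'd apply the same counting to each entry's reader, and the function name and limit constant are placeholders):

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
)

// sizeLimit is assumed to match the server's current 100 MB upload cap.
const sizeLimit = 100 * 1024 * 1024

// countDecompressed is the "stream it to nowhere first" idea: decompress
// into a discarding writer, but stop as soon as the output passes the
// limit, so a compression bomb never touches the disk.
func countDecompressed(compressed io.Reader) (int64, error) {
	gz, err := gzip.NewReader(compressed)
	if err != nil {
		return 0, err
	}
	defer gz.Close()

	// Copy at most sizeLimit+1 bytes to io.Discard. If we get the full
	// sizeLimit+1 bytes back, the real payload is over the limit.
	n, err := io.CopyN(io.Discard, gz, sizeLimit+1)
	if err != nil && err != io.EOF {
		return 0, err
	}
	if n > sizeLimit {
		return n, fmt.Errorf("decompressed data exceeds the %d byte limit", sizeLimit)
	}
	return n, nil
}
```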

justinclift avatar Aug 30 '17 13:08 justinclift

You'd still have to be careful (or delete files after). If it stopped after 100 MB, Mr Malicious Git could perform 20 simultaneous uploads, wiping out disk space, etc.

I tried downloading the ZIP file on that GitHub site, but it uncompressed to 10000000000 GB... ;)

chrisjlocke avatar Aug 30 '17 13:08 chrisjlocke

Yep. We'll figure it out. :smile:

justinclift avatar Aug 30 '17 13:08 justinclift

Just tried out a few compression formats, as potentials for providing to people via the download button.

The source data is the ~710 MB Backblaze drive stats database (2013 data only):

  • https://db4s-beta.dbhub.io/justinclift/backblaze-drive_stats.db?commit=bd20928c99826ad56430ff517465bac7d179c09381309dd11256f696e03aebf0

gzip

$ time gzip -9k backblaze-drive_stats.db

real	0m28.655s
user	0m26.870s
sys	0m0.799s

$ ls -lh backblaze-drive_stats.db.gz
-rw-r--r--. 1 jc jc 93M Aug 31 15:34 backblaze-drive_stats.db.gz

zip

$ time zip -9 backblaze-drive_stats.db.zip backblaze-drive_stats.db
  adding: backblaze-drive_stats.db (deflated 87%)

real	0m24.254s
user	0m23.962s
sys	0m0.244s

$ ls -lh backblaze-drive_stats.db.zip 
-rw-rw-r--. 1 jc jc 93M Sep  4 14:25 backblaze-drive_stats.db.zip

bzip2

$ time bzip2 -9k backblaze-drive_stats.db 

real	1m40.247s
user	1m38.432s
sys	0m0.594s

$ ls -lh backblaze-drive_stats.db.bz2 
-rw-r--r--. 1 jc jc 65M Aug 31 15:34 backblaze-drive_stats.db.bz2

xz (multi-threaded mode on an old quad core i5-750)

$ time xz -k -9e -T 0 -vv backblaze-drive_stats.db
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=273,mf=bt4,depth=512
xz: Using up to 4 threads.
xz: 4,997 MiB of memory is required. The limiter is disabled.
xz: Decompression will need 65 MiB of memory.
backblaze-drive_stats.db (1/1)
  100 %        41.5 MiB / 713.0 MiB = 0.058   3.1 MiB/s       3:51             

real	3m51.213s
user	13m19.875s
sys	0m4.855s

$ ls -lh backblaze-drive_stats.db.xz
-rw-r--r--. 1 jc jc 42M Aug 31 15:34 backblaze-drive_stats.db.xz

xz (single threaded mode on the same old quad core i5-750)

$ time xz -k -9e -vv backblaze-drive_stats.db
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=273,mf=bt4,depth=512
xz: 674 MiB of memory is required. The limiter is disabled.
xz: Decompression will need 65 MiB of memory.
backblaze-drive_stats.db (1/1)
  100 %        41.3 MiB / 713.0 MiB = 0.058   909 KiB/s      13:23             

real	13m23.504s
user	13m21.292s
sys	0m0.761s

$ ls -lh backblaze-drive_stats.db.xz
-rw-r--r--. 1 jc jc 42M Aug 31 15:34 backblaze-drive_stats.db.xz

p7zip (*nix version of 7zip)

$ time 7za a -t7z -m0=lzma2 -mx=9 -mmt=2 -mfb=64 -md=64m -ms=on backblaze-drive_stats.db.7z backblaze-drive_stats.db

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Core(TM) i5 CPU         750  @ 2.67GHz (106E5),ASM)

Scanning the drive:
1 file, 747593728 bytes (713 MiB)

Creating archive: backblaze-drive_stats.db.7z

Items to compress: 1

                                 
Files read from disk: 1
Archive size: 50809124 bytes (49 MiB)
Everything is Ok

real	4m52.235s
user	6m50.903s
sys	0m10.461s

$ ls -lh backblaze-drive_stats.db.7z
-rw-rw-r--. 1 jc jc 49M Sep  4 15:46 backblaze-drive_stats.db.7z

Compression wise, xz in multi-threaded mode looks like the winner.

justinclift avatar Sep 04 '17 15:09 justinclift

Never heard of xz (but I'm on Windows, so .ZIP is god!). 7-Zip is gaining ground, but .rar is semi-god. I presume this depends on the OS... someone isn't going to install something just to upload a database.

chrisjlocke avatar Sep 04 '17 15:09 chrisjlocke

Yeah. In the *nix world "gzip" (not compatible with .zip) and "bzip2" are very widespread. "xz" uses the same compression as 7-Zip, but a different on-disk format.

For uploads, I guess we should support .zip at the very least. For downloads, I'm thinking ".zip" (widespread) and ".xz" (as the maximum compression option).

justinclift avatar Sep 04 '17 17:09 justinclift

See https://www.blender.org/download/ as an example of how you could provide formats. I usually go for a good .tar.bz2 when available and fall back to zip, which should always be available. As Chris says, nobody is going to download 7zip to get access to a 1 MB super-compressed file faster when the 10 MB zip is there.

iKlsR avatar Sep 04 '17 17:09 iKlsR

@iKlsR Ahhh, that seems like a reasonable approach. Thanks. :smile:

justinclift avatar Sep 04 '17 17:09 justinclift

@MKleusberg @chrisjlocke As a data point, I just added gzip compression support to the DB4S connection handler code running on DBHub.io.

It seems to be working fine, and it also seems to speed transfers up a lot. :smile:
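
For anyone curious, on-the-fly response compression in Go usually looks roughly like the following. This is just a generic sketch of the pattern, not the actual DBHub.io handler code:

```go
package main

import (
	"compress/gzip"
	"net/http"
	"strings"
)

// gzipResponseWriter wraps the normal ResponseWriter so everything written
// to it goes through gzip first.
type gzipResponseWriter struct {
	http.ResponseWriter
	gz *gzip.Writer
}

func (w gzipResponseWriter) Write(b []byte) (int, error) {
	return w.gz.Write(b)
}

// withGzip is a hypothetical middleware showing the general shape of
// streaming response compression, applied only when the client advertises
// gzip support via Accept-Encoding.
func withGzip(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close()
		next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
	})
}
```

One side effect of streaming compression like this is that the response generally no longer carries a Content-Length header, which ties in with the progress bar issue below.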

justinclift avatar Jul 09 '20 10:07 justinclift

It does however look like the progress bar in DB4S doesn't work anymore. Probably because the download size isn't known in advance.

MKleusberg avatar Jul 09 '20 10:07 MKleusberg

Oh. Damn. We might need to use curl or something to capture the response headers and see what's being transferred. I was under the impression it'd be "transparent" so wouldn't change anything. :frowning:
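
In the same spirit as the curl idea, a quick Go snippet for capturing the raw headers could look like this (the URL is just a placeholder for whichever database is being tested):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "https://example.com/some-database.db", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept-Encoding", "gzip")

	// Disable the transport's automatic decompression so we see the raw
	// Content-Encoding / Content-Length / Transfer-Encoding values.
	client := &http.Client{Transport: &http.Transport{DisableCompression: true}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("Content-Encoding: ", resp.Header.Get("Content-Encoding"))
	fmt.Println("Content-Length:   ", resp.ContentLength) // -1 means unknown
	fmt.Println("Transfer-Encoding:", resp.TransferEncoding)
}
```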

justinclift avatar Jul 09 '20 10:07 justinclift

As an alternative thought, instead of doing streaming compression it might be feasible to gzip compress the requested database completely first, then transfer the compressed file.

That might allow the transfer to give an accurate count of the # of bytes expected to be sent over the wire.

For small databases (eg most of them), the difference in start time for the download wouldn't be noticeable. For larger ones though... it probably would be.

We should probably investigate what info is actually being sent already first, and in what order, just in case there's a way for the streaming approach to work. :smile:
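
A sketch of what the "compress it completely first" idea could look like in Go, writing to a temporary file so the exact compressed size is known before the transfer starts (the helper name and paths are placeholders, not the real handler):

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"os"
)

// serveCompressed gzips the whole database to a temporary file up front,
// so an accurate Content-Length can be sent and the DB4S progress bar has
// a known total to work against.
func serveCompressed(w http.ResponseWriter, dbPath string) error {
	src, err := os.Open(dbPath)
	if err != nil {
		return err
	}
	defer src.Close()

	tmp, err := os.CreateTemp("", "dbhub-gz-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	// Compress the whole database before sending anything.
	gz := gzip.NewWriter(tmp)
	if _, err := io.Copy(gz, src); err != nil {
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}

	// Now the compressed size is known, so Content-Length can be accurate.
	info, err := tmp.Stat()
	if err != nil {
		return err
	}
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		return err
	}
	w.Header().Set("Content-Encoding", "gzip")
	w.Header().Set("Content-Length", fmt.Sprint(info.Size()))
	_, err = io.Copy(w, tmp)
	return err
}
```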

justinclift avatar Jul 13 '20 22:07 justinclift