Alternative dir structure for more compact storage and proximity clustering

Open nyurik opened this issue 3 years ago • 35 comments

I would like to propose some changes to the directory structure, but these might be totally irrelevant due to my misunderstanding.

The current directory entry is fixed at 17 bytes, stores x and y as individual values, and requires an x,y sort order. I think all of those may benefit from a bit of restructuring:

  • Combine the x and y values into a single interleaved value (z-curve). This will make tiles more locality-clustered, possibly reducing the number of range requests (web). A sketch of the interleaving follows this list.
  • Make the size of the combined x,y value depend on the zoom level, rounded up to the nearest byte boundary: zooms 0..3 -- 1 byte, 4..7 -- 2 bytes, etc. Since all leaf nodes are required to be at the same zoom, directory entries only need to be the same size within a single directory block, rather than everywhere.
  • For the root dir, it could still be 17 bytes (although 6 bytes seems a bit too much -- zooms 16..19 only require 5 bytes, so unless pmtiles wants to support more... and even that could be made flexible).
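
A minimal Rust sketch of the interleaving idea, assuming a plain Morton/z-order curve (the function names and the zoom-14 example values are illustrative, not from any PMTiles code):

/// Spread the lower 32 bits of `v` so they occupy the even bit positions.
fn spread_bits(mut v: u64) -> u64 {
    v &= 0xffff_ffff;
    v = (v | (v << 16)) & 0x0000_ffff_0000_ffff;
    v = (v | (v << 8)) & 0x00ff_00ff_00ff_00ff;
    v = (v | (v << 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    v = (v | (v << 2)) & 0x3333_3333_3333_3333;
    v = (v | (v << 1)) & 0x5555_5555_5555_5555;
    v
}

/// Interleave x and y into a single z-order (Morton) value:
/// bit i of x lands at bit 2*i, bit i of y at bit 2*i + 1.
fn morton_encode(x: u32, y: u32) -> u64 {
    spread_bits(x as u64) | (spread_bits(y as u64) << 1)
}

fn main() {
    // Neighbouring tiles at zoom 14 end up with nearby Morton codes,
    // which is the clustering property the proposal is after.
    println!("{}", morton_encode(8190, 5448));
    println!("{}", morton_encode(8191, 5448));
}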

nyurik avatar Aug 03 '22 22:08 nyurik

This is addressed in the upcoming v3 design:

https://github.com/protomaps/PMTiles/blob/spec-v3/spec/v3_design_doc.md

  1. This goes one step further than combining x and y by combining z,x,y into a single value.
  2. I think you are referring to varint-encoding since most values are small? This is also addressed in the v3 design as directory entries are varint-encoded when serialized to disk, although the fixed-width encoding is used for efficient access in memory.

bdon avatar Aug 03 '22 23:08 bdon

see notes in https://github.com/protomaps/PMTiles/issues/41 as well.

Experimentally, running https://github.com/protomaps/PMTiles/tree/spec-v3/spec/converter on the z14 planetiler output gets to an effective <2 bytes per tile entry after the RLE + delta + varint + gzip compression steps.

bdon avatar Aug 03 '22 23:08 bdon

Thanks @bdon , this is awesome!!! A few minor questions:

  • Would it make sense to default to Brotli rather than gzip? Moving forward, it seems most web serving will switch to Brotli soonish.
  • Is there a specification for the TileID? The Hilbert curve usually covers 2D, and I am not certain what the 3D variant would look like.
  • Is there a way to encode sequential TileIDs more efficiently, e.g. if you know ahead of time that a block represents sequential tiles and there is no real reason to have each individual tile ID encoded separately?
  • I love the RunLength idea - makes encoding of the identical items easier. How much space does it use when the run length is zero?

nyurik avatar Aug 04 '22 00:08 nyurik

  • I'm leaning towards making the index and tile compression configurable, with the compression type stored in the header. Brotli does not have the breadth of implementations across different languages and environments that gzip has; PMTiles clients will use https://github.com/101arrowz/fflate for decompression, which works across browsers and Lambda.
  • The specification is just the 2D tile IDs arranged in sequence where Z0 is at TileID=0, Z1 starts at TileID 1, Z2 starts at TileID 5 and so on.
  • I think you mean a "dense" representation like https://github.com/openaddresses/TileBase (and I think https://github.com/mactrem/com-tiles/tree/main/packages/spec has some provisions for this as well).
    • In general, I think adding sparse directories complicates the design because then you need to store whether a directory is sparse or dense and branch the logic based off of that. I think a sparse directory + delta encoding of TileIDs accomplishes the same thing after compression, because a complete block of sequential tiles will have "1" as the delta-encoded TileID which should be easy to compress out in the gzip step.
    • The runlength=1 is the general case (runlength=0 is used to indicate a leaf pointer), and it will add an additional 4 bytes to every entry in memory (we could use 2 bytes, meaning a max run length of 65535, which is empirically exceeded by the ocean at around z14 in a planet-scale archive, so that would require multiple entries). But again, with varint+gzip compression this should be optimized out in the directory serialization step.

So in general, the current "v2" design punts on compactness optimizations by having a 1:1 match between directory representation at rest and in memory. Of the 4 compression steps I want to add: RLE, Varint, Delta, GZIP, the Varint+Delta parts should do a sufficient job of removing redundancy.
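
To illustrate the Varint + Delta idea (a sketch only, not the converter's actual serialization code): sorted TileIDs are delta-encoded and the deltas are written as LEB128 varints, so a block of sequential tiles becomes a run of single-byte 1s that the gzip step then removes almost entirely.

/// Write `v` as an unsigned LEB128 varint (7 bits per byte, high bit = continuation).
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Delta-encode a sorted list of TileIDs and varint-encode the deltas.
fn encode_tile_ids(tile_ids: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut last = 0u64;
    for &id in tile_ids {
        write_varint(id - last, &mut out); // sequential IDs produce deltas of 1
        last = id;
    }
    out
}

fn main() {
    // 1000 sequential tiles: the first delta takes 4 bytes, the rest 1 byte each,
    // and the repeated 0x01 bytes compress to almost nothing under gzip.
    let ids: Vec<u64> = (5_000_000..5_001_000).collect();
    assert_eq!(encode_tile_ids(&ids).len(), 4 + 999);
}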

bdon avatar Aug 04 '22 00:08 bdon

Thanks, it does make sense. Have you had any trial runs comparing the space cost of sparse vs dense for a planet file? You could also have a bit-flag field to indicate the presence of fields, plus a new value with the number of dense entries to follow, just like with the run length. The entry would then have tileid, optional blocklength N (default=1), optional runlength (default=1), plus an array of N offsets+lengths. TBH, I don't think this will make the design significantly more complicated given how many decoding steps are already required. But again, this might be a premature optimization until you can run a space test.

Do you have sample code for decoding the TileID? Seems like it would have to be a weird one, like:

if id < 1:            # 4^0 tiles at zoom 0
  return 0, 0, 0
if id < 1 + 4:        # plus 4^1 tiles at zoom 1
  x, y = decode(id - 1)
  return 1, x, y
if id < 1 + 4 + 16:   # plus 4^2 tiles at zoom 2
  x, y = decode(id - 5)
  return 2, x, y
...

nyurik avatar Aug 04 '22 01:08 nyurik

Yes, that decoding routine is roughly right, though you can use a for-loop and an accumulator:

https://github.com/protomaps/PMTiles/blob/spec-v3/spec/converter/tile_id.go#L71
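
A minimal Rust sketch of that loop-and-accumulator approach (an approximation of the linked Go routine, not a copy): subtract the 4^z tiles contributed by each zoom until the remainder is the position on the 2D curve within that zoom. Mapping that position back to x,y is the 2D curve step (Hilbert in the v3 draft) and is omitted here.

/// Split a combined TileID into (zoom, position-within-zoom).
/// Zoom z contributes 4^z IDs, so Z0 starts at 0, Z1 at 1, Z2 at 5, ...
fn tile_id_to_zoom_and_pos(id: u64) -> (u8, u64) {
    let mut acc: u64 = 0;
    for z in 0u8..32 {
        let tiles_at_zoom = 1u64 << (2 * z); // 4^z
        if id < acc + tiles_at_zoom {
            return (z, id - acc);
        }
        acc += tiles_at_zoom;
    }
    panic!("tile id out of range: {id}");
}

fn main() {
    assert_eq!(tile_id_to_zoom_and_pos(0), (0, 0));
    assert_eq!(tile_id_to_zoom_and_pos(1), (1, 0));
    assert_eq!(tile_id_to_zoom_and_pos(5), (2, 0)); // Z2 starts at TileID 5
}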

I think the dense format would complicate the encoding routine, which is currently fairly simple for those 4 steps: https://github.com/protomaps/PMTiles/blob/spec-v3/spec/converter/convert.go#L53 and which I have to implement in at least six different languages. Any dense run would be terminated by a RunLength > 1, since the TileIDs of two consecutive entries would not be contiguous. It's worth experimenting with and maybe including if the size savings are significant; the current design achieves a total compressed directory size of 80-90 MB for a z14 planet basemap dataset, which I'm satisfied with.

bdon avatar Aug 04 '22 02:08 bdon

Thinking a bit out loud, the way I would probably do this is to re-use RunLength for dense sequences: a positive RunLength N means that N (offset,length) pairs follow corresponding to sequential Tile IDs, a negative RunLength N means the next -N tile IDs are an identical offset/length.
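
To make that concrete, a hypothetical in-memory shape for such a signed-RunLength entry might look like the following (purely illustrative of the idea above, not part of the v3 draft):

/// Hypothetical entry for the signed-RunLength idea: no longer fixed-width,
/// since a dense run carries one (offset, length) pair per sequential TileID.
enum Run {
    /// RunLength > 0: N explicit (offset, length) pairs for N sequential TileIDs.
    Dense { first_tile_id: u64, locations: Vec<(u64, u32)> },
    /// RunLength < 0: the next -N TileIDs all point at the same offset/length.
    Identical { first_tile_id: u64, count: u32, offset: u64, length: u32 },
}

impl Run {
    /// Resolve a TileID within this run to its (offset, length), if covered.
    fn find(&self, tile_id: u64) -> Option<(u64, u32)> {
        match self {
            Run::Dense { first_tile_id, locations } => {
                let i = tile_id.checked_sub(*first_tile_id)? as usize;
                locations.get(i).copied()
            }
            Run::Identical { first_tile_id, count, offset, length } => {
                let i = tile_id.checked_sub(*first_tile_id)?;
                (i < *count as u64).then_some((*offset, *length))
            }
        }
    }
}

fn main() {
    let run = Run::Identical { first_tile_id: 1000, count: 3, offset: 0, length: 512 };
    assert_eq!(run.find(1002), Some((0, 512)));
    assert_eq!(run.find(1003), None);
}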

bdon avatar Aug 04 '22 02:08 bdon

LOL, I was thinking along the same lines :))) RunLength should be moved up though, before the offset/len fields (just to keep it logical).

nyurik avatar Aug 04 '22 03:08 nyurik

WRT decoding the tileid -- it feels expensive, TBH: the higher the zoom, the more steps are required to parse it. That said, tight loops without memory access are very fast, so it might not be too bad.

nyurik avatar Aug 04 '22 03:08 nyurik

@msbarry has benchmarked the decoding/encoding routine when the Hilbert code was still in here https://github.com/onthegomap/planetiler/pull/266/files#diff-9e784e32e5ed72b6cd67c89789e4b4f0fe3082810b9f839b357b0ccd11de58d4 and the decoding step was trivial relative to other things happening.

bdon avatar Aug 04 '22 03:08 bdon

We found a Hilbert implementation fast enough that computing the start offsets became the bottleneck. The fastest way we found was to precompute a lookup table, then iterate through it backwards to get the zoom for an index, something like:

  // Start TileID for each zoom: 0, 1, 5, 21, ... (cumulative sum of 4^z)
  private static final int[] ZOOM_START_INDEX = new int[16];

  static {
    int idx = 0;
    for (int z = 0; z <= 15; z++) {
      ZOOM_START_INDEX[z] = idx;
      int count = (1 << z) * (1 << z); // 4^z tiles at zoom z
      idx += count;
    }
  }

  private static int startIndexForZoom(int z) {
    return ZOOM_START_INDEX[z];
  }

  private static int zoomForIndex(int idx) {
    // walk backwards: the first start index <= idx gives the zoom
    for (int z = 15; z >= 0; z--) {
      if (ZOOM_START_INDEX[z] <= idx) {
        return z;
      }
    }
    throw new IllegalArgumentException("negative index: " + idx);
  }

msbarry avatar Aug 04 '22 09:08 msbarry

If the entire leaf is the same zoom, can we just keep the zoom separate?

nyurik avatar Aug 04 '22 13:08 nyurik

There is no concept of zoom in leaves, as that is implied by the TileId. A directory might cross zoom levels; this can be as many as eight zoom levels in the root. The TileId encodes (z,x,y) together so that a set of tiles at different zoom levels can be represented by a single set of unique integers.

bdon avatar Aug 04 '22 13:08 bdon

I was referring to the "All leaf directories in a single directory must have the same Z value (new in v2)" rule. Come to think of it, it's not a big deal because 1) it's delta-encoded, and 2) a decoder is unlikely to decode the value anyway - most often it will be an "encode and look up" rather than an iteration over existing entries. And even if it is decoding, it would cache the current zoom and just check the bounds. So yes, a single value, while not intuitive, might work just as well.

nyurik avatar Aug 04 '22 15:08 nyurik

P.S. Once the v3 spec is stable, I could try to hack this together in Rust unless someone has already implemented it (I haven't found one).

Also, it might make sense to get some feedback on v3 spec from those who worked on OSM pbf data and similar formats - i.e. @b-r-u @joto @pka @bjornharrtell ...

nyurik avatar Aug 05 '22 01:08 nyurik

There isn't much overlap between the OSM PBF format and PMTiles, as PMTiles is restricted purely to Z/X/Y addressing and not geometry or geographic coordinates - for that you might be interested in https://github.com/protomaps/OSMExpress which is meant to mirror OSM PBF transactionally on disk.

It's worth attempting to benchmark what a change to the entry data structure that encodes non-duplicates as run lengths would look like - this would make the underlying entry struct non-fixed-width, because it would have a nested array of Offsets/Lengths for each TileID. My hypothesis is that this could save 20-30% on the in-memory representation at the cost of making access and logic more complicated. I'm not sure the 20-30% would translate to savings on disk after compression, because the relevant columns (RunLength, TileID) are both delta-encoded and laid out in columns in the current draft implementation, so they should compress almost perfectly.

bdon avatar Aug 06 '22 18:08 bdon

"Nested entries" for serialization in 5ec3a8188e2cfc5f9e5a23d88ac1c1f793ab457b

This counterintuitively increased the total compressed index size from ~92 MB for a planet z14 basemap tileset to ~101 MB. Double-checking whether I made any errors in the implementation...

bdon avatar Aug 07 '22 09:08 bdon

Could it be due to negative numbers being stored as 8 bytes each, with all the high bits set?

nyurik avatar Aug 07 '22 10:08 nyurik

No - the signed varints are zigzag-encoded, so negative values should still be small.
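
Zigzag maps values near zero - negative or positive - to small unsigned integers before varint encoding, so a delta of -1 still costs a single byte. A minimal sketch:

/// Zigzag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
fn zigzag(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

fn main() {
    assert_eq!(zigzag(0), 0);
    assert_eq!(zigzag(-1), 1);
    assert_eq!(zigzag(1), 2);
    assert_eq!(zigzag(-2), 3);
}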

The root cause here is another optimization which zeroes out offsets if the tile data is contiguous, leaving only the length. The code as-is only does the zeroing within a Run, while the previous code worked across all Entries in a serialized directory. I've changed this to work across all offsets like before; the compressed size beats the non-nested version very slightly on a small Tippecanoe dataset (1438 -> 1426 bytes). Benchmarking this on the planet now.

bdon avatar Aug 07 '22 11:08 bdon

The nested method is 91.3 megabytes total on the test dataset vs 92.8 megabytes non-nested.

bdon avatar Aug 07 '22 13:08 bdon

Thanks @bdon for doing the analysis! So 1.6+% savings but only applicable to the metadata. It does make online access slightly faster, but doesn't seem to be significant enough to make a difference, unless we can come up with some trick to optimize it even further.

If it is not too hard, could you try Brotli compression, just to get a ballpark estimate of how much we could potentially save in the future? Not a biggie though.

nyurik avatar Aug 07 '22 17:08 nyurik

It shouldn't be hard; I will also try Zstandard tonight, though I imagine the savings will be marginal for small files and high-entropy numbers instead of strings.

bdon avatar Aug 07 '22 18:08 bdon

running brotli test in https://github.com/protomaps/PMTiles/tree/brotli

bdon avatar Aug 07 '22 19:08 bdon

Note on the nested memory savings:

With the existing non-nested implementation, the complete planet has about 40,000,000 entries, where each entry is 8 + 4 + 8 + 4 (tileid, runlength, offset, length) bytes = ~960 MB in memory, which seems reasonable to me given that the whole point of this design is not needing the entire index in memory at once.

By shortening RunLength to 2 bytes instead (the vast majority are just zeroes), we can shave this down to ~880 MB, but that doesn't seem like a huge win.

bdon avatar Aug 07 '22 20:08 bdon

Brotli reduces the index size from 92.8 megabytes to 73.6 megabytes with the compression quality set to 11 (slowest), which is a nice savings; it won't be practical until most runtimes (Python, Node, browsers via the DecompressionStream web standard) support it, though.

bdon avatar Aug 07 '22 22:08 bdon

zstd with the default compression level (3) produces a 91.2 MB index, so almost the same as gzip; the higher compression levels are not implemented in this pure-Go port, and support is generally behind Brotli.

bdon avatar Aug 08 '22 10:08 bdon

Nice! <1GB for the in-memory representation is great - a tile server could easily hold that in memory and serve the raw tiles with a single lookup.

For planetiler, I looked into Brotli support for Java; it's really close but might need some pushing to make it generally usable. There is code for an official Java wrapper around the native Brotli library, but it's not published to Maven Central. jvm-brotli was publishing it, but it looks abandoned now, and brotli4j might be a better alternative - but it doesn't support M1 Macs yet so I can't use it... Here's a recent Reddit discussion.

My 2 cents would be to stick with gzip for compatibility, and work towards brotli when there's broad enough support across browsers and languages...

msbarry avatar Aug 08 '22 11:08 msbarry

here's the real-world z14 test dataset I'm using to benchmark compression:

https://protomaps-static.sfo3.digitaloceanspaces.com/dir.bin.gz

The gzipped file is 293 MB, the ungzipped file is 981 MB. The file is simply a sequence of (TileId uint64, RunLength uint32, Offset uint64, Length uint32) records in little-endian order: 24 bytes per entry; 981,227,232 bytes total = 40,884,468 entries.
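
For anyone who wants to poke at the dump, a minimal Rust sketch of reading one 24-byte record under that layout (the local filename is assumed; this is not part of the converter):

use std::io::Read;

/// One 24-byte record: TileId u64, RunLength u32, Offset u64, Length u32, little-endian.
#[derive(Debug)]
struct RawEntry {
    tile_id: u64,
    run_length: u32,
    offset: u64,
    length: u32,
}

fn parse_entry(buf: &[u8; 24]) -> RawEntry {
    RawEntry {
        tile_id: u64::from_le_bytes(buf[0..8].try_into().unwrap()),
        run_length: u32::from_le_bytes(buf[8..12].try_into().unwrap()),
        offset: u64::from_le_bytes(buf[12..20].try_into().unwrap()),
        length: u32::from_le_bytes(buf[20..24].try_into().unwrap()),
    }
}

fn main() -> std::io::Result<()> {
    // 981,227,232 bytes / 24 bytes per record = 40,884,468 entries.
    let mut f = std::fs::File::open("dir.bin")?; // assumed local name of the un-gzipped dump
    let mut buf = [0u8; 24];
    f.read_exact(&mut buf)?;
    println!("{:?}", parse_entry(&buf));
    Ok(())
}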

So a naive gzipping of this dataset is 981/293 = 3.34x compression ratio.

The PMTiles compression technique as is results in a 92 MB unpartitioned directory, a ~10.6x compression ratio.

Analysis of how each compression step affects the result (removing one step at a time from the complete algorithm):

  • Delta encoding of TileID: 171 MB
  • Varint encoding: 118 MB
  • Columnar transformation: 102 MB
  • Zeroing of contiguous offsets (only effective in a clustered archive): 182 MB
  • General-purpose gzip compression: 243 MB

So in order of most to least effective of these 5 steps: gzip, offset zeroing, delta encoding, varint encoding, columnar

bdon avatar Aug 08 '22 19:08 bdon

I keep thinking that it might make sense to significantly reduce the memory footprint by using u16 instead of u32. A tiny change to the encoding logic is worth a ~15% server memory reduction.

nyurik avatar Aug 08 '22 21:08 nyurik

@nyurik For most languages, in practice I think it will not make a difference, since the struct as a whole will need to be aligned to 8 bytes on 64-bit systems; 22 bytes will get padded to 24. Can you check this in Rust?
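
A quick Rust sketch one could run to check the padding claim (field names and layout are illustrative):

use std::mem::{align_of, size_of};

struct EntryU32 {
    tile_id: u64,
    offset: u64,
    length: u32,
    run_length: u32, // 8 + 8 + 4 + 4 = 24 bytes of fields
}

struct EntryU16 {
    tile_id: u64,
    offset: u64,
    length: u32,
    run_length: u16, // 8 + 8 + 4 + 2 = 22 bytes of fields
}

fn main() {
    // On a typical 64-bit target the u64 fields force 8-byte alignment,
    // so the 22 bytes of the u16 variant are padded back up to 24.
    println!(
        "u32 entry: {} bytes (align {}), u16 entry: {} bytes (align {})",
        size_of::<EntryU32>(), align_of::<EntryU32>(),
        size_of::<EntryU16>(), align_of::<EntryU16>()
    );
}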

bdon avatar Aug 08 '22 22:08 bdon