forest icon indicating copy to clipboard operation
forest copied to clipboard

WIP: Archive index design

Open aatifsyed opened this issue 2 years ago • 0 comments

WIP: Archive index design

Background

For the purposes of this discussion, an IPLD Store is a Mapping of CID to (binary) data[^1].

The industry standard for Mappings of this scale is to use a key-value database. Databases may be embedded (co-resident) or client-server (maybe co-resident).

[^1]: Breaking this assumption would unlock using graph databases. I think we should look into that at some point.

Motivate and problem statement

Archival nodes need an IPLD Store for the entire history of a chain. For mainnet, at time of writing this is O(50TiB), with O(50G) CIDs.

A key selling point for Forest is viability as an archival node on commodity hardware.

This is currently implemented using many .forest.car.zst files, but lookup performance is relatively poor, so we should consider other solutions for beta users.

Evaluation criteria

  • Disk usage
  • Incremental updates
    • Unlike e.g snapshot users, archive users aren't going to want to rebuild their database every night.
  • Convenience
    • Import/export tooling
    • Compatability with e.g snapshot files
  • Lookup latency
  • Implementation complexity

Existing IPLD Stores on Filecoin nodes

Every Filecoin node needs an IPLD Store.

Forest engineers have experience with the following IPLD stores.

They are all embedded, presumably because:

  • embedded is simpler/respoects the node boundary.
  • embedded suits this scale of database.

Rocksdb

erDiagram

rocksdb-ipldstore {
  u8[] cid PK
  u8[] data
}

ParityDB

erDiagram

paritydb-ipldstore {
  u8[] cid PK
  u8[] data
}

.forest.car.zst files

erDiagram

cid one -- one or more inline-index: hash
inline-index {
  u64 hash
  u64 offset
}
inline-index one -- one file: lookup
file {
  u64 offset PK
  u8[] data
}

Candidates

MySQL

Datatypes cheat sheet:

MySQL Rust (unsigned)
TINYINT u8
SMALLINT u16
MEDIUMINT u24[^2]
INT u32
BIGINT u64

[^2]: not a core primitive type

Aside: storing CIDs

Which scales better?

Here is a faithful translation of CidGeneric<64>, the default cid::Cid

erDiagram

faithful-cid64 {
  ENUM version PK
  BIGINT cid_codec PK
  BIGINT hash_code PK
  VARBINARY(64) hash_digest PK
}

Most hashes look like this

Keep a table for each variant

erDiagram

base32-cidv1-raw-blake2b-256 {
  BINARY(32) digest PK
}

(or somewhere in between)

consideration: what would require code changes?

Custom Binary Search

Custom Hash Table

SQLite3

Evaluation

Evaluation at mainnet scale

Solution Disk usage Latency Inc Convenience Complexity

aatifsyed avatar Dec 06 '23 03:12 aatifsyed