mkfs.bcachefs defaults to block size 512 for a device with block size 512
Public documentation says the block size defaults to 4096.
bcachefs-tools built at b34d1341919c4edd3251c7f69c2588acd1196d71
I used mkfs.bcachefs rather than bcachefs format, but from reading the code, that should make no difference.
The man page does list defaults for a few things, but not the block size.
Looking at the code, this line:
opt_set(fs_opts, block_size, max_dev_block_size);
could be replaced with:
opt_set(fs_opts, block_size, max(4096, max_dev_block_size));
Or the documentation could be adjusted to say "block_size (default: maximum of all supplied device block sizes)".
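For concreteness, a minimal sketch of what the code fix could look like in context (the loop shape, nr_devs, devs[] and get_blocksize() are assumed names for illustration, not necessarily the actual bcachefs-tools code):

unsigned i, max_dev_block_size = 512;

/* hypothetical loop over member devices; helper name is assumed */
for (i = 0; i < nr_devs; i++)
        max_dev_block_size = max(max_dev_block_size,
                                 get_blocksize(devs[i].path, devs[i].fd));

/* floor the computed maximum at 4096 so the documented default holds,
 * but only when the user didn't supply a block size explicitly */
if (!opt_defined(fs_opts, block_size))
        opt_set(fs_opts, block_size, max(4096U, max_dev_block_size));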
It would be nice to have stub man pages for the mkfs.* commands explaining that they are equivalent to ...
Total pedantry: unless you are checking that all block sizes are powers of two, you should be calculating the LCM instead of the max, initialising lcm_dev_block_size to 1 (not 0), and checking that the user-supplied block size modulo the LCM of the device block sizes is zero.
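A minimal sketch of that pedantic variant (gcd/lcm spelled out since neither is guaranteed to be in scope; nr_devs, dev_block_size[] and die() are assumed names):

static unsigned gcd(unsigned a, unsigned b)
{
        while (b) {
                unsigned t = a % b;
                a = b;
                b = t;
        }
        return a;
}

static unsigned lcm(unsigned a, unsigned b)
{
        return a / gcd(a, b) * b;
}

unsigned i, lcm_dev_block_size = 1;     /* 1, not 0: lcm with 0 degenerates to 0 */

for (i = 0; i < nr_devs; i++)
        lcm_dev_block_size = lcm(lcm_dev_block_size, dev_block_size[i]);

/* reject a user-supplied block size that no device can serve atomically */
if (opt_defined(fs_opts, block_size) &&
    fs_opts.block_size % lcm_dev_block_size)
        die("block size must be a multiple of %u", lcm_dev_block_size);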
Super annoying bug to run into. Was testing bcachefs on a bunch of old drives with 512-byte block sizes and wanted to add some real drives, which had 4096-byte block sizes, only to have bcachefs tell me my fs was formatted with a 512-byte block size, despite the docs saying the default is 4096.
The desirable block size seems case-dependent.
(Mixing different types of hardware with differing address-translation layers: classic, shingled, and zoned HDDs vs. flash storage.)
NVMe flash-translation-layer modes with small indirection units (IUs)
"NVMe allows even smaller pages – down to 512 bytes. For write amplification this would be even better" (https://vldb.org/pvldb/vol16/p2090-haas.pdf)
Depending on the hardware, the smallest LBA-format mode may not be the fastest; some reports, though, don't measure it as slower (just producing more heat).
Overall,
- A 512e "emulating" HDD may regularly do internal read-modify-write cycles, a performance penalty (head movements) that can be made unnecessary by applying 4k granularity, buffering, and doing "preemptive" write amplification. With HDDs, writing does not cause significant wear.
- Writing to flash memory does cause significant wear. A modern flash-translation-layer controller supposedly does a much better wear-leveling job with any write that is much smaller (be it 4k or 512 bytes) than its internal, physical write-page size. Actually using the smallest advertised MIN-IO block size (indirection unit) to fill up the write pages thus avoids a lot of the write amplification that wears out the flash.
Ideally, could bcachefs map both 512- and 4k-byte MIN-IO device sectors transparently, allowing drives to be mixed and replaced in both directions?
In particular, look at flash storage with internal write-page sizes >>4k, if it collects and aggregates the smaller LBA block writes.
So with a modern or upcoming SSD/NVMe: sending changes at the actual smallest logical sector size that the drive supports (and lsblk --topology reports as MIN-IO) might avoid x-fold write amplification. That is, the device then actually gets to fill its write pages only with changed blocks (>=512 bytes), instead of having up to 4k-512 bytes of unchanged data per 512 changed bytes to re-allocate unnecessarily.
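To make that arithmetic concrete, here is a standalone sketch (Linux-only) that reads the sizes lsblk --topology reports via the BLKSSZGET/BLKPBSZGET/BLKIOMIN ioctls and prints the worst-case rounding component of write amplification for a single 512-byte change; real flash write amplification also depends on the FTL, which this does not model:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <block device>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        int lbs = 0;                            /* logical sector size (LOG-SEC) */
        unsigned int pbs = 0, min_io = 0;       /* physical block size, MIN-IO */
        if (ioctl(fd, BLKSSZGET, &lbs) ||
            ioctl(fd, BLKPBSZGET, &pbs) ||
            ioctl(fd, BLKIOMIN, &min_io)) {
                perror("ioctl");
                close(fd);
                return 1;
        }

        /* Worst case for one 512-byte change: the device rewrites a whole
         * physical block, e.g. 4096 / 512 = 8x amplification at ingest. */
        unsigned int changed = 512;
        unsigned int rewritten = pbs > changed ? pbs : changed;
        printf("logical=%d physical=%u min-io=%u worst-case WA=%.1fx\n",
               lbs, pbs, min_io, (double)rewritten / changed);

        close(fd);
        return 0;
}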
(Concerning sync writes forcing write amplification by writing half-filled pages, or triggering actual read-modify-write cycles: disabling user-space-requested sync writes ("eatmydata"/nosync options) seems a reasonable configuration if one can still get the ordered CoW guarantees, i.e. whenever "get-consistent-state-from-x-seconds-ago" after a crash is good enough for getting the performance and reduced write amplification.)