md: document how to use software RAID with CoreOS
People may want to run CoreOS on software RAID. Can we do this today? Can we help with the coreos-install script?
/cc @marineam
No, I would rather not do raid in the current installer script. If we want to start working on a full featured Linux installer we should do that in a language that isn't shell.
What about software raid outside of the installation script?
At least it would be useful to document what needs to be done, if anything, besides building the raid and running the installer against it. I tried this but couldn't get my machine to boot.
@robszumski software raid is what we are talking about.
@jsierles sounds like we have bugs to fix because my intent is to make that work.
Any news on the software raid documentation?
Would be rather useful
I would also see this as very useful functionality.
Yes, it would be great :)
@philips I saw this commit. But yeah... can anybody tell me where to start if I want software RAID? emerge mdadm?
@pierreozoux mdadm is included in the base images but we haven't played with it at all. Setting up non-root raid volumes should work just the same as on any other distro. Same ol' mdadm command for creating and assembling volumes. You may need to enable mdadm.service if you want to assemble volumes on boot via /etc/mdadm.conf as opposed to using the raid-autodetect partition type and letting the kernel do it. It might be possible to move the root filesystem as long as the raid-autodetect partition type is used, but for that you are almost certainly better off using multi-device support in btrfs.
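For a non-root data volume that would look roughly like the following sketch (the device names, filesystem, and mount point here are placeholders, not a CoreOS-specific recipe):
# build a RAID1 array out of two spare disks
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
sudo mkfs.btrfs /dev/md0
sudo mkdir -p /media/raid && sudo mount /dev/md0 /media/raid
# record the array so mdadm.service can assemble it from /etc/mdadm.conf on boot
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf
sudo systemctl enable mdadm.service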
What certainly won't work right now is installing all of CoreOS on top of software raid; the update and boot processes both assume the ESP and /usr partitions are plain disk partitions.
@marineam Would this constraint of CoreOS also apply to btrfs-raids?
@brejoc multi-device btrfs for the root filesystem should work
What about migrating after install? E.g. migrating to RAID 1 from an installed /dev/sda (one partition, sda1, for demonstration) should be something like this from a rescue CD or similar:
sfdisk -d /dev/sda | sfdisk /dev/sdb
sfdisk --id /dev/sdb 1 fd
mdadm --zero-superblock /dev/sdb1
mdadm --create /dev/md0 --level 1 --raid-devices=2 missing /dev/sdb1
mkfs.btrfs /dev/md0
mkdir /mnt/source; mount /dev/sda1 /mnt/source
mkdir /mnt/target; mount /dev/md0 /mnt/target
cp -a /mnt/source/* /mnt/target
Thereafter the disk mount configuration and the kernel root device in the bootloader need to be changed, and the bootloader needs to be installed to both disks.
Modify /mnt/target/etc/fstab to replace /dev/sda1 with /dev/md0 - but this file is non-existent on CoreOS. The bootloader since 435 seems to be GRUB, which helps, but I cannot find a grub binary, only config in /usr/boot.
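For what it's worth, the generic md way to finish that kind of degraded-mirror migration, once the system really does boot from /dev/md0, would be roughly the following (this overwrites the old root on /dev/sda1, so only after verifying the copy):
# mark the old partition as raid autodetect too, then add it to the degraded array
sfdisk --id /dev/sda 1 fd
mdadm --add /dev/md0 /dev/sda1
# watch the resync
cat /proc/mdstat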
Thoughts?
@warwickchapman just in case you finished your exploration into this topic and came up with a complete solution - or if someone else has - I'd appreciate it if you shared it. I know too little about setting up and messing with RAID / mounts / boot in order to complete this myself. It's not a hard requirement for my use case, but it would help to have RAID so I can use both/all disks in a system. I understand it's also possible to set up a distributed file system like Ceph and let it manage the disks without RAID, and that would work for the use cases I have in mind, but for now I'm happy about any additional complexity I can avoid!
As noted on IRC, for btrfs if raid0 or raid1 is all you need then it is easiest to just add devices to btrfs and rebalance: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
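In its simplest form that is just two commands against the mounted root filesystem (a sketch, with /dev/sdb standing in for the extra disk):
sudo btrfs device add /dev/sdb /
sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /   # or -dconvert=raid0 for striping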
As for md raid, if the partition types are the raid-autodetect type then the raid volume will be assembled automatically. But you can only put the ROOT filesystem on raid; we don't currently support putting the other partitions on anything other than plain disk devices.
@marineam Perfect -- again thanks for the pointer, that was all I needed! Here's a gist with instructions, a script and helper file, plus some reference links to help people get this done quickly and easily :) I've verified that my instance reboots just fine but haven't checked beyond that if I might have messed up things, which could easily be the case given I'm not experienced at messing with the Linux file system!
https://gist.github.com/seeekr/1afa1e5ce3ad6e998367
Thanks, very interesting - I left it at the point I got to and have stuck with OpenVZ for now. Will start testing again.
Forgive my ignorance - does it mean that if I add drives using your script from the gist, I don't need to put any mount units in my cloud-config? Right now I'm testing it on a VirtualBox installation and it looks like btrfs can see all drives (sudo btrfs fi show) after restart, with no mount units.
@agend07 when adding devices to a btrfs filesystem they become a required part of that filesystem so all of them need to be available in order to mount the filesystem in the first place. The discovery of the devices happens automatically so there isn't any extra configuration.
@agend07 I am not that knowledgeable about btrfs (and CoreOS) myself, but as far as I can tell no other changes are necessary, i.e. no additional mount points, and things just keep working after a restart. From the btrfs docs I also get the matching impression that btrfs is a "self-managing" system for lack of a better term.
All clear now - thanks. I was just afraid that even though it works after a restart, it could stop working after a system upgrade without something special in cloud-config. Now I can sleep better.
I believe the docs are a little misleading on this topic:
https://coreos.com/docs/cluster-management/debugging/btrfs-troubleshooting/#adding-a-new-physical-disk - links to https://coreos.com/docs/cluster-management/setup/mounting-storage/, which makes it look like a mount unit in cloud-config is the only way.
I'd probably never have got it working without finding this issue.
@agend07 ah, yes, that is misleading: you either would want to mount the device(s) as an independent volume or add them to the ROOT volume, not both. Referencing that ephemeral storage documentation in the context of adding devices to ROOT is also bad. You do NOT want to add ephemeral devices to the persistent ROOT, because the persistent volume will become unusable as soon as the ephemeral devices are lost.
@robszumski ^^
@agend07 I'm a little unclear on what was misleading; a PR to that doc would be greatly appreciated :)
@robszumski I'm not a native English speaker, I'm not always sure if I understood everything correctly, and I'm probably not the best person to write docs for other people, but:
Here are the steps that worked for me (consolidated into a script after the list):
- find the new drive's name with 'sudo fdisk -l'; let's say it's /dev/sdc
- create one partition on this drive with 'sudo fdisk /dev/sdc' - then 'n' for new partition, choose all defaults with enter, then 'p' to see the changes, 'w' to write them to disk and quit fdisk
- 'sudo mount /dev/disk/by-label/ROOT /mnt'
- 'sudo btrfs device add /dev/sdc1 /mnt'
- 'sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt' - with a link to the btrfs-balance(8) man page
- 'sudo btrfs fi df /mnt' - to see if it worked
- 'sudo umount /mnt' - clean up
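Put together, and with the interactive fdisk step swapped for a non-interactive equivalent so the whole thing can be pasted, the sequence above looks roughly like this (still assuming /dev/sdc is the new disk):
# create one partition spanning the new disk (non-interactive alternative to the fdisk dialog)
echo ',,L' | sudo sfdisk /dev/sdc
# grow the ROOT btrfs volume onto the new partition and convert to raid1
sudo mount /dev/disk/by-label/ROOT /mnt
sudo btrfs device add /dev/sdc1 /mnt
sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
# verify the new layout, then clean up
sudo btrfs fi df /mnt
sudo umount /mnt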
Easiest thing to do would be:
- remove the link to "mounting storage" from "adding a new physical disk"
- add link to seeekr's gist: https://gist.github.com/1afa1e5ce3ad6e998367.git
- add a comment that if all you need is raid 0, 1, or 10 plus snapshots (5 and 6 are not stable as far as I understand), you don't need to mess with software raid or LVM - btrfs has it all and more. Which is basically marineam's comment from above, starting with "As noted on IRC ..."
Actually, another marineam comment starting with "What certainly won't work right now" says that CoreOS on top of software raid would not work at all - it's an Aug 13 comment, not sure what the status is today.
I understand making docs that everybody would find helpful is not an easy task. Thanks for your work, and btw - do you speak Polish? Your last name sounds Polish.
Maybe I'm missing something, but when I install CoreOS using the latest stable release I get a large ext4 filesystem on sda9, not btrfs. Is the information in this thread outdated or beta-only?
@tobia check this: https://coreos.com/releases/#561.0.0
Is there a guide for a root filesystem on raid1 now that we are on ext4?
@agend07 thanks, it wasn't obvious from the rest of the documentation. I didn't understand where all that talk about btrfs was coming from! So let me add my voice to those asking for support for SW RAID1 on the root fs. Rotating disk failure in a server is a very common occurrence. Many leased bare metal servers come with two identical disks for this very purpose, but not with a HW RAID controller, which can have a monthly fee as large as the server itself.
It makes sense to let the user setup the raid themselves with mdadm, because the configurations are too many to have a script handle them. But then the install script, the boot process, and the update should accept—and keep—the given mdX as the root device.
Haven't tried this in a very long time, but it should be possible, after writing the base disk image, to change the ROOT partition type to raid autodetect, wipe the existing FS, set up an md device on it, and then create a new filesystem, label it ROOT, and create a /usr directory in that filesystem. The rest of the fs should get initialized on boot. There is a major limitation though: we don't have a mechanism for applying updates to USR-A/USR-B across multiple disks or on top of an md device. This means that although you can use raid for ROOT for performance, volume size, or disaster recovery purposes, it isn't going to help keep a server running in the event of a disk failure.
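A very rough sketch of those steps, assuming a two-disk mirror, partition 9 as ROOT, and that sgdisk is available wherever you run this (all of that is illustrative, not a tested recipe, and the USR-A/USR-B limitation above still applies):
# replicate the freshly written partition table from the first disk to the second
sgdisk -R /dev/sdb /dev/sda && sgdisk -G /dev/sdb
# switch the ROOT partition to the Linux RAID type on both disks
sgdisk -t 9:fd00 /dev/sda
sgdisk -t 9:fd00 /dev/sdb
# wipe the existing ROOT filesystem and build the mirror
wipefs -a /dev/sda9 /dev/sdb9
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda9 /dev/sdb9
# new filesystem labelled ROOT with an empty /usr; the rest is initialized on first boot
mkfs.ext4 -L ROOT /dev/md0
mount /dev/md0 /mnt && mkdir /mnt/usr && umount /mnt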
Given the complexity of doing this by hand right now and that limitation, I'm not sure how worthwhile it is to do for ROOT. In many cases it will be much easier to place any data you need some durability for on a volume created separately from the CoreOS boot disk; that extra volume could be md, lvm, btrfs, etc.
I read that btrfs was not stable, and so CoreOS changed to ext4 with overlayfs.
Maybe it's time to have a look at btrfs again. The main guy behind btrfs - Mr merlin - is funded by Google after all.