
File transfer is slow

sjg20 opened this issue on Apr 27, 2024 · 10 comments

When writing an 8MB image to a board I see this:

u-boot-rockchip.bin
      8,896,712 100%    1.03GB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 346,038 bytes  received 35 bytes  692,146.00 bytes/sec
total size is 8,896,712  speedup is 25.71

8774144 bytes (8.8 MB, 8.4 MiB) copied, 10 s, 877 kB/s
17376+1 records in
17376+1 records out
8896712 bytes (8.9 MB, 8.5 MiB) copied, 10.1574 s, 876 kB/s

The last bit seems to be the 'dd'.

If I drop the 'oflag=direct' from USBStorageDriver.write_image I get:

4198912 bytes (4.2 MB, 4.0 MiB) copied, 1 s, 4.2 MB/s
17376+1 records in
17376+1 records out
8896712 bytes (8.9 MB, 8.5 MiB) copied, 2.82611 s, 3.1 MB/s

which is a bit better. Why is direct I/O needed?

commit 27087817f168fbfe5594dd4e2603e336abc05834
Author: Jan Luebbe <[email protected]>
Date:   Thu Jun 18 09:13:27 2020 +0200

    driver/usbstoragedriver: use dd with oflag=direct
    
    This avoids write-caching in the kernel's page cache, reducing disruption
    of concurrent processes and making the progress information more useful.
    
    Remove the leftover debug message of the current working directory.
    
    Signed-off-by: Jan Luebbe <[email protected]>

 labgrid/driver/usbstoragedriver.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

This doesn't make a lot of sense to me. Why not let the kernel handle the caching?

sjg20 · Apr 27, 2024

Because labgrid could at some point switch the hardware away from underneath the kernel, e.g. when using a USB-SD-mux or SD Wire device. The write cache needs to be flushed in any case, so my expectation is that you will have to wait for it to be flushed during a sync either way.
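
For illustration, a minimal Python sketch of that flush step before the mux is switched (flush_block_device is a made-up helper, not part of labgrid; /dev/sdk is just the device from the dd examples further down):

    import os

    def flush_block_device(path: str) -> None:
        # Opening the block device and calling fsync() on it waits until
        # all of its dirty pages have been written out to the media.
        fd = os.open(path, os.O_WRONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    flush_block_device("/dev/sdk")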

Emantor · Apr 29, 2024

Thanks for the info

If I understand this correctly, fdatasync should be enough for syncing; direct I/O is not needed and just slows things down. If the hardware disappears while writing or before syncing, the dd will fail either way.

sjg20 · Apr 29, 2024

Perhaps the solution here is to use a separate 'sync' after the dd?
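
Roughly, the idea would look like the sketch below (write_image_then_sync is hypothetical, not the actual USBStorageDriver code; the dd arguments mirror the "fast" command quoted further down in this thread):

    import subprocess

    def write_image_then_sync(image: str, device: str, seek: int = 0) -> None:
        # Plain buffered write: fast, because the kernel's page cache
        # absorbs the data.
        subprocess.run(
            ["dd", f"if={image}", f"of={device}", "bs=512",
             "skip=0", f"seek={seek}"],
            check=True,
        )
        # Separate flush afterwards, so everything really is on the card
        # before e.g. an SD mux switches it away from the host
        # (coreutils sync accepts a path to limit the flush to one device).
        subprocess.run(["sync", device], check=True)

    write_image_then_sync("u-boot-rockchip.bin", "/dev/sdk", seek=64)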

sjg20 · May 19, 2024

@sjg20 I suspect that would work; is it actually any faster to do it that way?

JPEWdev · May 20, 2024

Yes, with the rock2 u-boot-rockchip.bin (8,898,660 bytes):

With my change: image written in 5.7 s
Without it: image written in 10.6 s

I am using quite slow media.

sjg20 · May 23, 2024

When writing large images (more than a few hundred MiB, on hosts with just 1-2 GiB of RAM) without oflag=direct, useful data is discarded from the page cache, disrupting other workloads. Also, dd's progress data is mostly useless while writing to the cache.

What's the dd cmdline that's used in your case? Perhaps it's using 512-byte blocks and your min-io size is larger, triggering read-modify-write cycles (see lsblk -t).

Better approaches are definitely possible (perhaps MADV_PAGEOUT/MADV_DONTNEED/sync_file_range() with some sliding window, depending on what works with blockdevs), but that's a lot more complex than using dd which is available everywhere. bmaptool claims to be faster, so you might try that.
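
As a rough illustration of that sliding-window idea (not labgrid code, untested against block devices; fdatasync plus posix_fadvise(POSIX_FADV_DONTNEED) stands in for sync_file_range here):

    import os

    CHUNK = 1 << 20      # copy 1 MiB per read/write
    WINDOW = 16 << 20    # flush and drop the cache every 16 MiB

    def write_image_windowed(image: str, device: str, seek_bytes: int = 0) -> None:
        src = os.open(image, os.O_RDONLY)
        dst = os.open(device, os.O_WRONLY)
        try:
            os.lseek(dst, seek_bytes, os.SEEK_SET)
            written = last_flush = 0
            while True:
                buf = os.read(src, CHUNK)
                if not buf:
                    break
                view = memoryview(buf)
                while view:                       # handle partial writes
                    view = view[os.write(dst, view):]
                written += len(buf)
                if written - last_flush >= WINDOW:
                    os.fdatasync(dst)             # force the current window out to the device
                    # ... then ask the kernel to drop those pages so the
                    # rest of the page cache is left alone.
                    os.posix_fadvise(dst, seek_bytes + last_flush,
                                     written - last_flush,
                                     os.POSIX_FADV_DONTNEED)
                    last_flush = written
            os.fdatasync(dst)                     # final flush, like conv=fdatasync
        finally:
            os.close(src)
            os.close(dst)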

jluebbe · May 23, 2024

That sounds like an edge case to me (low memory). I could make it use direct if the size is larger than 20MB, perhaps?

$ lsblk -t /dev/sdk
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED       RQ-SIZE  RA WSAME
sdk          0    512      0     512     512    1 mq-deadline       2 128    0B

I do need skip and seek most of the time.

The two versions are:

fast: dd if=/var/cache/labgrid/sglass/1a14c6e43b9fc3e3f68498f4bf72cec3e0de503cac147be7b27f2f5d7fbe682a/u-boot-rockchip.bin of=/dev/sdk bs=512 skip=0 seek=64

slow: dd if=/var/cache/labgrid/sglass/4da536865256736eaa1747b40bc1e90aeab44127b83a0e682046b766bc0b20ce/u-boot-rockchip.bin of=/dev/sdk oflag=direct bs=512 skip=0 seek=64 conv=fdatasync

sjg20 · May 23, 2024

I notice that the fdatasync doesn't slow things down on the one case I am testing here (so we can use that instead of a separate 'sync').

It is the direct I/O that is the problem.

sjg20 · May 23, 2024

> That sounds like an edge case to me (low memory). I could make it use direct if the size is larger than 20MB, perhaps?

The current implementation is driven by the use cases we know about, like our lab with hundreds of places, split into 16 per exporter (PCEngines APU with 4 GB RAM). Many of them use USB-SD-Muxes, writing 500 MiB to 2 GiB images. Keeping the influence on tests running in parallel low is critical there and is what drove the change you cited. That's far from an edge case.

I'd be open to a driver-level attribute to disable oflag=direct, perhaps write-cache, defaulting to false?
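
Purely as a sketch of the shape that could take (class and attribute names are made up here, not actual labgrid code):

    import attr

    @attr.s
    class UsbStorageDriverSketch:
        # Hypothetical attribute: keep today's oflag=direct behaviour by
        # default, let a place opt into cached writes.
        write_cache = attr.ib(default=False,
                              validator=attr.validators.instance_of(bool))

        def _dd_flags(self):
            flags = ["conv=fdatasync"]
            if not self.write_cache:
                flags.append("oflag=direct")
            return flags

With write_cache=True the dd call would drop oflag=direct but keep conv=fdatasync, matching the variant reported above as fast enough.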

Do you get sensible progress output from dd without oflag=direct? Previously it would report high speeds until write-back started and then hang for a long time at the end, which was confusing to users.

What sort of storage device are you using?

jluebbe · May 23, 2024

Thanks for the background as to why this was done.

This is using uSD cards.

Re the driver-level attribute: would that need each board to put the attribute in its environment? Is there some overall setting that could be used? Using direct I/O seems to be a win only for large images on machines without much memory.

sjg20 · May 24, 2024