
Increasing slow batch count over time

Open hiruna72 opened this issue 5 months ago • 4 comments

Hello developers,

Thanks for developing this amazing tool!

I ran two readfish runs at the same time (run A and B) on a machine with two NVIDIA GeForce RTX 4070 SUPER 12GB GPUs, one for each run.

System RAM 64GB, 32 threads
ont-pybasecall-client-lib==7.8.3
minknow_api==6.4.9
dorado_basecall_server Version 7.8.3+f64462b6f, client-server API version 21.0.0
break_reads_after_seconds = 0.8 (default)

The TOML config looks like this:

[mapper_settings.mappy_rs]
fn_idx_in = "reference.mmi"
n_threads = 8

[[regions]]
name = "run_A"
control = false
min_chunks = 0
max_chunks = 16
targets = "A.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

I have observed the following two different patterns of slow batch count over time on the two runs. Is that behaviour normal for run B?

[Two plots: slow batch count over time for run A and run B]

Thank you!

hiruna72 avatar Sep 29 '25 01:09 hiruna72

Thank you for your issue. Give us a little time to review it.

PS. You might want to check the FAQ if you haven't done so already.

This is an automated reply, generated by FAQtory

github-actions[bot] avatar Sep 29 '25 01:09 github-actions[bot]

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Oct 29 '25 02:10 github-actions[bot]

My suspicion from looking at these plots and the log file reported in #406 is that some other process ran on the GPU responsible for run B, starting at the time the slow batches began to accumulate. It would also appear from #406 that readfish on run B was unable to communicate with the basecall server again, and this caused the run to fail. Usually we detect this, but for some reason here we did not. I note that we have observed similar issues when running adaptive sampling with the current version of MinKNOW.

Are your GPUs running anything else at the same time?
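One way to rule this out (a sketch; the allowlist, sample log, and file names are illustrative, not taken from this run) is to log `nvidia-smi`'s per-process view periodically and scan the log for anything other than the expected basecall server:

```python
import csv
import io

# Periodically capture per-process GPU usage, e.g.:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 60 >> gpu.log
# then scan the log for processes other than the expected basecall server.

def unexpected_gpu_processes(csv_text, allowed=("dorado_basecall_server",)):
    """Return rows whose process name matches nothing in the allowlist."""
    rows = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [r for r in rows if not any(a in r["process_name"] for a in allowed)]

# Illustrative excerpt (made up, not output from this run):
sample = """pid, process_name, used_memory [MiB]
1234, /usr/bin/dorado_basecall_server, 8000 MiB
5678, python3 train.py, 2048 MiB
"""
print([r["pid"] for r in unexpected_gpu_processes(sample)])  # ['5678'] -> the stray job
```

Anything that shows up here during the window where slow batches start climbing would be a strong candidate.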

mattloose avatar Oct 29 '25 08:10 mattloose

Hi @mattloose,

Thanks for getting back to me. It's quite unlikely that any other process used the GPU during the readfish run; the system is dedicated solely to running readfish.

That said, if a separate process did momentarily acquire GPU resources, could that cause a compounding effect where the number of slow batches continues to increase indefinitely (i.e., the system never recovers)?
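For intuition on the compounding question: in a simple queue model (toy numbers, not measured from these runs), a one-off stall adds a fixed backlog that drains once throughput again exceeds the arrival rate, whereas permanently degraded throughput makes the backlog, and hence the slow-batch count, grow without bound:

```python
def backlog_over_time(arrivals_per_interval, service_capacities):
    """Queue length after each interval: arrivals accumulate, capacity drains."""
    backlog, history = 0, []
    for capacity in service_capacities:
        backlog = max(0, backlog + arrivals_per_interval - capacity)
        history.append(backlog)
    return history

# One-off stall (intervals 2-3), then full recovery: backlog drains back to zero.
print(backlog_over_time(10, [12, 2, 2, 14, 14, 14, 14, 14]))  # ends at 0
# Throughput stuck just below the arrival rate: backlog climbs every interval.
print(backlog_over_time(10, [12, 2] + [9] * 6))               # keeps growing
```

So an indefinitely increasing slow-batch count would suggest the server's effective throughput never recovered, rather than a single transient GPU contention event.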

As a temporary workaround, we’ve been restarting readfish every 8 hours, which seems to keep things stable for now.

As you mentioned, this might also be related to the MinKNOW version. Interestingly, our second system, running MinKNOW 6.5.14, hasn't shown any abnormal increase in slow batches so far.

hiruna72 avatar Nov 09 '25 07:11 hiruna72