agc
Avoid buffering output sequences
When we specify multiple assemblies with `getset`, agc seems to buffer all the output in memory. If we pipe the agc output to another program (e.g. ropebwt3) that consumes the output slowly, agc will take hundreds of GB of memory for a human pangenome. It would be great if agc used a fixed-size buffer so that it does not consume too much memory when the output is blocked.
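To illustrate what I mean by a fixed-size buffer, here is a minimal Python sketch of the general idea. It is not agc code; `bounded_pipeline` and `max_chunks` are hypothetical names made up for this example. The producer blocks as soon as the buffer is full, so memory stays bounded no matter how slowly the consumer reads.

```python
import queue
import threading


def bounded_pipeline(chunks, max_chunks=4):
    """Pass chunks from a producer thread to a consumer through a fixed-size buffer.

    Hypothetical illustration only: `chunks` stands in for decompressed
    sequence blocks, and the loop below stands in for a slow downstream reader.
    Returns the consumed chunks and the largest buffer occupancy observed.
    """
    buf = queue.Queue(maxsize=max_chunks)  # holds at most max_chunks items
    done = object()                        # sentinel marking end of stream
    peak = 0

    def producer():
        for chunk in chunks:
            buf.put(chunk)  # blocks while the buffer is full (backpressure)
        buf.put(done)

    worker = threading.Thread(target=producer)
    worker.start()

    out = []
    while True:
        peak = max(peak, buf.qsize())  # occupancy never exceeds max_chunks
        item = buf.get()
        if item is done:
            break
        out.append(item)

    worker.join()
    return out, peak
```

With this scheme the memory footprint is roughly `max_chunks` times the chunk size, independent of the total output size.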
Another related but more challenging use case is to replace each FASTA file with a Unix pipe. For example:
ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
<(agc getset -pt1 genomes.agc asm2) \
<(agc getset -pt1 genomes.agc asm3)
In this case, each agc instance may need to load the index into memory (is that right?). Is it possible to retrieve sequences without loading the entire index?
Using the agc APIs would avoid these problems, but for tools that do not use the APIs, it would be good to have a workaround.