
Not to buffer output sequences

Open lh3 opened this issue 9 months ago • 4 comments

When we specify multiple assemblies with getset, agc seems to buffer all the output in memory. If we pipe the agc output to another program (e.g. ropebwt3) that consumes the output slowly, agc will take hundreds of GB of memory for a human pangenome. It would be great if agc used a fixed-size buffer so that it does not consume too much memory when the output is blocked.
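
To illustrate the fixed-size buffer idea, here is a minimal Python sketch. It is not agc's code, and the names `stream_with_bounded_buffer`, `records`, `consume` and `max_buffered` are hypothetical. The producer stands in for the decompression thread and the consumer for the slow downstream reader. The point is only that a bounded queue makes the producer block once the buffer is full, so memory stays capped regardless of how slow the consumer is.

```python
import queue
import threading


def stream_with_bounded_buffer(records, consume, max_buffered=4):
    """Feed `records` to `consume` through a bounded queue.

    Returns the largest queue size observed by the consumer, which can
    never exceed `max_buffered`.
    """
    buf = queue.Queue(maxsize=max_buffered)
    sentinel = object()  # marks the end of the stream

    def producer():
        for rec in records:
            buf.put(rec)  # blocks while the queue is full (backpressure)
        buf.put(sentinel)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    peak = 0
    while True:
        peak = max(peak, buf.qsize())
        item = buf.get()
        if item is sentinel:
            break
        consume(item)  # a slow consumer only delays the producer
    worker.join()
    return peak
```

With this shape, the memory held by the buffer is bounded by `max_buffered` records rather than by the total size of the output.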

Another related but more challenging use case is to replace each FASTA with a unix pipe. For example:

ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
  <(agc getset -pt1 genomes.agc asm2) \
  <(agc getset -pt1 genomes.agc asm3)

In this case, each agc instance may need to load the index into memory (is that right?). Is it possible to retrieve sequences without loading the entire index?

Using the agc APIs would avoid these problems, but for tools that do not use the APIs, it would be good to have a workaround.

lh3, Sep 27 '24 17:09