funannotate

Issues when pipelining with snakemake

Open BenjaminSchwessinger opened this issue 4 years ago • 9 comments

Hi,

I am having an issue when pipelining multiple samples with snakemake. It appears that funannotate (e.g. clean) uses the current working directory to store temporary files. These sometimes get overwritten when files from different runs share the same name. Would it be possible to have an outdir or tmpdir variable for all the funannotate commands, so I could specify where these identically named files get saved instead of the working directory? If you can point me to where in the code this would need fixing, I can have a go as well. Thanks!

BenjaminSchwessinger, May 07 '21 06:05

Okay, I fixed it now as follows.

E.g. in the clean.py script I added the following around line 160: if not os.path.dirname(args.out) == "": os.chdir(os.path.dirname(args.out))

[needed to add an 'import os' to clean.py as well.]

Similarly for annotate.py, sort.py and predict.py, I added the following just after args = parser.parse_args(args):

if not os.path.dirname(args.input) == "": os.chdir(os.path.dirname(args.input))

This avoids the issue that all the different funannotate clean (etc.) runs are initiated in the same folder by snakemake.
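
A minimal sketch of the working-directory workaround described above, wrapped in a small helper. The function name chdir_to_parent is hypothetical and not part of funannotate; it only mirrors the os.path.dirname / os.chdir check from the snippets above.

```python
import os


def chdir_to_parent(path):
    """Change into the directory part of `path`, if it has one.

    Returns the directory that was switched to, or None when `path` is a
    bare file name and the working directory is left unchanged.
    """
    parent = os.path.dirname(path)
    if parent:
        # Same check as: if not os.path.dirname(path) == "": os.chdir(...)
        os.chdir(parent)
        return parent
    return None
```

One thing to keep in mind with this approach: once the working directory has changed, any other relative path given on the command line resolves against the new directory, so passing absolute paths is the safer choice.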

BenjaminSchwessinger, May 07 '21 10:05

Thanks @BenjaminSchwessinger -- I see that perhaps sort/clean should have some tmpdir which I can add with a unique tag. But I don't know about predict? Can you give me a little more insight as to how these jobs are being launched? Typically funannotate runs from whatever folder you launched it from, I think most file paths should get corrected (I hope).

nextgenusfs, May 07 '21 17:05

Thanks Jon.

So for clean, a temp folder would be good, as snakemake executes all the runs in the same folder, so the "minimap_tmp*", "query_" and "reference_" files of different runs overwrite each other. Having a tmp folder would also be great for making use of the fast I/O storage on a cluster.

For the predict step, the "p2g_*" folder is a real bottleneck, as it cannot be restricted to fast I/O storage or a specific location. This is an issue because it has SOOOOO many temporary files that it hits file number limits really easily. So being able to redirect some of these to other locations would be great.

It all works well on individual samples; it is just scaling it up that is a real issue.

Thanks for the help.

BenjaminSchwessinger, May 10 '21 10:05

Okay that makes sense and also a good suggestion. So in other words ensure temp folders also have a command line option so you can redirect. In the past I've tried to keep those "files" in memory for exonerate instead of writing to file but could never get it to work properly. As with most things, I'm sure it could be improved.
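
As a rough illustration of a redirectable temp folder with a unique tag per run, here is a sketch using only the Python standard library. The --tmpdir flag and the make_run_tmpdir and parse_args helpers are hypothetical names for illustration; they are not existing funannotate options or functions.

```python
import argparse
import os
import tempfile


def make_run_tmpdir(base=None, tag="clean"):
    """Create a uniquely named scratch directory for one run.

    `base` is the parent directory (e.g. fast local storage); when it is
    None, the system default temp location is used.
    """
    if base is not None:
        os.makedirs(base, exist_ok=True)
    # mkdtemp appends a random suffix, so parallel runs never collide.
    return tempfile.mkdtemp(prefix=f"{tag}_", dir=base)


def parse_args(argv):
    parser = argparse.ArgumentParser(description="tmpdir option sketch")
    # Hypothetical flag, shown only to illustrate the idea.
    parser.add_argument(
        "--tmpdir",
        default=None,
        help="parent directory for per-run temporary folders",
    )
    return parser.parse_args(argv)
```

Each run would then call make_run_tmpdir(args.tmpdir) once and write its intermediate files there, so runs started from the same working directory no longer share file names.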

nextgenusfs, May 10 '21 13:05

Yes, pretty much. I found a workaround by tweaking your code a tiny bit, specifically for our cluster, and it works really well. Being able to use the fast I/O makes a huge difference, as this is the actual bottleneck on that system and not the compute. Thanks Jon.

BenjaminSchwessinger, May 10 '21 20:05

Ben, if you have a good example of the snakemake wrappers as they work, that would be great. I've resisted getting into that level of pipelining for our projects, as it adds a learning curve for the folks in my group trying to use the system, so we have simpler, shell-scripting-focused ways of executing. But it would be helpful to know how things have panned out in your hands with the recipes.

hyphaltip, May 12 '21 05:05

Sure thing, Jason. Let me clean it up and I'll put it on GitHub next week with a few instructions. I also minimally adapted the funannotate code to allow processing on the cluster. I will post here once it is up on GitHub with a small README.

BenjaminSchwessinger, May 12 '21 07:05

Hi everyone,

very interesting discussion going on here. I have also been using funannotate in a snakemake pipeline and had problems with temp directories at some point. For me, however, this depended on the cluster setup. In one case I remember that the compute nodes of the cluster did not have access to the temp directory. I never quite figured out whether this was a problem with funannotate or snakemake. I should say that I use funannotate in a Singularity container, which has additional restrictions on which places are writeable. One solution for me was to change the $TMPDIR variable inside individual rules (e.g. when using funannotate predict) and point it to a different location which is writeable.
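
To illustrate the $TMPDIR approach in general terms, here is a minimal Python sketch that launches a command with TMPDIR pointed at a writeable location for that one job only. The helper name run_with_tmpdir is hypothetical, and the command passed in is whatever the rule would normally run; nothing here is specific to funannotate or snakemake.

```python
import os
import subprocess
import sys


def run_with_tmpdir(cmd, tmpdir):
    """Run `cmd` with TMPDIR set to `tmpdir` for the child process only.

    The parent environment is left untouched, so other jobs started from
    the same process keep their own settings.
    """
    os.makedirs(tmpdir, exist_ok=True)
    # Copy the current environment and override TMPDIR in the copy.
    env = dict(os.environ, TMPDIR=tmpdir)
    return subprocess.run(
        cmd, env=env, check=True, capture_output=True, text=True
    )


def show_child_tmpdir(tmpdir):
    """Usage example: ask a child Python process which TMPDIR it sees."""
    result = run_with_tmpdir(
        [sys.executable, "-c", "import os; print(os.environ['TMPDIR'])"],
        tmpdir,
    )
    return result.stdout.strip()
```

Inside a snakemake rule, the same effect is usually achieved by exporting TMPDIR at the top of the rule's shell command before the tool is called.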

If you are interested, you can check it out here: https://github.com/reslp/smsi-funannotate. My pipeline uses Singularity containers of funannotate, eggnog-mapper and interproscan, and a wrapper bash script to submit many jobs to SLURM and SGE clusters. Currently it is set up to work with the two clusters I have access to, but with a few changes it should be possible to adapt it to other clusters as well. It is not super polished, but it has worked quite well for me and several of my students and colleagues. I would be happy to hear your thoughts on it.

best, Philipp

reslp, May 17 '21 10:05

Hi there. It took me a while to clean stuff up and our HPC was down as well.

So here is a very basic snakemake pipeline for funannotate, without RNA-seq data and such: https://github.com/BenjaminSchwessinger/snakemake_recipes/tree/main/recepies/genomeAnnotation

It also has the cluster.yaml and config.yaml in the same folder. There are a couple specific comments for your HPC in there as well.

It isn't perfect, as it doesn't make use of all the frills snakemake comes with, e.g. thread handling, but maybe it's a start. There is also a 'submit_onnode.sh' script that shows how the main script submits the snakemake jobs from the node. This is based on Kevin Murray's work from the Borevitz lab (now Weigel lab). Happy to answer any questions you might have. Hope it helps.

BenjaminSchwessinger, Jun 04 '21 05:06