
Is there a user group?

Open tdlong opened this issue 7 years ago • 6 comments

  1. I was annotating a mammalian genome and the program crashed. There do not appear to be any intermediate files (beyond logs). Before I profile it: is it likely I simply went above 500 GB of RAM and should be running this on a high-memory node?

  2. I am getting lots and lots of warnings about reads having multiple hits and/or mapping to multiple chromosomes. This does not surprise me, since I am giving the program the BAM file from HISAT2 (and mammalian genomes have lots of pseudogenes, etc.). Should that BAM file be pre-processed to keep only "-q 30" read pairs? What is best practice?

tdlong avatar Jun 28 '18 18:06 tdlong

@tdlong Hi, I would love to create a user group. Thank you for the suggestion!

  1. How large is your BAM file? If the run went beyond RAM capacity, it can fail, but it might also be a memory-allocation bug. It would be great if you could share such a BAM file with me so I can look into it.

  2. Yep. "-q 10" should be enough; I would recommend it. I am currently working on a user guide but haven't quite finished. I will include best practices in it.

ruolin avatar Jun 29 '18 03:06 ruolin

My BAM file is 123 GB! I have a few flow cells of RNA-seq data from several different tissues that I wish to use to annotate a de novo genome assembly. I am now filtering the BAM file to include only high-quality mapped pairs. This should reduce the size (I will keep track of this!).

It sounds like the best thing for me to do is to re-run on the filtered BAM and memory-profile it while it is running. I will report back with the profile over time.

Thanks for writing the software! I think it is an important contribution as we see more and more de novo assemblies, and it is fairly cost-effective to get pretty good RNA-seq from several dozen tissues.

I guess one strategy would be to run the program on a small subset of the data and somehow subtract housekeeping genes that are already annotated from the BAM file. With these RNA-seq datasets the top 100 genes can soak up >25% of the reads.
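To put a number on that read-concentration point, the share taken by the top few genes can be computed from a per-gene count table; the genes and counts below are invented for illustration:

```shell
# Hypothetical per-gene read counts (gene<TAB>reads); numbers are made up.
# Sort descending by count, then report the fraction held by the top 2 genes.
printf 'Actb\t500\nGapdh\t300\nXist\t150\nLowexp\t50\n' \
  | sort -k2,2nr \
  | awk -F'\t' 'NR<=2{top+=$2} {tot+=$2} END{printf "%.0f%%\n", 100*top/tot}'
# prints: 80%
```

With real data you would feed in counts from e.g. `samtools idxstats` or a quantifier, and take the top 100 rows instead of 2.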

T.


tdlong avatar Jun 29 '18 04:06 tdlong

@tdlong First of all, thanks for using Strawberry and giving me feedback; I really appreciate it. Please let me know after you profile it. I am willing to work with you to fix any issues/bugs that you find.

For your purpose, I agree: if you have a very large BAM, you can select a subset using a known set of loci. But those highly expressed genes might still be of interest to you.

I am considering a feature to process the BAM file on the fly to avoid such memory problems. I am also interested in knowing whether you have problems running other assemblers, like Cufflinks or StringTie, on this data.

ruolin avatar Jul 02 '18 04:07 ruolin

It is on the big-memory node now.

Here is the script I submitted to my SGE queuing software.

```
#$ -N strawberry
#$ -q bigmemory
#$ -pe openmp 80
#$ -R y

module purge
module load samtools/1.8-11
module load perl/5.16.2
module load java/1.8.0.51

#/usr/bin/time -v samtools index hisat2_out/mouse.RNAseq.filter.sort.bam
#echo "finished samtools run at $(date)"
/usr/bin/time -v ./Strawberry/bin/strawberry hisat2_out/mouse.RNAseq.filter.sort.bam -o strawberry_June26 -p 80
```

Oddly, profiling the job shows it using only ~6 cores, not the 80 I passed to the program. It is also using those cores rather unevenly.

Funny: depending on which measure you use, it is either using 605% of a core (~6 cores, via ps):

or 1 core (via top)

But note the time consumed in top: 1955 min, or about 33 hours. Divided by 6, that's ~5.4 hours, approximately the time it has been running, modulo the period when it wasn't running in parallel. So that's odd; top is definitely underestimating the load.
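The sanity check above can be written out explicitly: top's TIME+ column is cumulative CPU time summed over all threads, so dividing by the effective core count should recover roughly the wall-clock time (numbers taken from this run):

```shell
# 1955 CPU-minutes spread over ~6 effective cores should equal the wall-clock hours elapsed
awk 'BEGIN { cpu_min = 1955; cores = 6; printf "%.1f h\n", cpu_min / cores / 60 }'
# prints: 5.4 h
```

Since 5.4 h matches the elapsed run time, the ~605% figure from ps is the one to trust.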


tdlong avatar Jul 02 '18 22:07 tdlong

@tdlong Currently the parallelization can be improved a lot: the current multithreading has a large dispatch overhead, so I am not very surprised to see the low CPU load. For now I recommend using at most 10 cores (-p 10) so you don't waste resources. Better multithreading is a feature I am working on right now.

ruolin avatar Jul 03 '18 05:07 ruolin

It seems to do a pretty good job. In some cases my old Trinity -> Augustus pipeline is a little closer to the exonerate predictions; in other cases HISAT2 -> Strawberry is.

When I filtered the input BAM file the program did not crash. Here is my "pipeline" for your reference.

```
# foreach RNAseq_experiment:
hisat2 -p 8 -x $TREF -1 $R1 -2 $R2 | samtools view -Sbo hisat2_out/$samplename.bam -
samtools sort -o hisat2_out/$samplename.sort.bam hisat2_out/$samplename.bam
```

merge into one big file

```
ls hisat2_out/*.sort.bam > bamfiles.tomerge.txt
bamtools merge -list bamfiles.tomerge.txt -out hisat2_out/RNAseq.bam
samtools sort -o hisat2_out/RNAseq.sort.bam hisat2_out/RNAseq.bam
```

filter out poorly mapped reads

```
samtools view -b -f 0x2 -q 30 hisat2_out/RNAseq.sort.bam > hisat2_out/RNAseq.filter.sort.bam
samtools index hisat2_out/RNAseq.filter.sort.bam
```
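For reference, the `-f 0x2` filter above keeps records whose FLAG field has the "read mapped in proper pair" bit set. A toy illustration of that bit test (the FLAG values below are just examples):

```shell
# SAM FLAG values are bitfields; bit 0x2 = read mapped in proper pair.
for flag in 99 4 163 89; do
  if [ $(( flag & 0x2 )) -ne 0 ]; then
    echo "$flag kept"      # proper-pair bit set
  else
    echo "$flag dropped"   # proper-pair bit clear
  fi
done
```

So a typical properly paired read (FLAG 99 or 163) passes, while an unmapped read (4) or a pair where the mate is odd (89) is discarded.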

The filtered BAM file is 75 GB, roughly half the size of the unfiltered one.

```
./Strawberry/bin/strawberry hisat2_out/RNAseq.filter.sort.bam -o strawberry_June26 -p 8
cd strawberry_June26/
```

I want to visualize in SCGB

```
module load ucsc-tools/jan-19-2016
gtfToGenePred assembled_transcripts.gtf strawberry.Gp
genePredToBed strawberry.Gp strawberry.BED12
```

bed formatting

```
sort -k1,1 -k2,2n strawberry.BED12 > temp.temp
sizes=".../PP.chrom.sizes"
bedToBigBed -type=bed12 temp.temp $sizes strawberry.bigBed -tab
```
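bedToBigBed requires its input sorted by chromosome then start, which is what the `sort -k1,1 -k2,2n` step provides. A quick way to verify a BED file really is in that order, using `sort -c` on a small made-up file:

```shell
# Toy BED3 records, already in chrom/start order.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t50\t90\n' > sorted.bed
# sort -c exits 0 only if the file is already sorted by the given keys.
sort -c -k1,1 -k2,2n sorted.bed && echo "sorted OK"
rm sorted.bed
```

If the check fails, bedToBigBed would reject the file with an out-of-order error.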


tdlong avatar Jul 05 '18 15:07 tdlong