Not working for UCSC fasta and annotation
Hi team,
thanks for developing this wonderful tool. I just want to confirm if it is designed not to support UCSC genome and annotation files?
I found if I use hg38.fa and TxDb.Hsapiens.UCSC.hg38.knownGene, the bambu function gave me an error.
But if I supplied it with Homo_sapiens.GRCh38.dna.primary_assembly.fa and Homo_sapiens.GRCh38.109.gtf, it run without error.
In both case, the BAM files were generated using the corresponding fasta with the same parameter of minimap2, eg
minimap2 -ax splice $fasta SGNex_MCF7_cDNA_replicate1_run3/SGNex_MCF7_cDNA_replicate1_run3.fastq.gz -t 24 -I 1000G | samtools view -@ 24 -bSh - -o mcf7_rep1_run3_align.bam
samtools sort -@ 24 mcf7_rep1_run3_align.bam -o mcf7_rep1_run3_sorted.bam
samtools index -@ 24 mcf7_rep1_run3_sorted.bam
So if I run with hg38
test.bam <- 'mcf7_rep1_run3_sorted.bam'
fa.file <- '/home/users/allstaff/yan.a/yan.a/ref/hg38/hg38.fa'
system.time(bambuAnnotations <- keepStandardChromosomes(TxDb.Hsapiens.UCSC.hg38.knownGene) %>% prepareAnnotations)
system.time(se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file,
ncore = 1, quant = F, verbose = T, NDR=1,
rcOutDir = 'rcout'))
I got the error
Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error:
Error in relist(unlistData, partitioning): shape of 'skeleton' is not compatible with 'NROW(flesh)'
But if I run with ENSEMBL fasta
test.bam <- 'mcf7_rep1_run3_align_GRCh38.bam'
fa.file <- '/home/users/allstaff/yan.a/yan.a/ref/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
gtf.file <- '/home/users/allstaff/yan.a/yan.a/ref/GRCh38/Homo_sapiens.GRCh38.109.gtf'
bambuAnnotations <- prepareAnnotations(gtf.file)
system.time(se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file,
ncore = 1, quant = F, verbose = T, NDR=1,
rcOutDir = 'rcout'))
I finish with only some warning. Could you double check it? I hope it would support UCSC because then we can seamlessly integrate with the genomedb and txdb in bioconductor.
Thanks a lot. Cheers, Alex
Hi Alex,
Thanks for using Bambu. I did some digging and I realized that after using prepareAnnotations on TxDb.Hsapiens.UCSC.hg38.knownGene, 16% of the transcripts that do not have a gene id assigned. Bambu needs all transcripts to have a gene id to be able to function. I do not yet know if this is because of the annotations themselves or due to the bambu prepareAnnotations function and I will investigate this further.
However as a temporary measure so that you can use UCSC you can remove these annotations
bambuAnnotations = bambuAnnotations[which(!is.na(mcols(bambuAnnotations)$GENEID))]
Alternatively you could assign your own arbitrary gene ids to these to keep them in the annotations
mcols(bambuAnnotations[which(is.na(mcols(bambuAnnotations)$GENEID))])$GENEID = seq_len(sum(is.na(mcols(bambuAnnotations)$GENEID)))
Hope this helps for now,
Kind Regards, Andre Sim