Not a valid path value when the genome is provided as gzip-compressed fasta and gtf and no genome index or gene BED is given
If the genome is provided via fasta and gtf, the workflow will try to generate the genome index and gene BED when they are not supplied. This works fine when the fasta and gtf are uncompressed. However, if either of them is compressed (gzip in my case) and the corresponding derived file (i.e. the genome index or gene BED) is not provided, I get errors like the following:
Not a valid path value: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz'
Not a valid path value: '../genome/Mus_musculus.GRCm38.102.gtf.gz'
My params.yaml is:
input: './samplesheet.csv'
outdir: './results/'
fasta: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz'
gtf: '../genome/Mus_musculus.GRCm38.102.gtf.gz'
narrow_peak: true
aligner: 'bowtie2'
read_length: 150
max_memory: '50.GB'
max_cpus: 24
save_reference: true
My command is:
HTTPS_PROXY=http://localhost:1081 nextflow run $PWD/../pipeline/nf-core-chipseq/2_1_0/main.nf -profile singularity -resume -params-file params.yaml
Both paths should be valid, because everything works fine if I first uncompress 'Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz' and 'Mus_musculus.GRCm38.102.gtf.gz'.
I also found some related issues in other nf-core workflows:
https://github.com/nf-core/rnaseq/issues/1311
https://github.com/nf-core/atacseq/issues/277
https://github.com/nf-core/cutandrun/issues/187
Thank you.
The same happens for the --gff option.
@ljw20180420
Cause
I've managed to fix this bug. In v2.1.0 the gffread module was upgraded to match the latest nf-core/rnaseq version of gffread, but the channel handling in prepare_genome.nf was not correspondingly updated to match the inputs expected by the upgraded gffread process:
v2.1.0: https://github.com/nf-core/chipseq/blob/master/modules/nf-core/gffread/main.nf
In v2.0.0 the gffread process expected just the path of the GFF file:
input:
    path gff
In v2.1.0, however, the process expects more inputs:
input:
    tuple val(meta), path(gff)
    path fasta
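A v2.0.0-style call site therefore no longer matches the process signature. As a rough sketch of the difference (the channel names and meta id here are illustrative, not the pipeline's actual code):

// v2.0.0-style invocation: a single channel of bare GFF paths
GFFREAD ( ch_gff )

// v2.1.0-style invocation: (meta, gff) tuples plus a separate fasta input
GFFREAD ( ch_gff.map { gff_file -> [ [ id: 'genome' ], gff_file ] }, ch_fasta )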
Solution
The solution was to update the channel handling code in prepare_genome.nf to match these new input requirements of gffread: https://github.com/nf-core/chipseq/blob/master/subworkflows/local/prepare_genome.nf
//
// Uncompress GTF annotation file or create from GFF3 if required
//
if (gtf) {
    if (gtf.endsWith('.gz')) {
        ch_gtf      = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
    } else {
        ch_gtf = Channel.value(file(gtf))
    }
} else if (gff) {
    if (gff.endsWith('.gz')) {
        ch_gff      = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
    } else {
        ch_gff = Channel.value(file(gff))
    }

    // Create a channel emitting tuples of (meta, gff); the meta id is the
    // first dot-separated token of the file name
    ch_gff = ch_gff.map { gff_file ->
        def meta = [ id: gff_file.baseName.tokenize('.')[0] ]
        tuple(meta, gff_file)
    }

    // Pass the (meta, gff) channel to GFFREAD along with ch_fasta, then
    // strip the meta map so ch_gtf always emits a bare GTF path
    ch_gtf      = GFFREAD ( ch_gff, ch_fasta ).gtf.map { it[1] }
    ch_versions = ch_versions.mix(GFFREAD.out.versions)
}
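To make the meta construction above concrete: for a hypothetical GFF named Mus_musculus.GRCm38.102.gff, the map closure behaves as follows (illustrative values only):

// gff_file.baseName                  -> 'Mus_musculus.GRCm38.102'
// gff_file.baseName.tokenize('.')[0] -> 'Mus_musculus'
// meta                               -> [ id: 'Mus_musculus' ]
// element passed to GFFREAD          -> [ [ id: 'Mus_musculus' ], gff_file ]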
Just replace your prepare_genome.nf with the following code (NOTE: I also fixed the channel handling for GTF2BED here):
//
// Uncompress and prepare reference genome files
//
include {
    GUNZIP as GUNZIP_FASTA
    GUNZIP as GUNZIP_GTF
    GUNZIP as GUNZIP_GFF
    GUNZIP as GUNZIP_GENE_BED
    GUNZIP as GUNZIP_BLACKLIST
} from '../../modules/nf-core/gunzip/main'
include {
    UNTAR as UNTAR_BWA_INDEX
    UNTAR as UNTAR_BOWTIE2_INDEX
    UNTAR as UNTAR_STAR_INDEX
} from '../../modules/nf-core/untar/main'
include { UNTARFILES } from '../../modules/nf-core/untarfiles/main'
include { GFFREAD } from '../../modules/nf-core/gffread/main'
include { CUSTOM_GETCHROMSIZES } from '../../modules/nf-core/custom/getchromsizes/main'
include { BWA_INDEX } from '../../modules/nf-core/bwa/index/main'
include { BOWTIE2_BUILD } from '../../modules/nf-core/bowtie2/build/main'
include { CHROMAP_INDEX } from '../../modules/nf-core/chromap/index/main'
include { GTF2BED } from '../../modules/local/gtf2bed'
include { GENOME_BLACKLIST_REGIONS } from '../../modules/local/genome_blacklist_regions'
include { STAR_GENOMEGENERATE } from '../../modules/local/star_genomegenerate'
workflow PREPARE_GENOME {
    take:
    genome             // string: genome name
    genomes            // map: genome attributes
    prepare_tool_index // string: tool to prepare index for
    fasta              // file: /path/to/genome.fasta
    gtf                // file: /path/to/genome.gtf
    gff                // file: /path/to/genome.gff
    blacklist          // file: /path/to/blacklist.bed
    gene_bed           // file: /path/to/gene.bed
    bwa_index          // file: /path/to/bwa/index/
    bowtie2_index      // file: /path/to/bowtie2/index/
    chromap_index      // file: /path/to/chromap/index/
    star_index         // file: /path/to/star/index/

    main:
    ch_versions = Channel.empty()

    //
    // Uncompress genome fasta file if required
    //
    if (fasta.endsWith('.gz')) {
        ch_fasta    = GUNZIP_FASTA ( [ [:], fasta ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_FASTA.out.versions)
    } else {
        ch_fasta = Channel.value(file(fasta))
    }
    //
    // Uncompress GTF annotation file or create from GFF3 if required
    //
    if (gtf) {
        if (gtf.endsWith('.gz')) {
            ch_gtf      = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
        } else {
            ch_gtf = Channel.value(file(gtf))
        }
    } else if (gff) {
        if (gff.endsWith('.gz')) {
            ch_gff      = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
        } else {
            ch_gff = Channel.value(file(gff))
        }

        // Create a channel emitting tuples of (meta, gff); the meta id is the
        // first dot-separated token of the file name
        ch_gff = ch_gff.map { gff_file ->
            def meta = [ id: gff_file.baseName.tokenize('.')[0] ]
            tuple(meta, gff_file)
        }

        // Pass the (meta, gff) channel to GFFREAD along with ch_fasta, then
        // strip the meta map so ch_gtf always emits a bare GTF path
        ch_gtf      = GFFREAD ( ch_gff, ch_fasta ).gtf.map { it[1] }
        ch_versions = ch_versions.mix(GFFREAD.out.versions)
    }
    //
    // Uncompress blacklist file if required
    //
    ch_blacklist = Channel.empty()
    if (blacklist) {
        if (blacklist.endsWith('.gz')) {
            ch_blacklist = GUNZIP_BLACKLIST ( [ [:], blacklist ] ).gunzip.map { it[1] }
            ch_versions  = ch_versions.mix(GUNZIP_BLACKLIST.out.versions)
        } else {
            ch_blacklist = Channel.value(file(blacklist))
        }
    }
    //
    // Uncompress gene BED annotation file or create from GTF if required
    //
    def make_bed = false
    if (!gene_bed) {
        make_bed = true
    } else if (genome && gtf) {
        if (genomes[genome].gtf != gtf) {
            make_bed = true
        }
    }
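    // make_bed is true when no gene BED was supplied, or when a genome key is
    // set but its bundled GTF was overridden, so any bundled BED may not match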
    if (make_bed) {
        // ch_gtf always emits a bare GTF path (see above), so it can be
        // passed to GTF2BED directly
        ch_gene_bed = GTF2BED ( ch_gtf ).bed
        ch_versions = ch_versions.mix(GTF2BED.out.versions)
    } else {
        if (gene_bed.endsWith('.gz')) {
            ch_gene_bed = GUNZIP_GENE_BED ( [ [:], gene_bed ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GENE_BED.out.versions)
        } else {
            ch_gene_bed = Channel.value(file(gene_bed))
        }
    }
    //
    // Create chromosome sizes file
    //
    CUSTOM_GETCHROMSIZES ( ch_fasta.map { [ [:], it ] } )
    ch_chrom_sizes = CUSTOM_GETCHROMSIZES.out.sizes.map { it[1] }
    ch_fai         = CUSTOM_GETCHROMSIZES.out.fai.map { it[1] }
    ch_versions    = ch_versions.mix(CUSTOM_GETCHROMSIZES.out.versions)

    //
    // Prepare genome intervals for filtering by removing regions in blacklist file
    //
    ch_genome_filtered_bed = Channel.empty()
    GENOME_BLACKLIST_REGIONS (
        ch_chrom_sizes,
        ch_blacklist.ifEmpty([])
    )
    ch_genome_filtered_bed = GENOME_BLACKLIST_REGIONS.out.bed
    ch_versions            = ch_versions.mix(GENOME_BLACKLIST_REGIONS.out.versions)
    //
    // Uncompress BWA index or generate from scratch if required
    //
    ch_bwa_index = Channel.empty()
    if (prepare_tool_index == 'bwa') {
        if (bwa_index) {
            if (bwa_index.endsWith('.tar.gz')) {
                ch_bwa_index = UNTAR_BWA_INDEX ( [ [:], bwa_index ] ).untar
                ch_versions  = ch_versions.mix(UNTAR_BWA_INDEX.out.versions)
            } else {
                ch_bwa_index = [ [:], file(bwa_index) ]
            }
        } else {
            ch_bwa_index = BWA_INDEX ( ch_fasta.map { [ [:], it ] } ).index
            ch_versions  = ch_versions.mix(BWA_INDEX.out.versions)
        }
    }

    //
    // Uncompress Bowtie2 index or generate from scratch if required
    //
    ch_bowtie2_index = Channel.empty()
    if (prepare_tool_index == 'bowtie2') {
        if (bowtie2_index) {
            if (bowtie2_index.endsWith('.tar.gz')) {
                ch_bowtie2_index = UNTAR_BOWTIE2_INDEX ( [ [:], bowtie2_index ] ).untar
                ch_versions      = ch_versions.mix(UNTAR_BOWTIE2_INDEX.out.versions)
            } else {
                ch_bowtie2_index = [ [:], file(bowtie2_index) ]
            }
        } else {
            ch_bowtie2_index = BOWTIE2_BUILD ( ch_fasta.map { [ [:], it ] } ).index
            ch_versions      = ch_versions.mix(BOWTIE2_BUILD.out.versions)
        }
    }

    //
    // Uncompress CHROMAP index or generate from scratch if required
    //
    ch_chromap_index = Channel.empty()
    if (prepare_tool_index == 'chromap') {
        if (chromap_index) {
            if (chromap_index.endsWith('.tar.gz')) {
                ch_chromap_index = UNTARFILES ( [ [:], chromap_index ] ).files
                ch_versions      = ch_versions.mix(UNTARFILES.out.versions)
            } else {
                ch_chromap_index = [ [:], file(chromap_index) ]
            }
        } else {
            ch_chromap_index = CHROMAP_INDEX ( ch_fasta.map { [ [:], it ] } ).index
            ch_versions      = ch_versions.mix(CHROMAP_INDEX.out.versions)
        }
    }

    //
    // Uncompress STAR index or generate from scratch if required
    //
    ch_star_index = Channel.empty()
    if (prepare_tool_index == 'star') {
        if (star_index) {
            if (star_index.endsWith('.tar.gz')) {
                ch_star_index = UNTAR_STAR_INDEX ( [ [:], star_index ] ).untar.map { it[1] }
                ch_versions   = ch_versions.mix(UNTAR_STAR_INDEX.out.versions)
            } else {
                ch_star_index = Channel.value(file(star_index))
            }
        } else {
            ch_star_index = STAR_GENOMEGENERATE ( ch_fasta, ch_gtf ).index
            ch_versions   = ch_versions.mix(STAR_GENOMEGENERATE.out.versions)
        }
    }
    emit:
    fasta         = ch_fasta                  // path: genome.fasta
    fai           = ch_fai                    // path: genome.fai
    gtf           = ch_gtf                    // path: genome.gtf
    gene_bed      = ch_gene_bed               // path: gene.bed
    chrom_sizes   = ch_chrom_sizes            // path: genome.sizes
    filtered_bed  = ch_genome_filtered_bed    // path: *.include_regions.bed
    bwa_index     = ch_bwa_index              // path: bwa/index/
    bowtie2_index = ch_bowtie2_index          // path: bowtie2/index/
    chromap_index = ch_chromap_index          // path: genome.index
    star_index    = ch_star_index             // path: star/index/
    versions      = ch_versions.ifEmpty(null) // channel: [ versions.yml ]
}
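For context, a call site matching the take: block above would look roughly like this (a sketch only; the actual argument values come from the pipeline's main workflow and its params):

PREPARE_GENOME (
    params.genome,
    params.genomes,
    prepare_tool_index,   // e.g. 'bowtie2', derived from params.aligner
    params.fasta,
    params.gtf,
    params.gff,
    params.blacklist,
    params.gene_bed,
    params.bwa_index,
    params.bowtie2_index,
    params.chromap_index,
    params.star_index
)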
Thank you very much. It seems a little complex for me. Do you have a plan to fix this directly in the repository https://github.com/nf-core/chipseq/blob/master/subworkflows/local/prepare_genome.nf so that others can be rid of this bug for good?
@ljw20180420 I am not part of their dev team, but there are multiple other errors I am debugging. So far I have run into 5-6 bugs, and I plan to open a PR for each fix for them to incorporate.