Not a valid path value when the genome is provided as gzip-compressed fasta and gtf and no genome index or gene BED is given
If the genome is provided via fasta and gtf, the workflow will try to generate the genome index and gene BED when they are not supplied. This works fine when the fasta and gtf are uncompressed. However, if either of them is compressed (gzip in my case) and the corresponding derived file (i.e. the genome index or gene BED) is not provided, I get errors like the following:
Not a valid path value: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz'
Not a valid path value: '../genome/Mus_musculus.GRCm38.102.gtf.gz'
My params.yaml is:
input: './samplesheet.csv'
outdir: './results/'
fasta: '../genome/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz'
gtf: '../genome/Mus_musculus.GRCm38.102.gtf.gz'
narrow_peak: true
aligner: 'bowtie2'
read_length: 150
max_memory: '50.GB'
max_cpus: 24
save_reference: true
My command is:
HTTPS_PROXY=http://localhost:1081 nextflow run $PWD/../pipeline/nf-core-chipseq/2_1_0/main.nf -profile singularity -resume -params-file params.yaml
Both paths should be valid, because everything works fine if I first uncompress 'Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz' and 'Mus_musculus.GRCm38.102.gtf.gz'.
I also found some related issues in other nf-core workflows:
https://github.com/nf-core/rnaseq/issues/1311
https://github.com/nf-core/atacseq/issues/277
https://github.com/nf-core/cutandrun/issues/187
Thank you.
The same happens for the --gff option.
@ljw20180420
Cause
I've managed to fix this bug. In v2.1.0 the gffread module was upgraded to match the latest nf-core/rnaseq version of gffread, but the channel handling in prepare_genome.nf was not correspondingly updated to match the inputs expected by the upgraded gffread process:
v2.1.0: https://github.com/nf-core/chipseq/blob/master/modules/nf-core/gffread/main.nf
In v2.0.0 the gffread process expected just the path of the GFF file:
input:
    path gff
In v2.1.0, however, the process expects more inputs:
input:
    tuple val(meta), path(gff)
    path fasta
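A v2.0.0-style call site therefore no longer matches the process signature. As a rough sketch of the difference (the channel names and meta id here are illustrative, not the pipeline's actual code):

// v2.0.0-style invocation: a single channel of bare GFF paths
GFFREAD ( ch_gff )

// v2.1.0-style invocation: (meta, gff) tuples plus a separate fasta input
GFFREAD ( ch_gff.map { gff_file -> [ [ id: 'genome' ], gff_file ] }, ch_fasta )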
Solution
The solution was to update the channel handling code in prepare_genome.nf to match these new input requirements of gffread: https://github.com/nf-core/chipseq/blob/master/subworkflows/local/prepare_genome.nf
//
// Uncompress GTF annotation file or create from GFF3 if required
//
if (gtf) {
    if (gtf.endsWith('.gz')) {
        ch_gtf      = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
    } else {
        ch_gtf = Channel.value(file(gtf))
    }
} else if (gff) {
    if (gff.endsWith('.gz')) {
        ch_gff      = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
    } else {
        ch_gff = Channel.value(file(gff))
    }

    // Create a channel emitting tuples of (meta, gff); the meta id is the
    // first dot-separated token of the file name
    ch_gff = ch_gff.map { gff_file ->
        def meta = [ id: gff_file.baseName.tokenize('.')[0] ]
        tuple(meta, gff_file)
    }

    // Pass the (meta, gff) channel to GFFREAD along with ch_fasta, then
    // strip the meta map so ch_gtf always emits a bare GTF path
    ch_gtf      = GFFREAD ( ch_gff, ch_fasta ).gtf.map { it[1] }
    ch_versions = ch_versions.mix(GFFREAD.out.versions)
}
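To make the meta construction above concrete: for a hypothetical GFF named Mus_musculus.GRCm38.102.gff, the map closure behaves as follows (illustrative values only):

// gff_file.baseName                  -> 'Mus_musculus.GRCm38.102'
// gff_file.baseName.tokenize('.')[0] -> 'Mus_musculus'
// meta                               -> [ id: 'Mus_musculus' ]
// element passed to GFFREAD          -> [ [ id: 'Mus_musculus' ], gff_file ]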
Just replace your prepare_genome.nf with the following code (NOTE: I also fixed the channel handling for GTF2BED here):
//
// Uncompress and prepare reference genome files
//
include {
    GUNZIP as GUNZIP_FASTA
    GUNZIP as GUNZIP_GTF
    GUNZIP as GUNZIP_GFF
    GUNZIP as GUNZIP_GENE_BED
    GUNZIP as GUNZIP_BLACKLIST
} from '../../modules/nf-core/gunzip/main'
include {
    UNTAR as UNTAR_BWA_INDEX
    UNTAR as UNTAR_BOWTIE2_INDEX
    UNTAR as UNTAR_STAR_INDEX
} from '../../modules/nf-core/untar/main'
include { UNTARFILES } from '../../modules/nf-core/untarfiles/main'
include { GFFREAD } from '../../modules/nf-core/gffread/main'
include { CUSTOM_GETCHROMSIZES } from '../../modules/nf-core/custom/getchromsizes/main'
include { BWA_INDEX } from '../../modules/nf-core/bwa/index/main'
include { BOWTIE2_BUILD } from '../../modules/nf-core/bowtie2/build/main'
include { CHROMAP_INDEX } from '../../modules/nf-core/chromap/index/main'
include { GTF2BED } from '../../modules/local/gtf2bed'
include { GENOME_BLACKLIST_REGIONS } from '../../modules/local/genome_blacklist_regions'
include { STAR_GENOMEGENERATE } from '../../modules/local/star_genomegenerate'
workflow PREPARE_GENOME {
    take:
    genome             // string: genome name
    genomes            // map: genome attributes
    prepare_tool_index // string: tool to prepare index for
    fasta              // file: /path/to/genome.fasta
    gtf                // file: /path/to/genome.gtf
    gff                // file: /path/to/genome.gff
    blacklist          // file: /path/to/blacklist.bed
    gene_bed           // file: /path/to/gene.bed
    bwa_index          // file: /path/to/bwa/index/
    bowtie2_index      // file: /path/to/bowtie2/index/
    chromap_index      // file: /path/to/chromap/index/
    star_index         // file: /path/to/star/index/

    main:
    ch_versions = Channel.empty()

    //
    // Uncompress genome fasta file if required
    //
    if (fasta.endsWith('.gz')) {
        ch_fasta    = GUNZIP_FASTA ( [ [:], fasta ] ).gunzip.map { it[1] }
        ch_versions = ch_versions.mix(GUNZIP_FASTA.out.versions)
    } else {
        ch_fasta = Channel.value(file(fasta))
    }
    //
    // Uncompress GTF annotation file or create from GFF3 if required
    //
    if (gtf) {
        if (gtf.endsWith('.gz')) {
            ch_gtf      = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
        } else {
            ch_gtf = Channel.value(file(gtf))
        }
    } else if (gff) {
        if (gff.endsWith('.gz')) {
            ch_gff      = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
        } else {
            ch_gff = Channel.value(file(gff))
        }

        // Create a channel emitting tuples of (meta, gff); the meta id is the
        // first dot-separated token of the file name
        ch_gff = ch_gff.map { gff_file ->
            def meta = [ id: gff_file.baseName.tokenize('.')[0] ]
            tuple(meta, gff_file)
        }

        // Pass the (meta, gff) channel to GFFREAD along with ch_fasta, then
        // strip the meta map so ch_gtf always emits a bare GTF path
        ch_gtf      = GFFREAD ( ch_gff, ch_fasta ).gtf.map { it[1] }
        ch_versions = ch_versions.mix(GFFREAD.out.versions)
    }
    //
    // Uncompress blacklist file if required
    //
    ch_blacklist = Channel.empty()
    if (blacklist) {
        if (blacklist.endsWith('.gz')) {
            ch_blacklist = GUNZIP_BLACKLIST ( [ [:], blacklist ] ).gunzip.map { it[1] }
            ch_versions  = ch_versions.mix(GUNZIP_BLACKLIST.out.versions)
        } else {
            ch_blacklist = Channel.value(file(blacklist))
        }
    }
    //
    // Uncompress gene BED annotation file or create from GTF if required
    //
    def make_bed = false
    if (!gene_bed) {
        make_bed = true
    } else if (genome && gtf) {
        if (genomes[genome].gtf != gtf) {
            make_bed = true
        }
    }
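    // make_bed is true when no gene BED was supplied, or when a genome key is
    // set but its bundled GTF was overridden, so any bundled BED may not match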
    if (make_bed) {
        // ch_gtf always emits a bare GTF path (see above), so it can be
        // passed to GTF2BED directly
        ch_gene_bed = GTF2BED ( ch_gtf ).bed
        ch_versions = ch_versions.mix(GTF2BED.out.versions)
    } else {
        if (gene_bed.endsWith('.gz')) {
            ch_gene_bed = GUNZIP_GENE_BED ( [ [:], gene_bed ] ).gunzip.map { it[1] }
            ch_versions = ch_versions.mix(GUNZIP_GENE_BED.out.versions)
        } else {
            ch_gene_bed = Channel.value(file(gene_bed))
        }
    }
    //
    // Create chromosome sizes file
    //
    CUSTOM_GETCHROMSIZES ( ch_fasta.map { [ [:], it ] } )
    ch_chrom_sizes = CUSTOM_GETCHROMSIZES.out.sizes.map { it[1] }
    ch_fai         = CUSTOM_GETCHROMSIZES.out.fai.map { it[1] }
    ch_versions    = ch_versions.mix(CUSTOM_GETCHROMSIZES.out.versions)

    //
    // Prepare genome intervals for filtering by removing regions in blacklist file
    //
    ch_genome_filtered_bed = Channel.empty()
    GENOME_BLACKLIST_REGIONS (
        ch_chrom_sizes,
        ch_blacklist.ifEmpty([])
    )
    ch_genome_filtered_bed = GENOME_BLACKLIST_REGIONS.out.bed
    ch_versions            = ch_versions.mix(GENOME_BLACKLIST_REGIONS.out.versions)
    //
    // Uncompress BWA index or generate from scratch if required
    //
    ch_bwa_index = Channel.empty()
    if (prepare_tool_index == 'bwa') {
        if (bwa_index) {
            if (bwa_index.endsWith('.tar.gz')) {
                ch_bwa_index = UNTAR_BWA_INDEX ( [ [:], bwa_index ] ).untar
                ch_versions  = ch_versions.mix(UNTAR_BWA_INDEX.out.versions)
            } else {
                ch_bwa_index = [ [:], file(bwa_index) ]
            }
        } else {
            ch_bwa_index = BWA_INDEX ( ch_fasta.map { [ [:], it ] } ).index
            ch_versions  = ch_versions.mix(BWA_INDEX.out.versions)
        }
    }

    //
    // Uncompress Bowtie2 index or generate from scratch if required
    //
    ch_bowtie2_index = Channel.empty()
    if (prepare_tool_index == 'bowtie2') {
        if (bowtie2_index) {
            if (bowtie2_index.endsWith('.tar.gz')) {
                ch_bowtie2_index = UNTAR_BOWTIE2_INDEX ( [ [:], bowtie2_index ] ).untar
                ch_versions      = ch_versions.mix(UNTAR_BOWTIE2_INDEX.out.versions)
            } else {
                ch_bowtie2_index = [ [:], file(bowtie2_index) ]
            }
        } else {
            ch_bowtie2_index = BOWTIE2_BUILD ( ch_fasta.map { [ [:], it ] } ).index
            ch_versions      = ch_versions.mix(BOWTIE2_BUILD.out.versions)
        }
    }

    //
    // Uncompress CHROMAP index or generate from scratch if required
    //
    ch_chromap_index = Channel.empty()
    if (prepare_tool_index == 'chromap') {
        if (chromap_index) {
            if (chromap_index.endsWith('.tar.gz')) {
                ch_chromap_index = UNTARFILES ( [ [:], chromap_index ] ).files
                ch_versions      = ch_versions.mix(UNTARFILES.out.versions)
            } else {
                ch_chromap_index = [ [:], file(chromap_index) ]
            }
        } else {
            ch_chromap_index = CHROMAP_INDEX ( ch_fasta.map { [ [:], it ] } ).index
            ch_versions      = ch_versions.mix(CHROMAP_INDEX.out.versions)
        }
    }

    //
    // Uncompress STAR index or generate from scratch if required
    //
    ch_star_index = Channel.empty()
    if (prepare_tool_index == 'star') {
        if (star_index) {
            if (star_index.endsWith('.tar.gz')) {
                ch_star_index = UNTAR_STAR_INDEX ( [ [:], star_index ] ).untar.map { it[1] }
                ch_versions   = ch_versions.mix(UNTAR_STAR_INDEX.out.versions)
            } else {
                ch_star_index = Channel.value(file(star_index))
            }
        } else {
            ch_star_index = STAR_GENOMEGENERATE ( ch_fasta, ch_gtf ).index
            ch_versions   = ch_versions.mix(STAR_GENOMEGENERATE.out.versions)
        }
    }
    emit:
    fasta         = ch_fasta                  // path: genome.fasta
    fai           = ch_fai                    // path: genome.fai
    gtf           = ch_gtf                    // path: genome.gtf
    gene_bed      = ch_gene_bed               // path: gene.bed
    chrom_sizes   = ch_chrom_sizes            // path: genome.sizes
    filtered_bed  = ch_genome_filtered_bed    // path: *.include_regions.bed
    bwa_index     = ch_bwa_index              // path: bwa/index/
    bowtie2_index = ch_bowtie2_index          // path: bowtie2/index/
    chromap_index = ch_chromap_index          // path: genome.index
    star_index    = ch_star_index             // path: star/index/
    versions      = ch_versions.ifEmpty(null) // channel: [ versions.yml ]
}
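For context, a call site matching the take: block above would look roughly like this (a sketch only; the actual argument values come from the pipeline's main workflow and its params):

PREPARE_GENOME (
    params.genome,
    params.genomes,
    prepare_tool_index,   // e.g. 'bowtie2', derived from params.aligner
    params.fasta,
    params.gtf,
    params.gff,
    params.blacklist,
    params.gene_bed,
    params.bwa_index,
    params.bowtie2_index,
    params.chromap_index,
    params.star_index
)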
Thank you very much. It seems a little complex for me. Do you have a plan to fix this directly in the repository https://github.com/nf-core/chipseq/blob/master/subworkflows/local/prepare_genome.nf so that others can be rid of this bug for good?
@ljw20180420 I am not part of their dev team, but there are multiple other errors I am debugging. So far I have run into 5-6 bugs, and I plan to open a PR for each fix for them to incorporate.