
Adds pattern describing how to publish files from channel using collectFile

Open winni2k opened this issue 7 years ago • 11 comments

winni2k avatar Aug 09 '18 09:08 winni2k

Thanks for this contribution. However, my understanding is that this is a variation of the collect-into-file pattern with the storeDir parameter added, isn't it?

In that case I would just add a note to the collect-into-file pattern.

pditommaso avatar Aug 09 '18 14:08 pditommaso

Maybe. If so, then the problem statement of collect-into-file is too narrow and would need adjusting. I think a code example would also be helpful.

All I can say is that as a new user of nextflow, I did not find that the collect-into-file pattern, or any pattern, helped me figure out how to publish files from a channel into a directory 🤷‍♀️

Perhaps an example in the documentation showing the use of collectFile with storeDir might be sufficient?

winni2k avatar Aug 09 '18 20:08 winni2k

After reading a bit more around the storeDir directive, it looks like what I really want is a publishDir method so that I could write something like unzipped_ch.publishDir('unzipped_files'). Could that be done?

winni2k avatar Aug 09 '18 21:08 winni2k

Sorry for the late reply. I had a summer break.

I'm still a bit lost in this thread. My understanding is that you want a process's outputs to be stored in a specific directory. The best way to achieve that is to use the publishDir directive in the process definition, optionally with a pattern to select only specific files.

However, in this PR you are suggesting the collectFile operator. It can be used for that, but it is intended more for collecting multiple files into a single one.

pditommaso avatar Aug 24 '18 10:08 pditommaso

First of all: I am a new user of nextflow, so it's likely that I am confused.

What I would like to do is apply arbitrary transformations to the files in a channel, and then just publish the result to a directory without building a process.

I think this works:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .set {final_output_channel}

process dump_final_output_channel {
    publishDir 'my_results'
    input: 
    file output_file from final_output_channel

    "echo ignore this message"
}

But I find this cleaner:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .collectFile(storeDir: 'my_results')

I would find this even cleaner:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .storeDir('my_results')

winni2k avatar Aug 24 '18 11:08 winni2k

I see. In that case, the second snippet is the way to go:

Channel.fromPath('reads/*_1.fq.gz')
    .filter()
    .another_transformation()
    .collectFile(storeDir: 'my_results')

In this case, if you want to contribute a pattern, you need to clarify that you want to store the result of a chain of operators (not the direct output of a process).

pditommaso avatar Aug 24 '18 13:08 pditommaso

Excellent! I think we're getting somewhere. Let me revise this PR...

winni2k avatar Aug 24 '18 16:08 winni2k

@jimhavrilla that's great! I forgot that this PR was still open!

winni2k avatar Jun 26 '20 10:06 winni2k

Actually, @jimhavrilla could you perhaps post a snippet of the code that you ended up writing? Perhaps we could use it in this pattern.

winni2k avatar Jun 26 '20 10:06 winni2k

Sure, super rough, but:

    process filter {
        executor='sge'
        queue='all.q'
        clusterOptions = '-V -cwd -l virtual_free=8G -S /bin/bash'

        tag "${chr}"

        input:
        set chr, sample, bgen from bgen_ch

        output:
        file "${chr}filter.pgen" into pgen_ch
        file "${chr}filter.pvar" into pvar_ch
        file "${chr}filter.psam" into psam_ch

        shell:
        '''
        plink2 --bgen !{bgen} ref-first --sample !{sample} --keep keep.txt --remove remove.txt --make-pgen --out !{chr}filter --memory 8000
        '''
    }

    pgen_ch
        .collectFile(storeDir: storepath)
    pvar_ch
        .collectFile(storeDir: storepath)
    psam_ch
        .collectFile(storeDir: storepath)

jimhavrilla avatar Jun 26 '20 20:06 jimhavrilla

Quick question: why is the Channel.collectFile() parameter called "storeDir" and not "publishDir"? Among the process directives, "publishDir" is for saving output files, and "storeDir" is for permanent caching of files. I was able to accomplish my goal of "publishing" my files from collectFile using the "storeDir" parameter, but it took me a while to figure out that this was the parameter to use, because I did not expect "storeDir" to be meant for final output files.

dstrib avatar Aug 28 '20 12:08 dstrib