
Fusion symlink resolution doesn't work with directories

Open bentsherman opened this issue 2 years ago • 4 comments

We recently added some logic to handle Fusion symlinks, which happen when an input file is included in a published output. We use the .fusion.symlinks file to detect and resolve any Fusion symlinks when they are published.

However, when publishing a directory, Nextflow does not walk the directory tree and publish each file individually; it publishes the directory as a single unit. If Fusion symlinks were staged into this directory, they will not be detected or resolved.

Given a list of input files:

$ ls -1 freshdesk/fd-4463/files/ 
1.txt
2.txt
3.txt
4.txt
5.txt

The following pipeline script will demonstrate this issue:

process AGGREGATE {
  container "quay.io/nextflow/bash"
  publishDir "results", mode: "copy"

  input:
  path(samples), stageAs: 'AnalysisFiles/'

  output:
  path("*")

  script:
  """
  for name in AnalysisFiles/*.txt; do
    touch AnalysisFiles/Analysis_on_\$(basename \${name} .txt)
  done
  """
}

workflow {
    AGGREGATE( files("files/*") )
}

One workaround that I tried is to make the outputs more explicit:

  output:
  path("AnalysisFiles/*.txt", includeInputs: true)
  path("AnalysisFiles/Analysis_on_*")

But this doesn't quite work because of a small bug in the symlink resolution. I will submit a patch so that at least this workaround works.

A more permanent solution would be to walk the directory tree and publish each file explicitly instead of only publishing the directory (#3933). This change has other benefits as well; see #3372.
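
To make the idea of per-file publishing concrete, here is a rough Python sketch. It is not Nextflow's implementation: ordinary POSIX symlinks stand in for Fusion symlinks (which Nextflow actually detects through the .fusion.symlinks file), and publish_tree is a hypothetical helper name chosen for illustration.

```python
import os
import shutil


def publish_tree(source_dir, target_dir):
    """Publish every file under source_dir individually (hypothetical helper).

    Assumption: plain POSIX symlinks stand in for Fusion symlinks.
    Symlinked directories are not descended into in this sketch.
    """
    published = []
    for root, _dirs, files in os.walk(source_dir):
        # Recreate the same relative directory layout under target_dir.
        rel_root = os.path.relpath(root, source_dir)
        dest_root = os.path.normpath(os.path.join(target_dir, rel_root))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            # Resolve the link per file, so the real content is copied
            # rather than a link pointing outside the task directory.
            real_src = os.path.realpath(src)
            dest = os.path.join(dest_root, name)
            shutil.copyfile(real_src, dest)
            published.append(os.path.relpath(dest, target_dir))
    return sorted(published)
```

Because each file is visited on its own, a staged input such as AnalysisFiles/1.txt gets the same link resolution as a file published at the top level, which is the step that is skipped when the directory is published as a whole.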

bentsherman avatar Feb 08 '24 18:02 bentsherman

Here is another workaround that works with the current version of Nextflow:

process AGGREGATE {
  container "quay.io/nextflow/bash"
  publishDir "results/AnalysisFiles", mode: "copy"

  input:
  path("*.txt")

  output:
  path("*.txt", includeInputs: true)
  path("Analysis_on_*")

  script:
  """
  for name in *.txt; do
    touch Analysis_on_\$(basename \${name} .txt)
  done
  """
}

So there is something to be said for keeping the task directory as flat as possible and using the publish mechanism to create the desired directory structure. Still, this may not be possible or easy with certain tools, so I would still like to fix the underlying problem in Nextflow.

bentsherman avatar Feb 08 '24 18:02 bentsherman

Hi,

I'm leaving this message to encourage finding a solution to this issue.

In our case, we have created a wrapper for a Python pipeline so that each step can run with different computational resources under Nextflow, without modifying the original code. The directory output of each step is used as input for the next step. The folder structure is complex and depends on various parameters, so the suggested workarounds may not solve our problem.

I can always write a bash script that gets the directory path from the trace and places it into the output folder, but this solution is not ideal, and I would really like Nextflow and Fusion to do it for me.

IreneRobles avatar Feb 14 '24 16:02 IreneRobles

Hi Irene, this issue will be solved by #3933 and #4726. We are just waiting for the PRs to be approved and merged.

bentsherman avatar Feb 14 '24 16:02 bentsherman

That is fantastic, thank you so much for the update!

IreneRobles avatar Feb 14 '24 16:02 IreneRobles

Hi, I just noticed that my output directories are now being published correctly in the output folder when using Fusion. I think the software must have been patched.

IreneRobles avatar Feb 22 '24 10:02 IreneRobles

Did you implement a workaround?

bentsherman avatar Feb 22 '24 13:02 bentsherman

This occurred before I implemented any workaround. Note that publish_dir_mode has to be copy; copyNoFollow does not work.

IreneRobles avatar Feb 22 '24 15:02 IreneRobles