
Bunny GDC support

Open sinisa88 opened this issue 9 years ago • 18 comments

Umbrella for specific issues regarding support for GDC workflow.

  • Tools with no inputs: #177

sinisa88 avatar Mar 02 '17 11:03 sinisa88

The merge_sqlite tool of the GDC DNASeq workflow depends upon this feature: https://github.com/rabix/bunny/issues/193

jeremiahsavage avatar Mar 07 '17 21:03 jeremiahsavage

The picard_mergesamfiles tool depends upon using $(self.basename) in the workflow ( https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl#L632 ).
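For reference, the pattern in question is a workflow step input whose `valueFrom` references `self` (the value coming from its `source`). A minimal sketch, with hypothetical step and port names rather than the actual gdc-dnaseq-cwl ones:

```yaml
# Sketch of the $(self.basename) pattern; names are illustrative.
steps:
  picard_mergesamfiles:
    run: ../../tools/picard_mergesamfiles.cwl
    in:
      INPUT: bam
      OUTPUT:
        source: bam
        # self is the File delivered by `source`; its basename
        # becomes the output file name
        valueFrom: $(self.basename)
    out: [MERGED_OUTPUT]
```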

Issue filed as https://github.com/rabix/bunny/issues/197

jeremiahsavage avatar Mar 08 '17 18:03 jeremiahsavage

The samtools_idxstats_to_sqlite tool depends upon a literal valueFrom passed to the metrics subworkflow: https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl#L603
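The feature being relied on is a step input with no `source` at all, just a literal `valueFrom` handed to the subworkflow. A minimal sketch (input name and value are illustrative, not the actual ones from transform.cwl):

```yaml
# Sketch of a literal valueFrom on a subworkflow step input;
# there is no `source`, so the expression result is the value itself.
in:
  input_state:
    valueFrom: "markduplicates_readgroups"
```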

Issue filed as https://github.com/rabix/bunny/issues/202

jeremiahsavage avatar Mar 10 '17 19:03 jeremiahsavage

I've tried to run the GDC (transform.cwl) workflow with code from the feature/gdc branch, but got the following error:

zcat SRR622461_2.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_2.fq.gz: not found

The good news is that I got the same error with cwltool :) but I'm not sure what I should do to run this workflow successfully.
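The `: not found` error on the entire pipeline string is the classic symptom of a piped command line being exec'd directly instead of through a shell. In CWL, a tool whose command contains `|` needs ShellCommandRequirement plus an unquoted shell argument, roughly like this (a sketch mirroring the error message, not the actual tool definition):

```yaml
# Sketch: running a shell pipeline from a CWL tool.
# Without ShellCommandRequirement, the whole string is treated
# as a single executable name, producing "<pipeline>: not found".
requirements:
  - class: ShellCommandRequirement
arguments:
  - shellQuote: false
    valueFrom: |
      zcat $(inputs.fastq.path) | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > $(inputs.fastq.basename)
```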

StarvingMarvin avatar Mar 17 '17 16:03 StarvingMarvin

Thanks for testing.

I currently don't see that branch on our repo: https://github.com/NCI-GDC/gdc-dnaseq-cwl/tree/feature/gdc (gives 404)

The master branch should run transform.cwl without error using cwltool https://github.com/NCI-GDC/gdc-dnaseq-cwl/tree/master

If you are familiar with dockstore, that might be an easier way to run a workflow from the master branch: https://dockstore.org/workflows/NCI-GDC/gdc-dnaseq-cwl/GDC_DNASeq

jeremiahsavage avatar Mar 17 '17 18:03 jeremiahsavage

It's a bunny branch with some gdc-related fixes.

StarvingMarvin avatar Mar 17 '17 18:03 StarvingMarvin

Ok. If you've found some fixes needed in the cwl, I'd certainly like to look at them for possible incorporation.

For now, the command below is tested to work with cwltool version 1.0.20170309164828:

mkdir tmp cache
nohup cwltool --debug --tmpdir-prefix tmp/ --cachedir cache/ --custom-net host https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/master/workflows/dnaseq/etl_http.cwl https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/master/workflows/dnaseq/etl_http_NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.json &

jeremiahsavage avatar Mar 17 '17 20:03 jeremiahsavage

@jeremiahsavage I tried running it that way too, but failed with the same error. Will continue debugging...

StarvingMarvin avatar Mar 20 '17 16:03 StarvingMarvin

@StarvingMarvin I reproduced your error. It was the same error I encountered here: https://github.com/rabix/bunny/issues/140

Which was fixed in the gdc cwl in this commit: https://github.com/NCI-GDC/gdc-dnaseq-cwl/pull/45

It looks like cwltool has become more strict in properly catching this error, while it used to be more lenient. A current checkout from master should fix this.

jeremiahsavage avatar Mar 20 '17 17:03 jeremiahsavage

bunny hangs when attempting to merge BAMs from multiple arrays. Issue filed as https://github.com/rabix/bunny/issues/215

jeremiahsavage avatar Mar 20 '17 23:03 jeremiahsavage

Yup. For some reason I had a gdc pipeline directory without .git, nested inside another directory that was a git repo, so when I did git pull I was updating the wrong thing... Anyhow, I got to the point of picard failing because it ran out of memory on my laptop, so I'll run it on another machine and then try again with bunny...

StarvingMarvin avatar Mar 21 '17 16:03 StarvingMarvin

I've implemented support for scattering over empty arrays and now Bunny executes the GDC workflow. I'll merge changes from bug/empty-list-scatter into the develop branch ASAP.
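For context, this is the kind of step that was hanging: a scatter over an array that may be empty, which per the CWL spec should simply run zero jobs and yield an empty output array. A minimal sketch with a hypothetical tool name:

```yaml
# Sketch of a scatter step; when `bams` is [], the engine should
# produce an empty `bai` array rather than blocking.
requirements:
  - class: ScatterFeatureRequirement
steps:
  index_bams:
    run: samtools_index.cwl   # hypothetical tool
    scatter: bam
    in:
      bam: bams
    out: [bai]
```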

Here are the results:

{
  "harmonized_bam" : {
    "basename" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "checksum" : "sha1$57ec46a349304aa38fcd9665ca8b3ac07f988c61",
    "class" : "File",
    "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates",
    "format" : "edam:format_2572",
    "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "nameext" : "bam",
    "nameroot" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn",
    "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bam",
    "secondaryFiles" : [ {
      "basename" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "checksum" : "sha1$c7892e603ed183288df74680ea8451a2e82502d1",
      "class" : "File",
      "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates",
      "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "nameext" : "bai",
      "nameroot" : "NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn",
      "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/picard_markduplicates/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211_gdc_realn.bai",
      "secondaryFiles" : [ ],
      "size" : 4280832
    } ],
    "size" : 314389542
  },
  "sqlite" : {
    "basename" : "123e4567-e89b-12d3-a456-426655440000.db",
    "checksum" : "sha1$a90f05691220aebec4ce7c05fafaa0271567ccc2",
    "class" : "File",
    "dirname" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite",
    "format" : "edam:format_3621",
    "location" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite/123e4567-e89b-12d3-a456-426655440000.db",
    "nameext" : "db",
    "nameroot" : "123e4567-e89b-12d3-a456-426655440000",
    "path" : "/home/janko.simonovic/rabix/rabix-backend-local-1.0.0-rc3/gdc-dnaseq-cwl/workflows/dnaseq/83fca9a7-95f5-4cfd-8df3-f2b179263f9c/root/transform/merge_all_sqlite/123e4567-e89b-12d3-a456-426655440000.db",
    "secondaryFiles" : [ ],
    "size" : 2518016
  }
}

simonovic86 avatar Apr 05 '17 08:04 simonovic86

One minor issue we had to fix: NCI-GDC/gdc-dnaseq-cwl#51

StarvingMarvin avatar Apr 05 '17 10:04 StarvingMarvin

@simonovic86 Fantastic! That fix also enabled me to run the transform to completion. I ran the bunny-generated metrics/sqlite file through a validator, which showed the BAM file contains the same alignments generated with cwltool. It's a highly parallel engine. Thank you.

I'm trying next to run our internal ETL process, which wraps the transform with curl and aws cp commands. But it looks like, by default, I can't get network traffic out of the docker containers launched by bunny (Could not resolve host errors). With cwltool, I can add --custom-net host to the command line, which is converted to a docker run --net=host parameter. I've looked through the config options https://github.com/rabix/bunny/blob/master/rabix-backend-local/config/core.properties but don't see one that tells docker to use host networking. Is there a way to set that? Thanks.

jeremiahsavage avatar Apr 06 '17 01:04 jeremiahsavage

@jeremiahsavage bunny doesn't do anything to prevent network connectivity from the container. Let's take a step back and figure out what the actual problem is. I would dare to guess that the thing you are trying to fetch is on an internal network. If so, then either the docker daemon is configured to give containers some public DNS server, or maybe a resolv.conf file ended up inside the image and is messing things up.

If I'm understanding docker networking correctly, the only place where using the host network should matter is when binding ports. In that case, relying on command line options to make it work would compromise the app's portability, so I'm a bit disappointed that the reference implementation does this. The proper way would be to introduce a new requirement for it, or to extend DockerRequirement.
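As it happens, a requirement-based approach was standardized later: CWL v1.1 added a NetworkAccess requirement, which lets a tool declare that it needs the network without pinning any particular docker networking mode. A sketch (v1.1+ syntax, not available to draft-3/v1.0 engines of this era):

```yaml
# CWL v1.1+: declare network access as a portable requirement
# instead of a command-line flag on the engine.
cwlVersion: v1.1
class: CommandLineTool
requirements:
  NetworkAccess:
    networkAccess: true
baseCommand: [curl, -fsSL]
inputs:
  url:
    type: string
    inputBinding:
      position: 1
outputs: []
```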

StarvingMarvin avatar Apr 06 '17 14:04 StarvingMarvin

@StarvingMarvin The docker image in this case is minimal: https://github.com/NCI-GDC/curl_docker/blob/master/Dockerfile

I agree specifying this in cwl would probably be the best way to go. For now, I am able to get this tool to work with the following change to bunny. https://github.com/jeremiahsavage/bunny/commit/4cfec1200bb30feaed1f0f2a712e8b25e5fb4a67

As documented at https://docs.docker.com/engine/userguide/networking/, I think "bridge" is the default mode.

jeremiahsavage avatar Apr 06 '17 19:04 jeremiahsavage

@jeremiahsavage I'm still confused about what it is about your networking situation that demands the host network. What are you trying to fetch that can't be accessed through a bridge network? Which of the two commands fails, curl or aws cp? What is the DNS setting of the docker daemon?

StarvingMarvin avatar Apr 07 '17 07:04 StarvingMarvin

@StarvingMarvin We have had to use host networking instead of bridge networking ever since switching to https://apt.dockerproject.org/repo/ instead of the older docker 1.6 in Ubuntu's Trusty. It seems to be a regression in that build, or a change made in docker after version 1.6, that is preventing us from using bridge.

curl is the command that reliably fails. I believe aws will fail, as well. But there is a separate issue there I am narrowing down.

jeremiahsavage avatar Apr 07 '17 21:04 jeremiahsavage