raredisease Update CADD to version 1.7

CADD version 1.7 has been released. Among other updates, the scoring now also uses information from protein language models. See paper here

Pre-computed scores can be found here https://cadd.bihealth.org/download

Feb 06 '24 15:02 jemten

I've done some work on trying to package CADD v1.7 and have not succeeded. I'd just like to post the information here, in case it may be helpful.

The approach with CADD v1.6.post1 was to run the documented command to download the conda environments into the docker container (https://github.com/BioContainers/containers/blob/60ba043b6e419b33b385d9cc4f22375a69890d84/cadd-scripts-with-envs/1.6.post1/Dockerfile#L45). In version 1.7 it didn't download all of the necessary environments, so I couldn't use the same approach.

Recently, CADD released v1.7.1. CADD now supports using singularity images for the snakemake pipeline instead of conda environments. The release also fixes some other bugs and adds a docker image with the ~~singularity images~~ conda environments.

Attempt 1: Based on CADD's docker image: https://github.com/fa2k/BioContainers-fork/blob/cadd-1.7/cadd-scripts-with-envs/1.7.1/Dockerfile - does not successfully load the conda environments because they exist at the wrong path.

Attempt 2: Create conda environments manually in a loop: https://github.com/fa2k/BioContainers-fork/blob/cadd-1.7/cadd-scripts-with-envs/1.7.1/Dockerfile-full

Attempt 2 produces a 24 GB docker image that can successfully execute some CADD commands when combined with the modified cadd module here: https://github.com/fa2k/raredisease/blob/caddtest/modules/nf-core/cadd/main.nf

The linked CADD module contains some additional workarounds.

The current iteration crashes in snakemake rule annotate_regseq on command:


          python /opt/CADD-scripts-1.7.1/src/scripts/lib/tools/regulatorySequence/predictVariants.py \
              --variants /tmp/tmp.UBUhIOYTIu/NA12878_rhocall_vcfanno_filter_0004-scattered_indels.esm.vcf.gz \
              --model data/annotations/GRCh38_v1.7/regseq/Hyperopt400InclNegatives.json \
              --weights data/annotations/GRCh38_v1.7/regseq/Hyperopt400InclNegatives.h5 \
              --reference data/annotations/GRCh38_v1.7/regseq/reference.fa \
              --genome data/annotations/GRCh38_v1.7/regseq/reference.fa.genome \
              --output /tmp/tmp.UBUhIOYTIu/NA12878_rhocall_vcfanno_filter_0004-scattered_indels.regseq.vcf.gz \
              &> /tmp/tmp.UBUhIOYTIu/NA12878_rhocall_vcfanno_filter_0004-scattered_indels.annotate_regseq.log

with error message:

...
vcfpy.exceptions.IncorrectVCFFormat: Ill-formatted line starting with "#CHROM"

The input VCF to this rule is missing the FORMAT column.
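
In case it helps anyone reproduce this, here is a rough sketch of how the `#CHROM` line of an intermediate VCF could be inspected with only the Python standard library. It is not part of CADD or vcfpy; the function names `read_column_header` and `describe_header` are made up for illustration, and the check only looks at the column header, not at the records. Per the VCF specification, a sites-only file has just the eight fixed columns, and FORMAT is expected whenever sample columns follow.

```python
import gzip

# The eight fixed columns defined by the VCF specification.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_column_header(path):
    """Return the tab-separated fields of the #CHROM line, or None if absent."""
    # bgzip output is a valid multi-member gzip stream, so gzip.open can read it.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return line.rstrip("\n").split("\t")
            if not line.startswith("#"):
                break  # reached the records without seeing a column header
    return None


def describe_header(columns):
    """Summarise which of the fixed, FORMAT and sample columns are present."""
    if columns is None:
        return "no #CHROM line found"
    if columns[:8] != FIXED_COLUMNS:
        return "fixed columns are missing or out of order"
    if len(columns) == 8:
        return "sites-only header (no FORMAT, no samples)"
    if columns[8] != "FORMAT":
        return "sample columns present but FORMAT is missing"
    return f"FORMAT present with {len(columns) - 9} sample column(s)"
```

Calling `describe_header(read_column_header(path))` on the `.esm.vcf.gz` file from the temporary directory (the path will differ per run) should show whether the header has sample columns without FORMAT, or some other shape.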

I'm about to give up for a while on CADD, because there are too many problems. But I thought it may help to share this progress, and maybe someone has some tips for how to continue trying.

Aug 14 '24 15:08 fa2k

Thanks for taking the time to test this, @fa2k. We'll try to pick this up after the release of the next version.

Aug 19 '24 09:08 jemten