Add CCHFV to nextclade_data.
Using the https://github.com/neherlab/CCHFV repository and NCBI Virus, I was able to create nextclade_data datasets for CCHFV, which can then be used by nextclade run.
Auspice trees for the three segments can be built:
- independently from each other
- dependent on each other (choosing only samples with all segments to allow for the creation of tanglegrams)
- dependent on each other with additional recombination site inference (using TreeKnit to infer ARGs, we can better estimate branch lengths).
I chose the second option for now - but this can be changed.
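The sample selection behind the second option amounts to a set intersection over strain names. A minimal Python sketch, assuming each segment's samples are available as a list of strain identifiers; the function name and the example identifiers are hypothetical and not taken from the actual workflow:

```python
def strains_with_all_segments(strains_by_segment):
    """Return the sorted strain names that are present in every segment."""
    strain_sets = [set(names) for names in strains_by_segment.values()]
    if not strain_sets:
        return []
    return sorted(set.intersection(*strain_sets))


# Hypothetical example input: strain names observed for each segment.
example = {
    "S": ["strainA", "strainB", "strainC"],
    "M": ["strainA", "strainC"],
    "L": ["strainA", "strainC", "strainD"],
}
linked = strains_with_all_segments(example)
print(linked)
```

Only strains that appear in all three lists survive, so every tip can be matched across the segment trees when drawing tanglegrams.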
Additionally, I chose to name only 3 genes: RdRp (RNA-dependent RNA polymerase, product: putative polyprotein), GCP (product: glycoprotein precursor), and NP (product: nucleoprotein).
Potentially we would like to also name the non-structural S protein (NSS), details in https://www.mdpi.com/1999-4915/8/4/106.
@anna-parker I fixed a few bugs in file declarations and added changelogs. We disabled automated CI for forks due to security concerns, so I pushed the processed files (data_output/) myself.
The datasets can be accessed if you provide the --server CLI arg or the dataset-server URL param:
https://clades.nextstrain.org/?dataset-server=gh:anna-parker/nextclade_data@cchfv@data_output&dataset-name=nextstrain/cchfv/linked/S
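For reference, here is a small Python sketch of how such a link could be assembled. It assumes, based only on the example URL above, that the `gh:` value has the shape `gh:<owner>/<repo>@<branch>@<path>`; the helper names are hypothetical and not part of Nextclade:

```python
from urllib.parse import urlencode


def parse_gh_shorthand(shorthand):
    """Split 'gh:<owner>/<repo>@<branch>@<path>' into its parts (assumed shape)."""
    if not shorthand.startswith("gh:"):
        raise ValueError("expected a 'gh:' prefix")
    repo, branch, path = shorthand[len("gh:"):].split("@", 2)
    owner, name = repo.split("/", 1)
    return {"owner": owner, "repo": name, "branch": branch, "path": path}


def nextclade_web_url(dataset_server, dataset_name,
                      base="https://clades.nextstrain.org/"):
    """Build a Nextclade Web link with dataset-server and dataset-name params."""
    query = urlencode(
        {"dataset-server": dataset_server, "dataset-name": dataset_name},
        safe=":/@",  # keep the shorthand readable instead of percent-encoding it
    )
    return f"{base}?{query}"


server = "gh:anna-parker/nextclade_data@cchfv@data_output"
parts = parse_gh_shorthand(server)
url = nextclade_web_url(server, "nextstrain/cchfv/linked/S")
print(parts)
print(url)
```

Swapping the last path component of the dataset name would give the links for the other segments, assuming they follow the same naming scheme.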
If you have access to the nextstrain org, then it makes sense to work directly in the nextstrain/nextclade_data repo. This way checks will run automatically. If you don't have access, Richard can probably arrange it.
Please thoroughly consider:
- which collection/organization you want this dataset to be in. Right now it's in the `nextstrain` collection, even though you are pushing from a fork. For third parties we recommend using the `community` collection. This is mostly political, and to avoid dramas like: who will be allowed to make changes? who will maintain it? who decides what the clades/lineages are if there's no consensus?
- the path of each dataset, in particular in relation to which clades/flavors/hosts are there now and which ones you want to add in the future. This is a technical & bioinformatics decision. Paths are immutable: you cannot change paths or delete datasets later. See the `docs/` for more details.
- if the dataset is not well tested and/or if there are any concerns with regard to quality or correctness, then it is appropriate to set `.attributes.experimental = true` in `pathogen.json`
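Setting the experimental flag can be done by hand or scripted. A minimal Python sketch, assuming `pathogen.json` is a plain JSON object whose `attributes` key may or may not exist yet; the function name is hypothetical:

```python
import json
from pathlib import Path


def mark_experimental(pathogen_json_path):
    """Set .attributes.experimental = true in a pathogen.json file, in place."""
    path = Path(pathogen_json_path)
    pathogen = json.loads(path.read_text(encoding="utf-8"))
    # Create the "attributes" object if it is missing, then set the flag.
    pathogen.setdefault("attributes", {})["experimental"] = True
    path.write_text(json.dumps(pathogen, indent=2) + "\n", encoding="utf-8")
    return pathogen
```

Calling `mark_experimental("pathogen.json")` in each dataset directory rewrites the file with two-space indentation, so expect a formatting-only diff if the original file used a different layout.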
Thanks so much @ivan-aksamentov! @rneher do you have any concerns about CCHFV being in the nextstrain collection? We can also discuss offline if that is easier.
Ummm, just raising a flag that a rebase or something might have gone awry here? Unless 209 commits / 1099 changed files / 8+ million lines added is the expected diff…
Oh, thanks for pointing that out. I didn't think I had actually pushed anything. I will just close this PR and start again; I think that is easier at this point.