Add CCHFV to nextclade_data.

Open anna-parker opened this issue 1 year ago • 3 comments

Using the https://github.com/neherlab/CCHFV repository and NCBI Virus, I was able to create Nextclade datasets for CCHFV, which can then be used by nextclade run.

Auspice trees for the three segments can be built:

    • independently from each other
    • dependent on each other (choosing only samples with all segments to allow for the creation of tanglegrams)
    • dependent on each other with additional recombination site inference (using TreeKnit to infer ARGs, which lets us better estimate branch lengths).

I chose the second option for now - but this can be changed.
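As a rough illustration of the second option, here is a minimal sketch of keeping only the samples that have all three segments. It is not the actual workflow from the repository; the sample IDs and the `samples_with_all_segments` helper are hypothetical, and the segment names M and L are assumed alongside S.

```python
def samples_with_all_segments(samples_by_segment):
    """Return the sorted sample IDs that are present in every segment."""
    id_sets = [set(ids) for ids in samples_by_segment.values()]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Hypothetical sample IDs per segment (assumption: segments are named S, M, L).
samples_by_segment = {
    "S": ["sampleA", "sampleB", "sampleC"],
    "M": ["sampleA", "sampleC"],
    "L": ["sampleA", "sampleC", "sampleD"],
}

complete = samples_with_all_segments(samples_by_segment)
print(complete)  # ['sampleA', 'sampleC']
```

Only the samples returned here would go into each segment's tree, so every tip has a counterpart in the other two trees when drawing tanglegrams.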

Additionally, I chose to name only 3 genes: RdRp (RNA-dependent RNA polymerase, product: putative polyprotein), GCP (product: glycoprotein precursor), and NP (product: nucleoprotein).

We may also want to name the non-structural S protein (NSS); details are in https://www.mdpi.com/1999-4915/8/4/106.

anna-parker avatar May 15 '24 18:05 anna-parker

@anna-parker I fixed a few bugs in file declarations and added changelogs. We disabled automated CI for forks due to security concerns, so I pushed the processed files (data_output/) myself.

The datasets can be accessed if you provide the --server CLI arg or the dataset-server URL param:

https://clades.nextstrain.org/?dataset-server=gh:anna-parker/nextclade_data@cchfv@data_output&dataset-name=nextstrain/cchfv/linked/S
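For reference, the URL above is just the Nextclade Web address plus two query parameters. A minimal sketch of assembling it is below; the `nextclade_web_url` helper is hypothetical, and only the S path is taken from this thread (paths for the other segments would be an assumption).

```python
from urllib.parse import urlencode


def nextclade_web_url(dataset_server, dataset_name,
                      base="https://clades.nextstrain.org/"):
    """Build a Nextclade Web URL pointing at a custom dataset server."""
    query = urlencode(
        {"dataset-server": dataset_server, "dataset-name": dataset_name},
        safe=":/@",  # keep the gh: shortcut and the dataset path readable
    )
    return f"{base}?{query}"


url = nextclade_web_url(
    "gh:anna-parker/nextclade_data@cchfv@data_output",
    "nextstrain/cchfv/linked/S",
)
print(url)
```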

If you have access to the nextstrain org, then it makes sense to work directly in the nextstrain/nextclade_data repo. This way checks will run automatically. If you don't have it, Richard can probably arrange it.

ivan-aksamentov avatar May 15 '24 21:05 ivan-aksamentov

Please thoroughly consider:

  • which collection/organization you want this dataset to be in. Right now it's in the nextstrain collection, even though you are pushing from a fork. For third parties we recommend using the community collection. This is mostly political, and it avoids dramas like: who will be allowed to make changes? Who will maintain it? Who decides what the clades/lineages are if there's no consensus?
  • the path of each dataset, in particular in relation to which clades/flavors/hosts are there now and which ones you want to add in the future. This is a technical and bioinformatics decision. Paths are immutable: you cannot change paths or delete datasets later. See the docs/ for more details.
  • if the dataset is not well tested and/or if there are any concerns with regard to quality or correctness, then it is appropriate to set .attributes.experimental = true in pathogen.json
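To make the last point concrete, here is a minimal sketch of setting that flag programmatically. The `mark_experimental` helper and the example `pathogen` content are hypothetical; a real pathogen.json contains many more fields, which this leaves untouched.

```python
import json


def mark_experimental(pathogen):
    """Return a copy of the pathogen.json data with attributes.experimental set to true."""
    updated = dict(pathogen)
    attributes = dict(updated.get("attributes", {}))
    attributes["experimental"] = True
    updated["attributes"] = attributes
    return updated


# Hypothetical, heavily trimmed pathogen.json content for illustration only.
pathogen = {"attributes": {"name": "CCHFV segment S"}}

print(json.dumps(mark_experimental(pathogen), indent=2))
```

In practice you would `json.load` the dataset's pathogen.json, pass it through the helper, and `json.dump` the result back to the same file.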

ivan-aksamentov avatar May 15 '24 21:05 ivan-aksamentov

Thanks so much @ivan-aksamentov! @rneher do you have any concerns about CCHFV being in the nextstrain collection? We can also discuss offline if that is easier.

anna-parker avatar May 16 '24 05:05 anna-parker

Ummm, just raising a flag that a rebase or something might have gone awry here? Unless 209 commits / 1099 changed files / 8+ million lines added is the expected diff…

genehack avatar Feb 04 '25 22:02 genehack

Oh, thanks for pointing that out. I didn't think I had actually pushed anything. I will just close this PR and start again; I think that is easier at this point.

anna-parker avatar Feb 05 '25 06:02 anna-parker