GISAID covCLI bugfixes & BioSample/SRA modifications

Open erikwolfsohn opened this issue 1 year ago • 0 comments

Edit: I added some BioSample/SRA workflow changes and attached my config.yaml and metadata files in case you want to test against them: example_config.yaml.txt, example_metadata.csv

Hey Dakota! Thanks for getting this new release out. The new handling for BioSample packages is amazing.

I've tested the BioSample/SRA and GISAID covCLI workflows. The NCBI workflows worked perfectly, but I ran into a bunch of submission failures testing covCLI.

I made some modifications and now my GISAID submissions are going through reliably. I haven't had time to do any serious testing so I can't say if all these changes will hold up, but I wanted to go ahead and submit a pull request in case there's anything that might be helpful.

🛠️ covCLI bugfixes

  • Modify several interactions with the metadata sheet column headers
  • Adjust logfile parsing to properly capture the sample name and EPI ID from the submission log
  • Address issues causing some functions to exit prematurely when passed empty inputs (when only GISAID isolates or GISAID segments are present)

📋 BioSample & SRA updates

  • Drop empty bs- & sra- columns before metadata validation. Pandera ignores optional columns that don't exist, but if they exist and contain no data, it attempts to validate them and the workflow fails. If a mandatory column has been left empty by the user and is removed by this change, pandera will still catch it.
  • If bs-description is absent, create it by joining values from organism, bs-host, and bs-host_disease. bs-description isn't in the Submission Wizard template and is not required by NCBI for submission, but the workflow fails when it isn't present.
  • Warn the user if performing BioSample and SRA submissions together with Link_Sample_Between_NCBI_Databases disabled. SRA submission is likely to fail in this situation unless the user has added BioSample accessions manually beforehand.
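
For illustration, the three BioSample/SRA changes above can be sketched roughly as follows. This is a minimal pandas sketch, not the actual diff: the function names, the ", " separator used for bs-description, and passing the Link_Sample_Between_NCBI_Databases setting in as a plain boolean are all assumptions made for the example.

```python
import warnings

import pandas as pd


def _is_empty(col: pd.Series) -> bool:
    """True if every value is missing or a blank/whitespace-only string."""
    return bool(col.dropna().astype(str).str.strip().eq("").all())


def drop_empty_prefixed_columns(metadata, prefixes=("bs-", "sra-")):
    """Drop bs-/sra- columns that carry no data (hypothetical helper).

    Removing them lets the validator treat them as absent optional columns;
    a removed mandatory column is still reported as missing by validation.
    """
    empty = [
        c for c in metadata.columns
        if str(c).startswith(prefixes) and _is_empty(metadata[c])
    ]
    return metadata.drop(columns=empty)


def ensure_bs_description(metadata, sep=", "):
    """Create bs-description from other columns if it is absent.

    The separator is an assumption; the source only says the values are joined.
    """
    if "bs-description" in metadata.columns:
        return metadata
    sources = ("organism", "bs-host", "bs-host_disease")
    present = [c for c in sources if c in metadata.columns]
    if not present:
        # Nothing to build from; leave it to validation to report the problem.
        return metadata
    out = metadata.copy()
    out["bs-description"] = [
        sep.join(str(v) for v in row if pd.notna(v) and str(v).strip())
        for row in out[present].itertuples(index=False, name=None)
    ]
    return out


def warn_if_unlinked(biosample: bool, sra: bool, link_enabled: bool) -> bool:
    """Warn when BioSample and SRA are submitted together without linking.

    link_enabled stands in for the Link_Sample_Between_NCBI_Databases setting.
    Returns True if a warning was issued.
    """
    if biosample and sra and not link_enabled:
        warnings.warn(
            "BioSample and SRA are being submitted together with "
            "Link_Sample_Between_NCBI_Databases disabled; SRA submission is "
            "likely to fail unless BioSample accessions were added manually.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

The column cleanup would run before validation, followed by the bs-description fill, with the warning check done once the config has been read.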

Testing data was generated via:

python seqsender.py test_data --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/

Metadata and config templates were created with the Submission Wizard Shiny app.

And the workflow was run with this command:

python seqsender.py submit --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/ --submission_name COV_TEST_DATA --config_file test_data/CCPHL/cov_ccphl_config.yaml --metadata_file test_data/CCPHL/meta2.csv --fasta_file test_data/CCPHL/sequence.fasta --test

erikwolfsohn, Aug 15 '24 12:08