GISAID covCLI bugfixes & BioSample/SRA modifications
Edit: I added some BioSample/SRA workflow changes and attached my config.yaml and metadata files in case you want to test against them: example_config.yaml.txt, example_metadata.csv
Hey Dakota! Thanks for getting this new release out. The new handling for BioSample packages is amazing.
I've tested the BioSample/SRA and GISAID covCLI workflows. The NCBI workflows worked perfectly, but I ran into a bunch of submission failures testing covCLI.
I made some modifications and now my GISAID submissions are going through reliably. I haven't had time to do any serious testing so I can't say if all these changes will hold up, but I wanted to go ahead and submit a pull request in case there's anything that might be helpful.
🛠️ covCLI bugfixes
- Modify several interactions with the metadata sheet column headers
- Adjust logfile parsing to properly capture the sample name and EPI ID from the submission log
- Address issues causing some functions to exit prematurely when passed empty inputs (when only GISAID isolates or GISAID segments are present)
📋 BioSample & SRA updates
- Drop empty bs- & sra- columns before metadata validation. Pandera ignores optional columns that don't exist, but if they exist and contain no data it will attempt to validate them and the workflow will fail. If a mandatory column has been left empty by the user and is removed by this change, pandera will still catch it.
- If `bs-description` is absent, create it by joining values from `organism`, `bs-host`, `bs-host_disease`. `bs-description` isn't in the Submission Wizard template and is not required by NCBI for submission, but the workflow will fail when it isn't present.
- Warn the user if performing BioSample and SRA submissions together with `Link_Sample_Between_NCBI_Databases` disabled. SRA submission is likely to fail in this situation unless the user has added BioSample accessions manually beforehand.
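
For anyone skimming, here is a rough sketch of the idea behind the three BioSample/SRA changes above. It is not the actual diff: the function names are hypothetical, the `", "` separator and the definition of "empty" (all values missing or whitespace) are my assumptions, and the warning helper takes plain booleans rather than reading the real config structure.

```python
import warnings

import pandas as pd


def drop_empty_prefixed_columns(df, prefixes=("bs-", "sra-")):
    """Drop bs-/sra- columns whose values are all missing or whitespace.

    Hypothetical helper; "empty" here means NaN/None or a blank string.
    """
    empty = []
    for col in df.columns:
        if not str(col).startswith(prefixes):
            continue
        blank = df[col].isna() | (df[col].astype(str).str.strip() == "")
        if blank.all():
            empty.append(col)
    return df.drop(columns=empty)


def add_bs_description(
    df, sources=("organism", "bs-host", "bs-host_disease"), sep=", "
):
    """Create bs-description from the source columns if it is absent.

    The separator is an assumption; missing values are skipped.
    """
    present = [c for c in sources if c in df.columns]
    if "bs-description" in df.columns or not present:
        return df
    out = df.copy()
    out["bs-description"] = [
        sep.join(str(v) for v in row if pd.notna(v))
        for row in out[present].itertuples(index=False, name=None)
    ]
    return out


def warn_if_unlinked(biosample, sra, link_samples):
    """Warn when BioSample and SRA are submitted together without linking.

    `link_samples` stands in for the Link_Sample_Between_NCBI_Databases
    config value; how it is read from the config is not shown here.
    """
    if biosample and sra and not link_samples:
        warnings.warn(
            "BioSample and SRA are being submitted together with "
            "Link_Sample_Between_NCBI_Databases disabled; SRA submission is "
            "likely to fail unless BioSample accessions were added manually.",
            stacklevel=2,
        )
        return True
    return False


# Example: an all-blank optional column is removed, then the description is built.
metadata = pd.DataFrame(
    {
        "organism": ["Severe acute respiratory syndrome coronavirus 2"],
        "bs-host": ["Homo sapiens"],
        "bs-host_disease": ["COVID-19"],
        "bs-isolate": [None],
        "sra-library_name": ["  "],
        "sra-file_1": ["sample_R1.fastq"],
    }
)
metadata = add_bs_description(drop_empty_prefixed_columns(metadata))
print(list(metadata.columns))
print(metadata.loc[0, "bs-description"])
```

Dropping empty columns first means an existing but blank `bs-description` column gets removed and then rebuilt from the other fields, which is the order I'd expect to be least surprising.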
Testing data was generated via:
python seqsender.py test_data --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/
Metadata and config templates were created with the Submission Wizard Shiny app.
And the workflow was run with this command:
python seqsender.py submit --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/ --submission_name COV_TEST_DATA --config_file test_data/CCPHL/cov_ccphl_config.yaml --metadata_file test_data/CCPHL/meta2.csv --fasta_file test_data/CCPHL/sequence.fasta --test