baseline filter

Open aswarren opened this issue 2 years ago • 5 comments

Pulling down surveillance data from the API includes all sequences, regardless of why they were collected. In the case of the US / GISAID, this includes traveller surveillance, which, when estimating prevalence for a particular area, can give a very different picture than domestic spread. Is there a way to filter sequences based on a baseline sequencing tag? If not, it would be useful to have.

aswarren avatar Aug 10 '23 13:08 aswarren

We have a field samplingStrategy. You can see the available tags using fields=samplingStrategy, e.g., at https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=samplingStrategy.

chaoran-chen avatar Aug 10 '23 14:08 chaoran-chen

Awesome! Thanks! Is there a field guide for explanation of A, X, Y, N?

{"errors":[],"info":{"apiVersion":1,"dataVersion":1690103788,"deprecationDate":null,"deprecationInfo":null,"acknowledgement":null},"data":[{"samplingStrategy":"A","count":48019},{"samplingStrategy":"X","count":192119},{"samplingStrategy":"Y","count":44101},{"samplingStrategy":"N","count":314101},{"samplingStrategy":null,"count":7683436}]}
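For reference, here is a minimal Python sketch of how a response like the one above could be tallied, to see what share of sequences carries each samplingStrategy value (or none at all). The function name `strategy_shares` is made up for illustration; only the JSON layout is taken from the response pasted above.

```python
import json

# Response body copied from the aggregated endpoint output shown above.
RESPONSE = (
    '{"errors":[],"info":{"apiVersion":1,"dataVersion":1690103788,'
    '"deprecationDate":null,"deprecationInfo":null,"acknowledgement":null},'
    '"data":[{"samplingStrategy":"A","count":48019},'
    '{"samplingStrategy":"X","count":192119},'
    '{"samplingStrategy":"Y","count":44101},'
    '{"samplingStrategy":"N","count":314101},'
    '{"samplingStrategy":null,"count":7683436}]}'
)


def strategy_shares(response_text):
    """Map each samplingStrategy value to (count, share of all sequences).

    Hypothetical helper: it assumes the response has the "errors" and
    "data" keys seen in the example above.
    """
    payload = json.loads(response_text)
    if payload["errors"]:
        raise ValueError(f"API reported errors: {payload['errors']}")
    rows = payload["data"]
    total = sum(row["count"] for row in rows)
    # A JSON null strategy becomes the Python key None (no annotation).
    return {
        row["samplingStrategy"]: (row["count"], row["count"] / total)
        for row in rows
    }


if __name__ == "__main__":
    for strategy, (count, share) in strategy_shares(RESPONSE).items():
        label = "unannotated" if strategy is None else strategy
        print(f"{label}\t{count}\t{share:.2%}")
```

In this response most sequences have no samplingStrategy at all, so any filter on this field would only cover a small, annotated subset.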

aswarren avatar Aug 10 '23 14:08 aswarren

Is there a field guide for explanation of A, X, Y, N?

The values A, X, Y, N are set only for data pulled from RKI (Germany's CDC), as opposed to GenBank. Their README is here: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland

image

It's a bit scrambled, the sentences seem incomplete. I would say:

X: unknown whether targeted or not
Y: sequencing done potentially due to interesting mutations/variant PCR
A: Variant PCR suggested something of interest
N: Representative sampling

I'm not sure about how reliable the annotation is though. I remember that when I looked into it a year ago, it seemed like representative sampling wasn't necessarily representative.

I think the field was introduced back in the day when labs started to do variant PCRs to get a quick idea of which variant a patient had, since variant PCR was as fast as regular PCR and involved less delay than waiting for whole genome sequencing.

corneliusroemer avatar Aug 10 '23 14:08 corneliusroemer

Ah, thanks very much to you both. Since @chaoran-chen's example uses the open API, I was also wondering about the mapping from the "purpose_of_sampling" tag in NCBI to the codes explained in @corneliusroemer's link.

One example where the baseline tag ends up mattering in the US is the CDC sequencing nasal swabs vs. traveller surveillance. In previous months, when pulling down the surveillance data via the API, the growth curve of XBB.1.16 looked much more aggressive in domestic surveillance because traveller surveillance was being included. If I were estimating prevalence in a state, I likely wouldn't want to include people landing at the airport, whether domestic or international. That motivated my initial question about the ability to filter, since presumably traveller surveillance wouldn't qualify as baseline or might be distinguishable in some way via that field.

On NCBI, purpose_of_sampling can be accessed via the CLI like so:

$ datasets summary virus genome taxon sars-cov-2 --released-after 05/20/2023 | jq -r '.reports[] | select(.purpose_of_sampling != null) | [.accession,.purpose_of_sampling,.isolate.name] | @tsv' >ncbi_baseline.tsv

Most of that command-line magic was provided by Eric Cox at NCBI-Datasets.
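For anyone who would rather do the filtering step in Python than in jq, a rough sketch of the same selection follows. It assumes the summary JSON has the shape implied by the jq filter above (a top-level `reports` array whose entries may carry `accession`, `purpose_of_sampling`, and `isolate.name`); the helper names and the sample record are hypothetical and not taken from real NCBI output.

```python
import json


def purpose_rows(summary):
    """Return (accession, purpose_of_sampling, isolate name) tuples.

    Mirrors the jq filter: only reports with a non-null
    purpose_of_sampling are kept.
    """
    rows = []
    for report in summary.get("reports", []):
        purpose = report.get("purpose_of_sampling")
        if purpose is None:
            continue
        isolate_name = (report.get("isolate") or {}).get("name")
        rows.append((report.get("accession"), purpose, isolate_name))
    return rows


def to_tsv(rows):
    """Join rows into tab-separated text; missing values become empty strings.

    Simplification: unlike jq's @tsv, this does not escape tabs or
    newlines that occur inside field values.
    """
    return "\n".join(
        "\t".join("" if value is None else str(value) for value in row)
        for row in rows
    )


if __name__ == "__main__":
    # Hypothetical sample data, only to show the expected input shape.
    sample = json.loads(
        '{"reports": ['
        '{"accession": "ACC0001", "purpose_of_sampling": "example purpose",'
        ' "isolate": {"name": "example-isolate"}},'
        '{"accession": "ACC0002", "isolate": {"name": "no-purpose"}}'
        ']}'
    )
    print(to_tsv(purpose_rows(sample)))
```

Having the rows in Python makes it easy to go one step further and drop or keep specific purpose_of_sampling values before estimating prevalence.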

aswarren avatar Aug 10 '23 17:08 aswarren

Ah, very nice @aswarren! The open data gets to LAPIS via nextstrain/ncov-ingest, and I don't think we currently use that purpose_of_sampling field there, though we definitely should.

corneliusroemer avatar Aug 10 '23 19:08 corneliusroemer