augur curate apply-geolocation-rules

Open joverlee521 opened this issue 3 years ago • 1 comments

Context

We have a great list of manually curated geolocation rules in ncov-ingest that would be useful as standard geolocation rules for other pathogen data curation efforts.

How the geolocation rules work

Instead of keeping track of manual annotations for locations on a per record/strain basis, these geolocation rules can be applied across all records. The geolocation rules TSV follows this format:

region/country/division/location<\t>region/country/division/location

The first set of locations is the expected "raw" locations in the metadata, and the second set of locations after the tab is the "annotated" locations that will be applied to the metadata. Each geo resolution (region, country, division, location) is expected to be a column in the metadata. By using the region/country/division/location hierarchy, we ensure that locations with the same name are treated differently based on their full hierarchy. If a rule should apply across multiple locations, a wildcard (*) can be used instead of a specific value.
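As a rough illustration of the format above, here is a minimal Python sketch that splits one rule line into its raw and annotated halves. The function name `parse_rule` is hypothetical and is not taken from the existing script; the sketch also assumes that no location name contains a "/" or a tab.

```python
def parse_rule(line):
    """Split one rule line into (raw, annotated) 4-tuples.

    Hypothetical helper for illustration only: assumes exactly one tab
    per line and no "/" inside any location name.
    """
    raw, annotated = line.rstrip("\n").split("\t")
    raw_parts = tuple(raw.split("/"))
    annotated_parts = tuple(annotated.split("/"))
    if len(raw_parts) != 4 or len(annotated_parts) != 4:
        raise ValueError(f"Expected region/country/division/location, got: {line!r}")
    return raw_parts, annotated_parts


rule = parse_rule(
    "North America/United States/New York/*\t"
    "North America/United States/New York State/*"
)
print(rule)
```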

This might make more sense by walking through a specific example. Let's say you have the following locations in your metadata:

region        | country       | division | location
North America | United States | New York | Buffalo
North America | United States | New York | New York

You want to do the following:

  • Change the location "New York" to "New York City"
  • Change the division "New York" to "New York State"
  • Change the country "United States" to "USA"

You can achieve this with these geolocation rules:

North America/United States/New York/New York<\t>North America/United States/New York/New York City
North America/United States/New York/*<\t>North America/United States/New York State/*
North America/United States/*/*<\t>North America/USA/*/*

The first rule will look for the specific hierarchy to correct the location to "New York City". The second rule has a wildcard as the location, so it will correct all applicable divisions to "New York State". The third rule has wildcards for both division and location, so it will correct all applicable countries to "USA".
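To make the walkthrough concrete, here is a minimal Python sketch of how the three rules could be applied to the two example records. It is not the existing script's implementation. It assumes that the matching rule with the fewest wildcards wins, that a wildcard in the annotated side keeps the record's current value, and that rules are re-applied until the record stops changing. All names (`matches`, `apply_once`, `apply_rules`, `MAX_PASSES`) are hypothetical.

```python
WILDCARD = "*"
MAX_PASSES = 100  # hypothetical safety cap in case rules form a cycle

# Each rule is (raw, annotated), both (region, country, division, location).
RULES = [
    (("North America", "United States", "New York", "New York"),
     ("North America", "United States", "New York", "New York City")),
    (("North America", "United States", "New York", "*"),
     ("North America", "United States", "New York State", "*")),
    (("North America", "United States", "*", "*"),
     ("North America", "USA", "*", "*")),
]


def matches(raw, record):
    """A rule matches when every field is equal or is a wildcard."""
    return all(r == WILDCARD or r == v for r, v in zip(raw, record))


def apply_once(rules, record):
    """Apply the matching rule with the fewest wildcards (assumed priority)."""
    candidates = [rule for rule in rules if matches(rule[0], record)]
    if not candidates:
        return record
    _, annotated = min(candidates, key=lambda rule: rule[0].count(WILDCARD))
    # A wildcard on the annotated side keeps the record's current value.
    return tuple(v if a == WILDCARD else a for a, v in zip(annotated, record))


def apply_rules(rules, record):
    """Re-apply rules until the record no longer changes."""
    for _ in range(MAX_PASSES):
        updated = apply_once(rules, record)
        if updated == record:
            return record
        record = updated
    raise RuntimeError("Rules did not converge; check for cyclic rules")


print(apply_rules(RULES, ("North America", "United States", "New York", "New York")))
# ('North America', 'USA', 'New York State', 'New York City')
print(apply_rules(RULES, ("North America", "United States", "New York", "Buffalo")))
# ('North America', 'USA', 'New York State', 'Buffalo')
```

Under these assumptions the "New York" record passes through all three rules in turn, while the "Buffalo" record skips the first rule and is changed only by the second and third.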

Description

augur curate apply-geolocation-rules can adopt the apply-geolocation-rules script from monkeypox/ingest. The script allows users to supply multiple geolocation rule TSVs so that they can use "standard" geolocation rules but also provide their own rules that can override the general rules. We can port over the geolocation rules from ncov-ingest as a starting point for augur's default "standard" rules. There are some GISAID-specific rules in there that can be removed over time, but there are ~43000 rules in the file as of 2022-07-07.
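One way to picture the override behavior is the sketch below, which merges several rule sets so that a later set replaces an earlier rule with the same raw location. This is an assumption about how overriding could work, not a description of the existing script, and `merge_rules` is a hypothetical name.

```python
def merge_rules(*rule_sets):
    """Merge rule sets in order; later sets override earlier ones.

    Hypothetical helper: each rule is (raw, annotated), and rules are
    keyed on the raw (region, country, division, location) tuple.
    """
    merged = {}
    for rules in rule_sets:
        for raw, annotated in rules:
            merged[raw] = annotated
    return list(merged.items())


standard_rules = [
    (("North America", "United States", "*", "*"),
     ("North America", "USA", "*", "*")),
]
user_rules = [
    (("North America", "United States", "*", "*"),
     ("North America", "United States of America", "*", "*")),
]

# The user-supplied rule replaces the standard rule for the same raw location.
print(merge_rules(standard_rules, user_rules))
```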

@victorlin has also started some unit tests which can be ported over and help with understanding of the underlying logic.

joverlee521, Jul 07 '22 22:07

During discussion today, @corneliusroemer mentioned that we should standardize our admin divisions based on ISO codes. Previously, @rneher suggested using something like pycountry to make more use of ISO codes.

joverlee521, Oct 04 '23 19:10

I'll take this one, which is effectively copying a file over, moving tests from my stale PR, and translating a couple more tests mentioned in that PR's checklist.

victorlin, Jun 25 '24 23:06

Closing this issue as complete since the new command has been added. The remaining tasks will be tracked in https://github.com/nextstrain/ingest/issues/43.

joverlee521, Jul 02 '24 19:07