Liz G
This issue is now in lines 80-90 of https://github.com/wellcometrust/reach/blob/master/reach/airflow/tasks/fuzzy_match_refs.py i.e.

```python
res = self.es.search(
    index=self.es_index,
    body=body,
    size=1
)
matches_count = res['hits']['total']['value']
if matches_count == 0:
    return
best_match = res['hits']['hits'][0]
```
...
@ivyleavedtoadflax ok, that makes sense re outputs. In terms of instantiating the model, is it not true that `splitter_parser = SplitParser(config_file=MULTITASK_CFG)` instantiates the model and then...
Yeah, I think that makes sense. Does it also kind of make sense that adding lots of Rodrigues data to the training data dips model performance, because it was created for...
@ivyleavedtoadflax I see this is the latest deep reference parser wheel in S3: `https://s3.console.aws.amazon.com/s3/object/datalabs-public/deep_reference_parser/deep_reference_parser-2020.3.1-py3-none-any.whl?region=eu-west-2`. It was uploaded on 18th March. Does that mean there is no wheel for the...
This might be a good way to automatically release and add attributes, btw: https://github.com/marketplace/actions/automatic-releases
@jdu thanks for the analysis and info, I didn't expect it to be the regex!
Also, there are scraped documents with the same file hash. Checking the RDS data just now: there are 205,234 scraped policy docs, but only 143,142 unique file hashes.
An example of this is the policy document http://www.fao.org/3/I9553EN/i9553en.pdf, which has both the document ID 818aff942d9813d338fe31828ee9452a and the ID 14748f5b61ec161bc226354edbeee7f1.
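For illustration, a quick way to surface these duplicates is to group the scraped docs by file hash and flag any hash attached to more than one document ID. This is only a hedged sketch, not the Reach codebase's actual dedup logic; the two FAO document IDs are from the thread above, while the third ID and all the hash values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the RDS scraped-policy-docs table.
# The two FAO document IDs are real examples from this thread;
# the hashes and the third ID are illustrative only.
docs = pd.DataFrame({
    "document_id": [
        "818aff942d9813d338fe31828ee9452a",
        "14748f5b61ec161bc226354edbeee7f1",
        "aaaa0000bbbb1111cccc2222dddd3333",
    ],
    "file_hash": ["h_abc123", "h_abc123", "h_def456"],
})

total_docs = len(docs)                      # rows scraped
unique_hashes = docs["file_hash"].nunique() # distinct file hashes
# keep=False marks every row whose hash appears more than once
dupes = docs[docs.duplicated("file_hash", keep=False)]

print(total_docs, unique_hashes)  # 3 2
print(dupes["document_id"].tolist())
```

On the real table the same `nunique()` comparison would reproduce the 205,234 vs 143,142 gap, and the `dupes` frame lists every document ID sharing a hash, which is what makes hash-based linking ambiguous.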
Nice to see the analysis! Just a comment on my idea to use the URL as the unique identifier: it would need cleaning, and it also has uniqueness problems! (Just reminded of this...
@jdu thanks for the info, good to know. It's tricky in my comparison work, since I need a unique identifier to link the Uber policy documents with the Reach ones....