Liz G
This issue is now in lines 80-90 of https://github.com/wellcometrust/reach/blob/master/reach/airflow/tasks/fuzzy_match_refs.py i.e.

```python
res = self.es.search(
    index=self.es_index,
    body=body,
    size=1
)
matches_count = res['hits']['total']['value']
if matches_count == 0:
    return
best_match = res['hits']['hits'][0]
```
...
@ivyleavedtoadflax ok, that makes sense re outputs. In terms of instantiating the model, is it not true that `splitter_parser = SplitParser(config_file=MULTITASK_CFG)` instantiates the model and then...
Yeah, I think that makes sense. Does it also kind of make sense that adding lots of Rodrigues data to the training data dips model performance, because it was created for...
@ivyleavedtoadflax I see this is the latest deep reference parser wheel in S3: `https://s3.console.aws.amazon.com/s3/object/datalabs-public/deep_reference_parser/deep_reference_parser-2020.3.1-py3-none-any.whl?region=eu-west-2`. It was uploaded on 18th March. Does that mean there is no wheel for the...
This might be a good way to automatically release and add attributes, btw: https://github.com/marketplace/actions/automatic-releases
@jdu thanks for the analysis and info, I didn't expect it to be the regex!
Also, there are scraped documents with the same file hash. Checking the RDS data just now: there are 205,234 scraped policy docs, but only 143,142 unique file hashes.
An example of this is the policy document http://www.fao.org/3/I9553EN/i9553en.pdf, which has both the document ID 818aff942d9813d338fe31828ee9452a and the ID 14748f5b61ec161bc226354edbeee7f1.
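For illustration, a quick way to surface these duplicates is to group the scraped docs by file hash and flag any hash attached to more than one document ID. This is only a hedged sketch, not the Reach codebase's actual dedup logic; the two FAO document IDs are from the thread above, while the third ID and all the hash values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the RDS scraped-policy-docs table.
# The two FAO document IDs are real examples from this thread;
# the hashes and the third ID are illustrative only.
docs = pd.DataFrame({
    "document_id": [
        "818aff942d9813d338fe31828ee9452a",
        "14748f5b61ec161bc226354edbeee7f1",
        "aaaa0000bbbb1111cccc2222dddd3333",
    ],
    "file_hash": ["h_abc123", "h_abc123", "h_def456"],
})

total_docs = len(docs)                      # rows scraped
unique_hashes = docs["file_hash"].nunique() # distinct file hashes
# keep=False marks every row whose hash appears more than once
dupes = docs[docs.duplicated("file_hash", keep=False)]

print(total_docs, unique_hashes)  # 3 2
print(dupes["document_id"].tolist())
```

On the real table the same `nunique()` comparison would reproduce the 205,234 vs 143,142 gap, and the `dupes` frame lists every document ID sharing a hash, which is what makes hash-based linking ambiguous.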
Nice to see the analysis! Just a comment on my idea to use the URL as the unique identifier: it would need cleaning, and it also has uniqueness problems! (Just reminded of this...
@jdu thanks for the info, good to know. It's tricky in my comparison work, since I need a unique identifier to link the Uber policy documents with the Reach ones....