Jeff Uren

39 comments by Jeff Uren

I'll have a look at what exactly needs to be made public and see if I can come up with a concrete list based on the different backends. I can...

We can potentially instrument this with cProfile to see what's eating up all the time while it's processing. I can have a look at running this next week wrapped with...
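A minimal sketch of what wrapping a call with cProfile could look like. The helper `profile_call` and the stand-in workload `slow_sum` are hypothetical names used only for illustration; the real entry point would be whatever function does the processing.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    # runcall profiles just this one call and returns its result.
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for the real processing step.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 10_000)
print(report)
```

Sorting by cumulative time tends to be the quickest way to see which top-level call is eating the time; sorting by `"tottime"` instead highlights the individual functions doing the work.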

This looks like it's down to the use of regex in `exact_match.py` for substitutions: each time the `clean_text` method is called, it recompiles the regular expressions. Regex isn't the fastest...
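If recompilation did turn out to be the cost, one common fix is to compile the patterns once at module level and reuse them. This is a sketch only: the patterns and the standalone `clean_text` function below are assumptions, since the actual substitutions in `exact_match.py` aren't shown here. Note that Python's `re` module also keeps its own internal cache of recently compiled patterns, so the gain may be small.

```python
import re

# Hypothetical patterns; the real substitutions in exact_match.py may differ.
# Compiled once at import time rather than on every call.
_WHITESPACE = re.compile(r"\s+")
_PUNCTUATION = re.compile(r"[^\w\s]")


def clean_text(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = _PUNCTUATION.sub("", text.lower())
    return _WHITESPACE.sub(" ", text).strip()


print(clean_text("  Hello,   World!  "))
```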

@liz also worth noting: if you have a license for PyCharm or IntelliJ, there's a profiling GUI built on top of cProfile built into the IDE which you can use...

I still have the pyprof, so I had a quick look: the regex compilation only happens about 15 times across the life of the script, so it is caching it,...

It's possible it could change, but I'm not sure the cases where that would happen apply here. For instance if there was a process on their end which opened...

If you have the original file from your gold data you could calc a hash of it, pull the same file from S3 and calc the hash on it and...
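A rough sketch of the hash comparison, assuming both copies are already on local disk (the S3 download step is left out, and `file_md5` and `same_content` are hypothetical helper names):

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files hash to the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

Calling `same_content` on the original gold file and the copy pulled from S3 would show whether the bytes changed anywhere along the way.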

Also, if you opened the file in Preview and then calculated the MD5 hash in order to set up the gold data set, your MD5 hash calculated locally might differe...

Already tested (was curious to see for myself in any case) and at least with Preview in macOS it doesn't change the file hash, even if I open and save...

The fact you're getting some as well is weird. Because if it was something in the pipeline that was changing the hash I would think it would be all or...