get_tld(), get_components(), and other changes
Reworked the code of get_tld() to reduce the number of temporary objects (and so reduce the load on the GC during bulk operations). The requirement for the input domain name is updated to give consistent behavior across all methods: the input domain string must be a FQDN without a trailing dot, and empty labels are not allowed.
- get_tld() is split into get_tld() and get_tld_unsafe(). The latter implements the entire logic; the former performs the lower() conversion and None checks.
- get_tld_unsafe() is rewritten to create a minimal number of temporary objects.
- get_sld() is split into get_sld() and get_sld_unsafe(). The latter is based on get_tld_unsafe(), following the same principles.
- get_components() and get_components_unsafe() methods are introduced. These methods split the domain string into a 3-tuple (prefix, SLL, TLD), where SLL is the second-level label and TLD is the TLD or ETLD (a usage sketch follows this list).
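A rough usage sketch of the proposed API, for illustration only. The function names follow the list above; the exact signatures, module-level availability, and return values are assumptions about this PR rather than the final implementation.

```python
# Illustrative sketch only; names follow the PR description above, and the
# exact call pattern / return values are assumptions, not the final API.
import publicsuffix2 as psl2

# Input contract proposed in this PR: lowercase FQDN, no trailing dot, no empty labels.
domain = "www.images.example.co.uk"

tld = psl2.get_tld(domain)   # expected: "co.uk" (the TLD/ETLD)
sld = psl2.get_sld(domain)   # expected: "example.co.uk"

# Proposed here: split into (prefix, second-level label, TLD/ETLD).
prefix, sll, tld = psl2.get_components(domain)
# expected: prefix == "www.images", sll == "example", tld == "co.uk"

# The *_unsafe variants implement the core logic and skip the lower()
# conversion and None checks, so the caller must pass an already-normalized string.
tld = psl2.get_tld_unsafe(domain)
```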
@hiratara @KnitCode ping... your review would be very welcome.
@vadym-t Thank you ++ That's a big PR and I reckon this would change the API and behavior. I would rather avoid doing so as it would break existing code. Would there be a way to achieve something similar without changing behavior? (e.g. adding new functions or optional args?)
Also, you have introduced Python 3-only idioms that make the tests fail. For now, I would like to keep things working on Python 2.7 too. Do you think this could be reworked?
Hi. I've worked with Vadym extensively over the last week+ on this logic. He is on our team and identified logical bugs and performance issues. I haven't yet code-reviewed the latest version, but I tested an earlier version extensively without issue. To your point, though, I didn't test Python 2.7 at all.
The primary changes are two-fold:
- changes to the computation that reduce the creation of temporary objects. This has the same result as the previous code. In my testing, it was slightly faster than the old version for get_tld() and get_sld() on datasets that fit in memory; for large distributed sets, the timing was the same but fewer objects were created. This helps for Dask/Spark usage of the library (a rough sketch of such a timing comparison follows this list).
- test changes. Vadym dug into the algorithm on the Mozilla site and compared it to the tests (theirs and ours). There were tests in their suite that are inconsistent and appear to be based on an old PSL list, but they passed for us because of default logic in our existing code, which itself was faulty on edge cases. A number of the issues occur at the edges that most users, e.g. web page usage, wouldn't see, but that we see in very large DNS systems.
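For context, a minimal sketch of how such an in-memory timing comparison could be run; this is not the benchmark referenced above, and it assumes the module-level get_tld()/get_sld() helpers and a pre-normalized list of domains.

```python
# Minimal timing sketch, not the benchmark referenced above. Assumes the
# module-level get_tld()/get_sld() helpers and pre-lowercased FQDNs without
# trailing dots.
import timeit
import publicsuffix2 as psl2

domains = ["www.example.co.uk", "mail.example.com", "a.b.c.example.org"] * 100000

def run_get_tld():
    for d in domains:
        psl2.get_tld(d)

def run_get_sld():
    for d in domains:
        psl2.get_sld(d)

print("get_tld:", timeit.timeit(run_get_tld, number=1))
print("get_sld:", timeit.timeit(run_get_sld, number=1))
```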
Vadym asked what to do -- and I suggested fixing the tests to what they really should be and doing a PR. In addition, he added a new suite which starts from the published algorithm and covers all the combinations.
So overall, I think it should be merged and will be an upgrade for everyone. The functionality breaks will occur at the edges, in very large-scale DNS-type workloads where you'd see bad behavior and root queries. We plan to adopt it in Infoblox Data Science.
will re-review the code and documentation
_lookup_node doesn't seem to be used anymore.
https://github.com/nexB/python-publicsuffix2/blob/e95f43d26d57692fc1f153c8c9bbde62dabeaae7/src/publicsuffix2/__init__.py#L201-L240
I'm all for improving the execution speed. However, we are cautious about changing existing behavior. The reason the specs of this library change so often is that the specs are so vague. It's hard to determine if this specification change is reasonable (although I'm not actively opposed to it).
agree, and my brain hadn't noticed that. @pombredanne I would think the priorities would be in rough order:
- correcting faulty logic and unit tests based on the published algorithm; these corrections do a good job of documenting the algorithm and reveal faults in the original Mozilla tests that rely on certain things (e.g. example) being in the psl which aren't. i like the demonstrated combinations using a set list in use_cases.md
- ensuring backward compatibility, outside of being wrong :) ... so that the functions people are using today work as expected, and added functions give more control, as needed,
- py2.7 support for older use cases,
- performance improvements.. that's always a fine thing.
agreed? given that, wondering if the merge can be split to get these. if the performance improvements suggested by @vadym-t only work with py3 idioms and don't use the trie, what about using an extra flag parameter to trigger usage of the new functions? (a hypothetical sketch of that idea is below)
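Purely as an illustration of that idea (the helper names and the flag name below are hypothetical, not part of this PR), an opt-in flag could keep the current behavior as the default:

```python
def _legacy_get_tld(domain):
    ...  # placeholder: the existing implementation, unchanged

def _get_tld_unsafe(domain):
    ...  # placeholder: the new implementation from this PR (expects a normalized FQDN)

def get_tld(domain, strict=False):
    """Hypothetical wrapper: strict=False keeps today's behavior for existing callers."""
    if domain is None:
        return None
    domain = domain.lower()
    return _get_tld_unsafe(domain) if strict else _legacy_get_tld(domain)
```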
i can take a closer look on the weekend and rerun the experiments i did with an earlier version.
@vadym-t ping?