[ja] `.filter` is used instead of `.map` for non-filter methods

Open mrorii opened this issue 2 years ago • 1 comment

On https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L64-L75 there are several cases where .filter is used on functions that transform text; those calls should be .map instead.

For example, https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73 calls https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/prep/japanese_prep.py#L64-L67, but in effect this does nothing: .filter only checks the truthiness of the returned value and then discards it, and the returned string is truthy as long as the text is non-empty:

>>> import re
>>> def reduce_japanese_emoticon(text):
...     text = re.sub("w{3,}", "www", text)
...     text = re.sub("笑{2,}", "笑", text)
...     return text
>>> # sc is the SparkContext provided by the pyspark shell
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]
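
For contrast, here is a quick sketch of what switching the same call to .map would do (continuing the session above; note that .map applies the transformation and never drops elements, so the empty record is kept):

>>> rdd.map(lambda x: {'text': reduce_japanese_emoticon(x['text'])}).collect()
[{'text': 'wwwasdf'}, {'text': '1234笑'}, {'text': ''}]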

Thus, I think the following .filter calls are effectively no-ops rather than performing the intended preprocessing:

  • preprocess_text on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L70
  • reduce_japanese_emoticon on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73
  • remove_symbols on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L75

The remaining calls, to functions whose names end in _filter (e.g. japanese_bad_words_filter, doc_len_filter, etc.), are genuine predicates that return booleans, so they should be OK.
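
For reference, a rough sketch of the intended split, assuming each transform takes and returns the text and each *_filter function returns a bool (the real signatures in the repo may differ):

# Hypothetical sketch, not the actual job code:
proc_rdd = (
    proc_rdd
    .map(lambda x: {**x, "text": reduce_japanese_emoticon(x["text"])})  # transform: rewrite the record
    .filter(lambda x: japanese_bad_words_filter(x["text"]))             # predicate: keep or drop the record
)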

mrorii · Jul 14 '23 14:07

Your analysis here is correct, and this is a problem. I'm working on a fix.

I'm also trying to figure out the Spark execution model in detail, but I believe this many calls to small Python functions is itself a problem. If I understand correctly, Spark may need to shuttle data between the JVM and the Python workers separately for each filter call. If that's the case, we can drastically reduce overhead by wrapping all the Python functions into a single function before handing it to Spark.
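
As a rough sketch of that idea (function names taken from the list above; the exact signatures, record shape, and variable names are assumptions, not the final fix):

def preprocess_all(text):
    # Hypothetical wrapper: run all per-record transforms in one Python call,
    # so the Python worker is invoked once per record instead of once per step.
    text = preprocess_text(text)
    text = reduce_japanese_emoticon(text)
    text = remove_symbols(text)
    return text

proc_rdd = proc_rdd.map(lambda x: {**x, "text": preprocess_all(x["text"])})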

polm-stability · Jul 21 '23 09:07