
Let `clean_url` support modifying the default list of `remove_auth`

Open NoirTree opened this issue 4 years ago • 0 comments

Currently, remove_auth in clean_url supports two scenarios: (1) remove the tokens in the default list, or (2) remove the tokens in the union of the default list and a list the user provides. However, users may want to remove only the tokens they designate, or to keep some tokens from the default list while removing the rest. Because users cannot modify the default list of remove_auth, these requirements cannot be satisfied at the moment.

Here is an example from the documentation, which I use to show the scenarios that cannot be satisfied. The requirements are listed in the comments.

import pandas as pd
from dataprep.clean import clean_url

df = pd.DataFrame({
    "url": ["http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423"]
})

# Default behaviour: tokens in the default list are removed.
df_default = clean_url(df, column="url", split=True, report=False)
print(df_default['queries'][0])

# Goal: remove only "another_token". Fails: "auth" and "token" are removed too.
df_trial1 = clean_url(df, column="url", remove_auth=["another_token"], split=True, report=False)
print(df_trial1['queries'][0])

# Goal: remove only "auth". Fails: "token" is removed too.
df_trial2 = clean_url(df, column="url", remove_auth=["auth"], split=True, report=False)
print(df_trial2['queries'][0])


I wonder whether we can solve this problem without changing clean_url. If not, maybe clean_url needs to support modifying the default list of remove_auth?
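As a possible stopgap that leaves clean_url untouched, the two failing scenarios could be handled before (or instead of) calling it, using only the standard library. The sketch below is not part of dataprep: the helper name drop_query_params is hypothetical, and it removes exactly the query parameters it is given, with no default list involved.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, names):
    """Return url without the query parameters whose key is in names.

    Hypothetical helper, not a dataprep function. Only the keys passed in
    `names` are removed; there is no built-in default list.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # Rebuild the URL with the filtered query string.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = (
    "http://www.facebookee.com/otherpath"
    "?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423"
)

# Scenario 1: remove only "another_token"; "auth" and "token" are kept.
only_another = drop_query_params(url, {"another_token"})
print(only_another)

# Scenario 2: remove only "auth"; "token" is kept.
only_auth = drop_query_params(url, {"auth"})
print(only_auth)
```

On a DataFrame this could be applied with something like df["url"].map(lambda u: drop_query_params(u, {"auth"})). Note that urlencode re-encodes the remaining values, so URLs with unusual escaping may not round-trip byte for byte.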

NoirTree, May 16 '21 01:05