created a more user-friendly error message when bad data is found
Notify: @Bergvca
Here I suggest a solution to one user's problem (see https://github.com/Bergvca/string_grouper/pull/43#issuecomment-824591895). It was a bit more difficult to implement than I thought. :)
import random
import string
from datetime import datetime
import pandas as pd
import numpy as np
from string_grouper import compute_pairwise_similarities
Create a Series with a few random strings:
strings = [ ''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for i in range(20) ]
good_series = pd.Series(strings, name='left')
good_series.to_frame()
| | left |
|---|---|
| 0 | 6P1UMBC8D8 |
| 1 | ONWZTJ53E1 |
| 2 | TO7AADMIAD |
| 3 | 6Y1QDGIKZ5 |
| 4 | J53R2HZI96 |
| 5 | Q383BO2VLK |
| 6 | 0KINOSJ5JU |
| 7 | J8AHSMJNOE |
| 8 | IZL32I7VPC |
| 9 | 9RHVQHA0N3 |
| 10 | XUVDL96FDL |
| 11 | M7ROKPJ2IQ |
| 12 | MNXWZHRBPJ |
| 13 | 1QSN3KG4DM |
| 14 | UW9EC83LDH |
| 15 | DHZLAQHUWI |
| 16 | M6HP4FH88Z |
| 17 | CNMKI44QWZ |
| 18 | DCVVKSSUO7 |
| 19 | 27B9P0B68L |
Generate another Series of strings with some bad (non-string or empty string) values:
bad_series = pd.Series(
    random.choices(
        # five copies of each bad value, the good strings, and a few integers
        [None, np.nan, "", datetime.now()] * 5
        + strings
        + list(range(111, 115)),
        k=20,
    ),
    name='right',
).rename_axis('id')
bad_series.to_frame()
| id | right |
|---|---|
| 0 | MNXWZHRBPJ |
| 1 | M6HP4FH88Z |
| 2 | 1QSN3KG4DM |
| 3 | |
| 4 | None |
| 5 | 2021-05-09 12:27:18.736565 |
| 6 | 2021-05-09 12:27:18.736565 |
| 7 | 2021-05-09 12:27:18.736565 |
| 8 | DCVVKSSUO7 |
| 9 | MNXWZHRBPJ |
| 10 | 27B9P0B68L |
| 11 | IZL32I7VPC |
| 12 | UW9EC83LDH |
| 13 | 112 |
| 14 | MNXWZHRBPJ |
| 15 | 1QSN3KG4DM |
| 16 | None |
| 17 | None |
| 18 | None |
| 19 | NaN |
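To make "bad" concrete: a value counts as bad here if it is not a string or is an empty string. The snippet below is only a minimal sketch of that rule, not the code in this PR. `find_bad_values` is a hypothetical helper, and plain `(label, value)` pairs stand in for the Series so the example runs without pandas.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple


def find_bad_values(items: Iterable[Tuple[Hashable, Any]]) -> Dict[Hashable, Any]:
    """Return {label: value} for every value that is not a non-empty string.

    Hypothetical helper for illustration; not part of string_grouper.
    """
    return {
        label: value
        for label, value in items
        if not isinstance(value, str) or value == ""
    }


# A few rows modelled on the table above: one good string, then an empty
# string, None, an integer and NaN.
sample = [(0, "MNXWZHRBPJ"), (3, ""), (4, None), (13, 112), (19, float("nan"))]

bad = find_bad_values(sample)
print(list(bad))  # [3, 4, 13, 19]
```

With a real Series, the same helper could presumably be fed `bad_series.items()`, which yields `(index, value)` pairs.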
Notice the error message after the traceback log:
compute_pairwise_similarities(good_series, bad_series)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-56153281113f> in <module>
----> 1 compute_pairwise_similarities(good_series, bad_series)
~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in this(*args, **kwargs)
61 # function "this" in the first parameter position
62 def this(*args, **kwargs):
---> 63 return func(this, *args, **kwargs)
64 return this
65
~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in compute_pairwise_similarities(this, string_series_1, string_series_2, **kwargs)
86 this.issues = sg.issues
87 this.issues.rename(f'Non-strings in Series {sname}', inplace=True)
---> 88 raise TypeError(sg.error_msg(sname, 'compute_pairwise_similarities'))
89 return sg.dot()
90
TypeError:
ERROR: Input pandas Series 'right' (string_series_2) contains values that are not strings!
Display the pandas Series 'compute_pairwise_similarities.issues' to find where these values are:
Non-strings in Series 'right' (string_series_2)
id
3
4 None
5 2021-05-09 12:27:18.736565
6 2021-05-09 12:27:18.736565
7 2021-05-09 12:27:18.736565
13 112
16 None
17 None
18 None
19 NaN
compute_pairwise_similarities.issues
id
3
4 None
5 2021-05-09 12:27:18.736565
6 2021-05-09 12:27:18.736565
7 2021-05-09 12:27:18.736565
13 112
16 None
17 None
18 None
19 NaN
Name: Non-strings in Series 'right' (string_series_2), dtype: object
Similar functionality exists for the other high-level functions: group_similar_strings(), match_most_similar(), and match_strings().
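For readers wondering how a plain function can carry an `issues` attribute: the traceback shows a wrapper named `this` that passes itself to the wrapped function as its first argument. Below is a rough, self-contained sketch of that pattern. The names `pass_self` and `join_strings` are made up for illustration, and the use of `functools.wraps` is my own addition; the actual decorator in this PR may differ.

```python
import functools


def pass_self(func):
    """Decorator: call `func` with its own wrapper as the first argument."""
    @functools.wraps(func)
    def this(*args, **kwargs):
        return func(this, *args, **kwargs)
    return this


@pass_self
def join_strings(this, values):
    # Hypothetical stand-in for a high-level function such as
    # compute_pairwise_similarities; not part of string_grouper.
    issues = {
        i: v for i, v in enumerate(values)
        if not isinstance(v, str) or v == ""
    }
    if issues:
        # Store the offending values on the function object itself so the
        # caller can inspect them after the exception is raised.
        this.issues = issues
        raise TypeError(
            "input contains values that are not strings; "
            "see join_strings.issues"
        )
    return "".join(values)


try:
    join_strings(["a", None, "b", 112])
except TypeError as err:
    print(err)
    print(join_strings.issues)  # {1: None, 3: 112}
```

The attribute lives on the wrapper that the decorator returns, which is the object bound to the public name. That is why `compute_pairwise_similarities.issues` can be displayed after the call fails.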
Hi @Bergvca
Just noticed you merged the other PR. If you intend to merge the next two, perhaps it would be best to start with this one as it has fewer changes than the other.
:)
ok thanks, will do :)