Having trouble processing ICD codes
I am not using MIMIC-III or eICU data; since this pipeline should be applicable to other EHR data sets, I am using it for in-house EHR data. No matter how I preprocess the ICD codes (e.g. ICD9:V50.2 vs V50.2 vs V502), I always encounter the error below:
--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables : 31734
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'icd_code:0'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 235, in process_time_dependent
df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 430, in transform_time_series_table
variables_num_freq = get_frequent_numeric_variables(df_in, variables, theta_freq, args)
File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in get_frequent_numeric_variables
numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in <listcomp>
numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
return self._get_value(key)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
loc = self.index.get_loc(label)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'icd_code:0'
So my df_types has only one ICD-related variable name, icd_code, which is correct. However, the parse_variable_data_type step has created a whole new list of variable names that begin with icd_code, which is why variables has a long list of "icd_code:*" elements. The whole process is confusing and short on detail. Could you please point me to the source of the error? Many thanks.
Or does it mean ICD codes cannot be time-dependent variables? Surely they should be allowed?
Hello, I have just updated the code to fix this error. Please download the latest code from GitHub.
You may check out an example with data containing time-dependent ICD codes here. Please try to format your data according to this example.
Additionally, if the variable_name for your ICD code data is not "ICD9_CODE", you will need to change the following in your config file:
https://github.com/MLD3/FIDDLE/blob/86b197fc7ac3e6e90851e4bf01279156539aaee2/tests/icd_time_test/input/config-0.yaml#L4-L5
Many thanks @shengpu1126 ! Can I please confirm with you:
- I have noticed in your icd_time_test example, your icd code is a series of letters and numbers e.g. V502 but your hierarchical_sep: ':'. My diagnosis code contains one or more dots e.g. V50.2. Should I get rid of the dots or should I set hierarchical_sep: '.' if hierarchical_sep is indeed used for this purpose?
- My diagnosis codes contain some non-icd codes e.g. DRG:389. Do you recommend I use hierarchical or Categorical as value_types?
Currently I can only use hierarchical_levels: [0]. If I set hierarchical_levels: [0, 1], the error below occurs, even though my diagnosis codes have different levels.
================================================================================
2) Transform; 3) Post-filter
================================================================================
--------------------------------------------------------------------------------
*) Detecting and parsing value types
--------------------------------------------------------------------------------
Parsing hierarchical values
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 131, in main
df_data, df_types = FIDDLE_steps.parse_variable_data_type(df_data, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 99, in parse_variable_data_type
df_hier_level[val_col] = df_hier_level[val_col].apply(lambda h: h[min(hier_level, len(h))])
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 4357, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1043, in apply
return self.apply_standard()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1101, in apply_standard
convert=self.convert_dtype,
File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 99, in <lambda>
df_hier_level[val_col] = df_hier_level[val_col].apply(lambda h: h[min(hier_level, len(h))])
IndexError: list index out of range
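Judging only from the lambda shown in the traceback, the index is clamped to len(h), which is one past the last valid position, so any value with fewer levels than requested raises this IndexError. The following is a minimal sketch of that behaviour, assuming each parsed value is a list of levels; the helper name pick_level and the sample values are hypothetical and not part of FIDDLE:

```python
def pick_level(levels, hier_level):
    """Return the requested level, falling back to the deepest one available."""
    # Clamp to the last valid index (len - 1) rather than len.
    return levels[min(hier_level, len(levels) - 1)]


single = ["389"]           # a value with only one level
nested = ["V50", "V50.2"]  # a value with two levels

try:
    # Same clamp as in the traceback: min(1, 1) == 1, but only index 0 exists.
    single[min(1, len(single))]
except IndexError as err:
    print("clamp to len(h) fails:", err)

print(pick_level(single, 1))  # falls back to the only level
print(pick_level(nested, 1))  # second level is available
```

This only illustrates why values with a single level can trigger the error when hierarchical_levels includes 1; whether the pipeline should fall back to the deepest level is a design decision for the maintainers.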
I have noticed in your icd_time_test example, your icd code is a series of letters and numbers e.g. V502 but your hierarchical_sep: ':'. My diagnosis code contains one or more dots e.g. V50.2. Should I get rid of the dots or should I set hierarchical_sep: '.' if hierarchical_sep is indeed used for this purpose?
There's built-in support for ICD9/ICD10 codes through the icd9cms and icd10-cm packages, so I believe both V50.2 and V502 should work. The : separator is for other types of hierarchical values that need to be preprocessed.
My diagnosis codes contain some non-icd codes e.g. DRG:389. Do you recommend I use hierarchical or Categorical as value_types?
I am less familiar with DRG codes. Does the DRG code 389 have multiple levels, similar to the ICD9 code V502 having two levels, V50 and V50.2? If not, I think you may just treat it as a Categorical variable, for example:
| ID | t | variable_name | variable_value |
|---|---|---|---|
| XXX | 4 | DRG:1234 | 1 |
| XXX | 5 | ICD9_CODE | V502 |
Otherwise you should preprocess it and include the separator:
| ID | t | variable_name | variable_value |
|---|---|---|---|
| XXX | 4 | DRG_CODE | 12:34 |
| XXX | 5 | ICD9_CODE | V502 |
Many thanks, @shengpu1126 ! I have separated ICD 9 and 10 codes from the rest, and named each coding scheme uniquely, e.g.:
ICD9_CODE: hierarchical_ICD9
ICD10_CODE: hierarchical_ICD10
DRG_CODE: Categorical
DSM4_CODE: hierarchical
I then got the error below, which was strange since '645.03' is a legitimate ICD9 code that indicates "Prolonged pregnancy, antepartum condition or complication".
--------------------------------------------------------------------------------
*) Detecting and parsing value types
--------------------------------------------------------------------------------
Parsing hierarchical values
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 131, in main
df_data, df_types = FIDDLE_steps.parse_variable_data_type(df_data, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 83, in parse_variable_data_type
df_var = df_var.apply(lambda s: map_icd_hierarchy(s, version=9))
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 4357, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1043, in apply
return self.apply_standard()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1101, in apply_standard
convert=self.convert_dtype,
File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 83, in <lambda>
df_var = df_var.apply(lambda s: map_icd_hierarchy(s, version=9))
File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 39, in map_icd_hierarchy
raise Exception("Invalid ICD code", s)
Exception: ('Invalid ICD code', '645.03')
I then removed the dots as mentioned earlier, but the error persisted: Exception: ('Invalid ICD code', '64503').
However, changing
ICD9_CODE: hierarchical_ICD9
ICD10_CODE: hierarchical_ICD10
to
ICD9_CODE: hierarchical
ICD10_CODE: hierarchical
and switching back to codes that have the separator in them (hierarchical_sep: ".") worked.
However, I have now encountered a new error:
--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables : 771
Frequent variables : []
M₁ = 0
M₂ = 771
k = 3 ['min', 'max', 'mean']
Transforming each example...
0%| | 0/200 [00:00<?, ?it/s]10000377
Traceback (most recent call last):
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 371, in func_encode_single_time_series
df_j = pivot_event_table(g).reindex(columns=variables_non).sort_index()
File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 223, in pivot_event_table
df_dups.loc[df_v.index, t_col] += eps * np.arange(len(df_v))
File "D:\bo\envs\bd\lib\site-packages\pandas\core\generic.py", line 10964, in __iadd__
return self._inplace_method(other, type(self).__add__) # type: ignore[operator]
File "D:\bo\envs\bd\lib\site-packages\pandas\core\generic.py", line 10941, in _inplace_method
result = op(self, other)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
return method(self, other)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\arraylike.py", line 92, in __add__
return self._arith_method(other, operator.add)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 5526, in _arith_method
result = ops.arithmetic_op(lvalues, rvalues, op)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\array_ops.py", line 224, in arithmetic_op
res_values = _na_arithmetic_op(left, right, op)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\array_ops.py", line 166, in _na_arithmetic_op
result = func(left, right)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\computation\expressions.py", line 239, in evaluate
return _evaluate(op, op_str, a, b) # type: ignore[misc]
File "D:\bo\envs\bd\lib\site-packages\pandas\core\computation\expressions.py", line 69, in _evaluate_standard
return op(a, b)
ValueError: operands could not be broadcast together with shapes (121,) (11,)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 236, in process_time_dependent
df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 462, in transform_time_series_table
for i, g in tqdm(grouped[:N])
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 462, in <genexpr>
for i, g in tqdm(grouped[:N])
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 391, in func_encode_single_time_series
raise Exception(i)
Exception: 10000377
2%|█▌ | 4/200 [00:00<00:15, 12.80it/s]
After some serious digging, I traced the error to line 223 of the pivot_event_table function in helpers.py, which is called on line 371 of the func_encode_single_time_series function in steps.py. It occurs because eps * np.arange(len(df_v)) has fewer elements than df_dups.loc[df_v.index, t_col]. I found that the data instance throwing this exception has the same var_name and var_value multiple times at the same t:
48989 10000377 2.476712 ERFV_CODE 160431
48990 10000377 2.515068 ERFV_CODE 122
48991 10000377 2.701370 ERFV_CODE 751
48992 10000377 2.701370 ERFV_CODE 751
48993 10000377 2.701370 ERFV_CODE 751
48994 10000377 2.701370 ERFV_CODE 751
48995 10000377 2.706849 ERFV_CODE 751
and in g (line 371 in the func_encode_single_time_series function) this looks like:
48989 10000377 2.476712 ERFV_CODE _160431
48990 10000377 2.515068 ERFV_CODE _122
48991 10000377 2.701370 ERFV_CODE:_751 1
48992 10000377 2.701370 ERFV_CODE:_751 1
48993 10000377 2.701370 ERFV_CODE:_751 1
48994 10000377 2.701370 ERFV_CODE:_751 1
Do you have any suggestions on how to deal with this situation, please? I am not sure what the 1s in val_col represent. Do they indicate the number of occurrences? Why do we have ERFV_CODE _122 in some cases but ERFV_CODE:_751 1 in others?
Hi,
The parser for ICD9/ICD10 relies on third-party packages that I do not have control over, so it is possible that the dictionary they use is outdated and missing some of the codes. In that case, I agree with what you did, which is to preprocess the codes by adding the separators.
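As a rough illustration of that preprocessing, here is a minimal sketch that inserts the dot into undotted ICD-9-CM diagnosis codes so they can be used with hierarchical_sep: "." as described above. It assumes the codes are ICD-9-CM diagnosis codes with leading zeros preserved; the helper name add_icd9_dot is hypothetical and not part of FIDDLE:

```python
def add_icd9_dot(code):
    """Insert the decimal point into an undotted ICD-9-CM diagnosis code."""
    code = str(code).strip().upper().replace(".", "")
    # E codes keep four characters before the dot (e.g. E849.0);
    # numeric and V codes keep three (e.g. 645.03, V50.2).
    head = 4 if code.startswith("E") else 3
    if len(code) <= head:
        return code  # category-level code, nothing after the dot
    return code[:head] + "." + code[head:]


for raw in ["64503", "V502", "E8490", "250", "645.03"]:
    print(raw, "->", add_icd9_dot(raw))
```

Codes that already contain a dot pass through unchanged, so the helper can be applied to a mixed column.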
As for the issue of duplicates, the pipeline was not designed to handle duplicates. This is because for most types of EHR data like vital signs, there should not be two different values for the same patient at one point in time. There are several things you could try that may help address the error you saw:
- Use the pandas drop_duplicates function to remove duplicated rows that have the same [ID, t, variable_name] (the rows may have different variable_values)
- Add a small constant to the timestamps (e.g. 0.00001) of the duplicated rows so every row has a different timestamp.
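Both options can be sketched with pandas as follows. This is only an illustration on a toy table that reuses the column names from the tables earlier in this thread (ID, t, variable_name, variable_value); the eps value and the sample rows are assumptions, and eps should be smaller than the gap between genuinely distinct timestamps in your data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10000377] * 5,
    "t": [2.515068, 2.701370, 2.701370, 2.701370, 2.706849],
    "variable_name": ["ERFV_CODE"] * 5,
    "variable_value": ["122", "751", "751", "751", "751"],
})
key = ["ID", "t", "variable_name"]

# Option 1: keep only the first row for each (ID, t, variable_name).
df_unique = df.drop_duplicates(subset=key, keep="first")

# Option 2: keep every row, but shift the k-th duplicate by k * eps
# so that no two rows share the same (ID, t, variable_name).
eps = 1e-5
df_shifted = df.copy()
df_shifted["t"] = df_shifted["t"] + eps * df_shifted.groupby(key).cumcount()

print(df_unique)
print(df_shifted)
```

Option 1 discards repeated events, while option 2 preserves them at the cost of slightly altered timestamps, so the choice depends on whether repeated codes at the same time carry meaning in your data.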
Many thanks @shengpu1126 !
Looking at the last example in my previous comment, can I please ask why you have different formats for var_name and var_value? e.g.
48989 10000377 2.476712 ERFV_CODE _160431
vs.
48991 10000377 2.701370 ERFV_CODE:_751 1
Or, e.g. reading the final df_X I have noticed two different ways of representing ERFV_CODE:160431:
ERFV_CODE_value__160431
vs
ERFV_CODE:_160431_value_1
Are they different in terms of how one should interpret them?
Also, what would ICD9_CODE_value_(1.999, 314.0] possibly represent?
Also, what would ICD9_CODE_value_(1.999, 314.0] possibly represent?
This is likely because some ICD codes look like numbers, and Python will interpret them as numbers unless we explicitly tell it that they are strings. One workaround I usually use is to prepend an underscore ("123" -> "_123") so they cannot be interpreted as numbers.
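The underscore workaround could look like the sketch below. It assumes the input table uses the variable_name and variable_value columns shown earlier in this thread and that the code variable is named ICD9_CODE; the sample rows are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "variable_name": ["ICD9_CODE", "ICD9_CODE", "HR"],
    "variable_value": ["314", "V502", 72],
})

# Prefix only the code rows so that values such as "314" can no longer
# be parsed as numbers; other variables are left untouched.
is_code = df["variable_name"] == "ICD9_CODE"
df.loc[is_code, "variable_value"] = "_" + df.loc[is_code, "variable_value"].astype(str)

print(df)
```

When reading the data from CSV, it may also help to load the value column as strings (for example with dtype=str in pandas.read_csv) so that numeric-looking codes are not converted before the prefix is applied.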