
ICEExplainer returns same feature importance

Open akshat-suwalka-dream11 opened this issue 2 years ago • 10 comments

SynapseML version

Version: 0.11.0

System information

  • Databricks Runtime 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12)
  • com.microsoft.azure:synapseml_2.12:0.10.1
  • PySpark on Databricks

Describe the problem

In my RandomForestClassificationModel, which is a PySpark model, all of the features are numerical.

The output is shown under "Other info / logs" below.

Code to reproduce issue

pdp_1 = ICETransformer(
    model=model_object_1,
    targetCol="probability",
    kind="average",
    targetClasses=[1],
    numericFeatures=[
        {"name": "pd1_amount_join", "numSplits": 50, "rangeMin": 0.0, "rangeMax": 400000.0}
    ],  # convert -290 to -1
)

output_pdp_1 = pdp_1.transform(features_1.filter(features_1.days_inactive == 0))
display(output_pdp_1)

# Below is the code which raises the error
df_userid_1 = get_pandas_df_from_column(output_pdp_1, "pd1_amount_join_dependence")
plot_dependence_for_numeric(df_userid_1, "pd1_amount_join")

Other info / logs

1st display result -> {"264000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "0.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "400000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "80000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "336000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "56000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "32000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "384000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "24000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "152000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "72000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "248000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "160000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "176000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "200000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "296000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "368000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "376000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "168000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "64000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "184000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "240000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "88000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "360000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "320000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "256000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "352000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "136000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "8000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "312000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "16000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "192000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "216000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "232000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "272000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "104000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "392000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "224000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "128000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "288000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "344000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "208000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "40000.0": {"vectorType": "dense", "length": 1, "values": 
[0.34720682012802506]}, "96000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "280000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "112000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "48000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "144000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "304000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "328000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "120000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}}

2nd error ->

/databricks/spark/python/pyspark/sql/pandas/conversion.py:92: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field 104000.0. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
  Direct cause: Unsupported type in conversion to Arrow: VectorUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)

ValueError: invalid literal for int() with base 10: '104000.0'
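For reference (an illustration added here, not part of the original logs): the final ValueError comes from parsing the float-formatted bucket column names with int(), which only accepts integer-formatted strings:

try:
    int("104000.0")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '104000.0'

print(int(float("104000.0")))  # 104000 -- parsing via float first succeeds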

What component(s) does this bug affect?

  • [ ] area/cognitive: Cognitive project
  • [X] area/core: Core project
  • [ ] area/deep-learning: DeepLearning project
  • [ ] area/lightgbm: Lightgbm project
  • [ ] area/opencv: Opencv project
  • [ ] area/vw: VW project
  • [X] area/website: Website
  • [ ] area/build: Project build system
  • [X] area/notebooks: Samples under notebooks folder
  • [ ] area/docker: Docker usage
  • [X] area/models: models related issue

What language(s) does this bug affect?

  • [ ] language/scala: Scala source code
  • [X] language/python: Pyspark APIs
  • [ ] language/r: R APIs
  • [ ] language/csharp: .NET APIs
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/synapse: Azure Synapse integrations
  • [ ] integrations/azureml: Azure ML integrations
  • [X] integrations/databricks: Databricks integrations

akshat-suwalka-dream11 avatar Mar 29 '23 12:03 akshat-suwalka-dream11

Hey @akshat-suwalka-dream11 :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot] avatar Mar 29 '23 12:03 github-actions[bot]

@mhamilton723

akshat-suwalka-dream11 avatar Apr 03 '23 10:04 akshat-suwalka-dream11

I'll investigate.

memoryz avatar Apr 06 '23 07:04 memoryz

@akshat-suwalka-dream11, can you modify the plot_dependence_for_numeric function to this and see if it works:

import matplotlib.pyplot as plt


def plot_dependence_for_numeric(df, col, col_int=True, figsize=(20, 5)):
    # Each bucket column holds a single-element DenseVector; pull out its value.
    dict_values = {}
    for col_name in df.columns:
        dict_values[col_name] = df[col_name][0].toArray()[0]

    # Sort the buckets numerically; int(float(...)) handles float-formatted
    # column names such as "104000.0".
    marklist = sorted(
        dict_values.items(), key=lambda x: int(float(x[0])) if col_int else x[0]
    )
    sortdict = dict(marklist)

    plt.figure(figsize=figsize)
    plt.plot(list(sortdict.keys()), list(sortdict.values()))
    plt.xlabel(col, size=13)
    plt.ylabel("Dependence")
    plt.ylim(0.0)
    plt.show()
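
The key change is the sort key: int(float(x[0])) parses the float-formatted bucket names in the ICE output (for example "104000.0") instead of raising the ValueError above. The call site stays the same as in your reproduction code:

df_userid_1 = get_pandas_df_from_column(output_pdp_1, "pd1_amount_join_dependence")
plot_dependence_for_numeric(df_userid_1, "pd1_amount_join")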

memoryz avatar Apr 08 '23 21:04 memoryz

@memoryz Thank you for the reply... It fixes the plotting problem. But I am seeing that for every column, and for every bucket within it, only a single constant value comes back, like 0.34720682012802506 above. One might say this feature is simply not important and that is why it shows a constant value, but I see the exact same value for every feature. That is problematic.

akshat-suwalka-dream11 avatar Apr 10 '23 08:04 akshat-suwalka-dream11

@akshat-suwalka-dream11 can you attach a screenshot of what you're seeing? I'm not sure if I understand what the problem is.
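
In the meantime, one way to narrow this down (a rough sketch, assuming model_object_1 and features_1 from your reproduction code) is to score a single row with pd1_amount_join overridden by hand and check whether the probability changes at all:

from pyspark.sql import functions as F

# Take one row and make copies with different values of the feature.
base_row = features_1.filter(features_1.days_inactive == 0).limit(1)
probe = None
for v in [0.0, 100000.0, 200000.0, 400000.0]:
    r = base_row.withColumn("pd1_amount_join", F.lit(v))
    probe = r if probe is None else probe.union(r)

# If "probability" is identical for every value, the model (or an upstream
# feature-assembly step) is ignoring the column, and the explainer is only
# reflecting that.
model_object_1.transform(probe).select("pd1_amount_join", "probability").show(truncate=False)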

memoryz avatar Apr 10 '23 20:04 memoryz

@memoryz Screenshot 2023-04-14 at 3 21 09 PM

akshat-suwalka-dream11 avatar Apr 14 '23 09:04 akshat-suwalka-dream11

Screenshot 2023-04-14 at 3 22 02 PM

akshat-suwalka-dream11 avatar Apr 14 '23 09:04 akshat-suwalka-dream11

Every single column has this type of data.

akshat-suwalka-dream11 avatar Apr 14 '23 09:04 akshat-suwalka-dream11

@memoryz

akshat-suwalka-dream11 avatar Apr 23 '23 08:04 akshat-suwalka-dream11