
Report not showing when using a bit more data

rmminusrslash opened this issue · 3 comments

Hey,

I wanted to run against a production dataset of small-mid size:

65 columns, 150K points in each dataframe.

If I reduce the dataset to one feature, the report shows. If I use all features, the report grows from 16 MB to 600 MB and no longer displays (whether saved to file or rendered in Jupyter).

— rmminusrslash, Aug 11 '21

Hey @rmminusrslash, thanks for reporting! Unfortunately, this is a current limitation of the tool.

The report is large because the tool stores all the data needed to generate the interactive plots directly inside the HTML. We plan to fix this when we build a service version of the tool (where data storage is decoupled from the browser-based web service).

For now there are two workarounds:

  1. Use a sampling strategy on your dataset, for instance random sampling. In a Jupyter notebook this can be done directly with pandas. For the command-line interface we have a configuration option: you can choose random sampling or take every n-th row.
  2. Use a JSON profile. This way Evidently still calculates the metrics and statistical tests, but they can be logged or displayed elsewhere. We have an example for MLflow https://docs.evidentlyai.com/step-by-step-guides/integrations/evidently-+-mlflow and I am working on one for Grafana now.
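The pandas side of workaround 1 can be sketched like this (the frame below is synthetic stand-in data, not the reporter's dataset; 10K rows is roughly the size the thread later reports as still rendering):

```python
import numpy as np
import pandas as pd

# Stand-in "production" frame; the real case was 65 columns x 150K rows.
df = pd.DataFrame(
    np.random.default_rng(0).random((150_000, 5)),
    columns=[f"f{i}" for i in range(5)],
)

# Random sampling: pass this smaller frame to the report instead of df.
sample = df.sample(n=10_000, random_state=42)

# Alternative, mirroring the CLI's "pick the n-th rows" strategy:
# keep every 15th row (150K / 15 = 10K rows).
nth = df.iloc[::15]
```

Either frame can then be fed to the report in place of the full dataset; `random_state` just makes the sample reproducible.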

We understand this limits how you can use the tool right now, and we are working hard on a more full-featured version!

— emeli-dral, Aug 12 '21

Hey @emeli-dral,

ah, I probably should have been clearer about what I was asking. I tried sampling once I figured out the root cause; up to 10K data points worked.

Would it make sense to

  • add sampling as the default when the dataset exceeds the current limits (and display a message that sampling happened)
  • if you decide against that, at least raise an unsupported-operation exception that mentions the sampling option, and document the limitation in the docs

The current behavior of failing silently is not ideal while you work toward the full version (unless you expect people to try the tool mostly with toy data).
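The first suggestion could look something like the sketch below. Everything here is hypothetical, not Evidently API: `MAX_ROWS` and `prepare_for_report` are illustrative names, and the 10K threshold is just the figure mentioned earlier in the thread.

```python
import warnings

import pandas as pd

# Hypothetical display limit; the thread reports ~10K rows still rendering.
MAX_ROWS = 10_000


def prepare_for_report(df: pd.DataFrame) -> pd.DataFrame:
    """Down-sample oversized frames and warn, instead of failing silently."""
    if len(df) > MAX_ROWS:
        warnings.warn(
            f"Dataset has {len(df)} rows; sampled down to {MAX_ROWS} so the "
            "report can render. Pass a pre-sampled frame to control this."
        )
        return df.sample(n=MAX_ROWS, random_state=0)
    return df
```

A small frame passes through untouched; a large one comes back sampled, with a visible warning explaining what happened and how to opt out.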

— rmminusrslash, Aug 16 '21

Hey @rmminusrslash, thanks for the extra details!

We thought about adding an error message based on data size, but the limit depends on the user's infrastructure, especially when the tool runs locally, so it is hard to set a universal threshold for when sampling should be applied. As a priority, we are also working right now on speeding up the UI, which should resolve some of the cases where reports are too large to display. Hopefully it will help a lot 🤞

We are also considering a flag the user can set explicitly (e.g. "large dataset") that would generate a variation of the report better suited for larger datasets. It would include not only sampling but also different, aggregated views for some parts of the report.
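To illustrate why an aggregated view helps (this is a generic sketch of the idea, not Evidently's implementation): instead of embedding every raw point in the HTML, a plot can embed only binned counts, so the payload stays constant no matter how many rows the dataset has.

```python
import numpy as np

# 150K raw values, as in the reported dataset size.
values = np.random.default_rng(0).normal(size=150_000)

# Aggregate into a fixed-size histogram: 50 counts + 51 bin edges.
counts, edges = np.histogram(values, bins=50)

# The report would store ~100 numbers instead of 150K raw points.
payload = {"edges": edges.tolist(), "counts": counts.tolist()}
```

The same idea applies per column: a 65-column, 150K-row frame aggregates to a few kilobytes of plot data rather than hundreds of megabytes.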

Agreed on your point about making the large-dataset limitation and the sampling option clearer for Jupyter notebooks: we have now added this to the Quick-start part of the docs.

— emeli-dral, Aug 20 '21

Update: reports no longer include raw-data plots by default, which reduces report size significantly.

— emeli-dral, Sep 21 '23