evidently icon indicating copy to clipboard operation
evidently copied to clipboard

Options ... and other implementation details

Open burkovae opened this issue 4 years ago • 2 comments

The case

The only time you use OptionProvider is in the Pipeline:

image

The Pipeline sets the same OptionsProvider for all analyzers.

class Pipeline:
    _analyzers: List[Type[Analyzer]]
    stages: Sequence[PipelineStage]
    analyzers_results: Dict[Type[Analyzer], object]
    options_provider: OptionsProvider

    def __init__(self, stages: Sequence[PipelineStage], options: list):
        self.stages = stages
        self.analyzers_results = {}
        self.options_provider = OptionsProvider()
        self._analyzers = list(itertools.chain.from_iterable([stage.analyzers() for stage in stages]))
        for option in options:
            self.options_provider.add(option)

def execute(self, ...):
        ...
        for analyzer in self.get_analyzers():
            instance = analyzer()
            instance.options_provider = self.options_provider
            self.analyzers_results[analyzer] = instance.calculate(reference_data, current_data, column_mapping)

Considering this, we can implement the following test case:

import unittest

from evidently.analyzers.base_analyzer import Analyzer
from evidently.options import DataDriftOptions, OptionsProvider


class TestAnalyzer(Analyzer):

    def calculate(self, reference_data, current_data, column_mapping):
        options = self.options_provider.get(DataDriftOptions)
        return options.stattest_func(reference_data, current_data)


class TestOptions(unittest.TestCase):

    def test_different_configurations_lead_to_different_computations(self):
        options_provider = OptionsProvider()
        options_provider.add(DataDriftOptions(stattest_func=lambda x, y: 3.14))
        analyzer_1 = TestAnalyzer()
        analyzer_1.options_provider = options_provider
        self.assertAlmostEqual(analyzer_1.calculate(None, None, None), 3.14)

        options_provider.add(DataDriftOptions(stattest_func=lambda x, y: 2.71))
        analyzer_2 = TestAnalyzer()
        analyzer_2.options_provider = options_provider
        self.assertAlmostEqual(analyzer_2.calculate(None, None, None), 2.71)

        # BUG
        self.assertAlmostEqual(analyzer_1.calculate(None, None, None), 3.14)

Basically there is no way you can set individual options at the moment. Unless it is intentional, I would consider this as bug.

The cause

Imho, I think there is some conceptual misunderstanding about class and instance attributes. In some cases in code it is somewhat confusing whether class or instance attributes are intended.

For example, Pipeline again,

class Pipeline:
    _analyzers: List[Type[Analyzer]]
    stages: Sequence[PipelineStage]
    analyzers_results: Dict[Type[Analyzer], object]
    options_provider: OptionsProvider

    def __init__(self, stages: Sequence[PipelineStage], options: list):
        self.stages = stages
        self.analyzers_results = {}
        self.options_provider = OptionsProvider()
        self._analyzers = list(itertools.chain.from_iterable([stage.analyzers() for stage in stages]))
        for option in options:
            self.options_provider.add(option)

All four class attributes are overwritten in the constructor, i.e. each new instance would overwrite class attributes for all instances. Take _analyzers for example: obviously (?) they intend to be instance attributes. But when you instantiate two pipelines with different analyzers, the first instance would get the same analyzers as the second one.

Why it does not break right now? Well, that is because Pipeline (or extensions thereof like Dashboard or Profile) is only used once per run/execution.

If for some reason anyone wants to instantiate more the one Dashboard or make a Dashboard with one set of analyzers and Profile with other set of analyzers, they would run on the same set of analyzers.

image

TL;DR

Long story short: do you think the usage of class attributes is appropriate Analyzer and Pipeline?

burkovae avatar Dec 12 '21 18:12 burkovae

Hi @burkovae,

About class attributes - we use instance attribute typehints. Here link to PEP-0526: https://www.python.org/dev/peps/pep-0526/#class-and-instance-variable-annotations

All four class attributes are overwritten in the constructor, i.e. each new instance would overwrite class attributes for all instances.

No, this code should never work this way, primarily in __init__ section we are setting those attributes for instance not for class (for class should be something like Analyzer.stages = ...).

About your bug: OptionsProvider is special construction which register single instance of options and passes it to all related components. So in your case - second call to analyzer1.calculate will see new options (in fact there should be exception if we try to put same option twice). It is intentional, because Options not only for altering behaviour of single Analyzer, but can be used by another Analyzer (so all analyzers which use same Options will see same parameters).

Example: We decided to introduce options for altering color scheme in dashboard. We create class ColorOptions with some fields. After, we use this options in all Widgets so they receive same ColorOptions passed in Pipeline once.

Liraim avatar Dec 12 '21 21:12 Liraim

Thank @Liraim , indeed I was not aware of PEP-0526. And I guess that confused me. Anyhow, I ran a test just to make sure, and yes, you are right, the instance attributes (and options) are different for each Pipeline in

class TestPipeline(unittest.TestCase):

    def test_pipeline_refer_to_different_analyzers(self):
        stage_1 = PipelineStage()
        stage_1.add_analyzer(TestAnalyzer)
        stage_2 = PipelineStage()
        stage_2.add_analyzer(TestAnalyzer2)

        pipeline_1 = Pipeline([stage_1], [DataDriftOptions(stattest_func=lambda x, y: 3.14)])
        pipeline_2 = Pipeline([stage_2], [DataDriftOptions(stattest_func=lambda x, y: 2.71)])

        self.assertTrue(pipeline_1.options_provider is not pipeline_2.options_provider)
        self.assertTrue(pipeline_1._analyzers != pipeline_2._analyzers)

So that is out of the way.

Still the issue with Options remains.

OptionsProvider is special construction which register single instance of options and passes it to all related components. So in your case - second call to analyzer1.calculate will see new options (in fact there should be exception if we try to put same option twice).

        options_provider = OptionsProvider()
        option = DataDriftOptions(stattest_func=lambda x, y: 3.14)
        options_provider.add(option)
        options_provider.add(option)

That does not raise any Exception. So something is not working as intended.

It is intentional, because Options not only for altering behavior of single Analyzer, but can be used by another Analyzer (so all analyzers which use same Options will see same parameters).

That would be very surprising for me (but also maybe for other users?). As you can tell from the test case, if anybody changes a statistical function in Options, that would change computation in all analyzers. Even in one Pipeline instance all supposedly different analyzers (with DataDriftOptions) would use the same statistical function. That would not be something that I expect.

I understand your example for colorization. But that is not implemented (I could not find ColorOptions) as far as I can tell.

Furthermore - and this is the use case I was exploring - using customized statistical function is a very good way to reuse existing code base and add other metrics. Specifically, by replacing the statistical function with, e.g. Kullback-Leibler divergence in NumTargetDriftAnalyzer you would easily add some other metrics. It would however break all the other analyzers that depend on DataDriftOptions if you put them into one pipeline.

burkovae avatar Dec 12 '21 23:12 burkovae

We have refactored Dashboards to Reports and instead of analyzers we use metrics now.

emeli-dral avatar Sep 21 '23 13:09 emeli-dral