aind-data-schema icon indicating copy to clipboard operation
aind-data-schema copied to clipboard

Enum-like classes cant be used as dictionary keys

Open bruno-f-cruz opened this issue 1 year ago • 13 comments

Not sure if there is a different intended use...

To reproduce:

from aind_data_schema_models.platforms import Platform
from pydantic import BaseModel


class Foo(BaseModel):
    platform: dict[Platform.ONE_OF, str]


foo = Foo(platform={Platform.BEHAVIOR: "foo"})
print(foo)

bar = foo.model_dump()

bruno-f-cruz avatar Jun 03 '24 21:06 bruno-f-cruz

Tracking https://github.com/pydantic/pydantic/issues/9453

bruno-f-cruz avatar Jun 03 '24 23:06 bruno-f-cruz

This is actually more serious that I initially gave it credit for. This pattern completely prevents users from using the model_dump method:

import examples
from examples.fip_ophys_rig import rig
from aind_data_schema_models.modalities import Modality
rig.modalities = set([Modality.BEHAVIOR])
rig.model_dump()

raises the following error:

raceback (most recent call last):
  File ".\aind-data-schema\deserialization_order.py", line 27, in <module>     
    rig.model_dump()
  File ".\aind-data-schema\.venv\Lib\site-packages\pydantic\main.py", line 314, in model_dump
    return self.__pydantic_serializer__.to_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'dict'

bruno-f-cruz avatar Jun 07 '24 15:06 bruno-f-cruz

It seems like an issue with the pydantic serializer. Even this causes issues:

from typing import Set
from pydantic import BaseModel, ConfigDict
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    value: str

# This works
set([Cell(value="a"), Cell(value="b")])

class WeightedSeeds(BaseModel):
    seeds: Set[Cell]

foo = WeightedSeeds(seeds=set([Cell(value="a"), Cell(value="b")]))

# This raises unhashable type errors
foo.model_dump()

jtyoung84 avatar Jun 07 '24 16:06 jtyoung84

Possible solutions

1- Simplify everything and go back to native Enum. This is the preferred solution as it is rather isomorphic across programming languages and json.

2- write our own de(serializers) for a base class to be inherited from all complex Enum types. This strategy has a few problems:

  • We would need to serialize the class a a string (likely using the name property`) and deserialize by looking into the allowed types and finding the correct class to deserialize to. This means that all types must be discriminated unions, with the discriminator being the same as the property being serialized.
  • One must ensure that all properties of all objects that inherit from this base class are Literal and read-only (i.e. provided at construction type). Otherwise any sort of custom deserialization will inevitably be lossy;

3- Keep a service somewhere that takes a string and returns these complex objects. This is potentially very interesting as it would allow the schema to be decoupled from the specifics of this database. For instance, the schema would only have to keep track of all possible manufacturers without having to keep track of abbreviations, numeric IDs and other properties that might otherwise change with time.

bruno-f-cruz avatar Jun 07 '24 18:06 bruno-f-cruz

@saskiad @dyf looping you in too. I think this is a rather annoying regression bug (since these started off as an Enum). Especially when other AIND tools like the watchdog could benefit greatly from this change.

I will still look into the second solution but it just doesn't feel right. We should really just make all these Enums. It would also solve the weird pattern of typing the properties with Type.ONE_OF but using a different instance (Type.Value).

bruno-f-cruz avatar Jun 08 '24 02:06 bruno-f-cruz

Enums would be nice. I can't recall anymore exactly what the issue was that made me switch, but enums of objects were creating headaches somewhere. I do remember now that when we upgraded to v2, my original enum metadata class broke. I went with this solution. I'm open to suggestions for improvements since this has always been troublesome.

I also remember now too, that most of the docs I was reading were using tagged unions for this kind of situation, but not quite the same situation.

jtyoung84 avatar Jun 08 '24 02:06 jtyoung84

@jtyoung84 It's late and I might be missing something very obvious, but what about this?

from aind_data_schema_models.modalities import Modality

from pydantic import BaseModel, Field
from enum import Enum
from typing import Dict, Set


class Modalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS
    ICEPHYS = Modality.ICEPHYS


class SubSetOfModalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS


class Foo(BaseModel):
    modality: Dict[Modalities, str] = Field(default={Modalities.BEHAVIOR: "foo"})
    stuff: Set[Modalities] = Field(default={Modalities.BEHAVIOR, Modalities.OPHYS})
    another_subset: SubSetOfModalities = Field(default=SubSetOfModalities.BEHAVIOR, validate_default=True)


foo = Foo()
print(foo)
print(foo.model_dump())

This is just a proof of concept, we could improve a bit the subtyping I think. I think it would be so much cleaner for users too!

To be clear, I still think Enum of strings would be far easier to maintain.

bruno-f-cruz avatar Jun 08 '24 05:06 bruno-f-cruz

I have been playing with this a bit more, I think we can make this change be fully backward compatible with users code. The trick would be to define the subsets of Modalities as Literals (This is the python standard in most libraries anyway, so i think it is actually a good idea...)

The previous example would thus become:

from aind_data_schema_models.modalities import Modality

from pydantic import BaseModel, Field
from enum import Enum
from typing import Dict, Set, Literal


class Modalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS
    ICEPHYS = Modality.ICEPHYS


class Foo(BaseModel):
    modality: Dict[Modalities, str] = Field(default={Modalities.BEHAVIOR: "foo"})
    stuff: Set[Modalities] = Field(default={Modalities.BEHAVIOR, Modalities.OPHYS})
    another_subset: Literal[Modalities.BEHAVIOR, Modalities.OPHYS] = Field(..., validate_default=True)


foo = Foo(another_subset=Modalities.BEHAVIOR)  # type hint: "another_subset: Literal[Modalities.BEHAVIOR, Modalities.OPHYS]"
print(foo)
print(foo.model_dump())

bruno-f-cruz avatar Jun 12 '24 17:06 bruno-f-cruz

Ignore the previous two suggestions. I dont think we want to go down this route as both the json-schema and round-trip (de)serialization breaks.

print(foo.model_validate_json(foo.model_dump_json()))

leads to:

pydantic_core._pydantic_core.ValidationError: 4 validation errors for Foo
modality.name='Behavior' abbreviation='behavior'.[key]
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value="name='Behavior' abbreviation='behavior'", input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
stuff.0
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value={'name': 'Planar optical ...'abbreviation': 'ophys'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
stuff.1
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value={'name': 'Behavior', 'abbreviation': 'behavior'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
another_subset
  Input should be <Modalities.BEHAVIOR: Behavior(name='Behavior', abbreviation='behavior')> or <Modalities.OPHYS: POphys(name='Planar optical physiology', abbreviation='ophys')> [type=literal_error, input_value={'name': 'Behavior', 'abbreviation': 'behavior'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/literal_error

It seems pydantic cant automatically convert the json object instance to the corresponding python class. This could be done with an extra validator, but I still think we should go with Enum of type string to keep the solution simple.

bruno-f-cruz avatar Jun 13 '24 18:06 bruno-f-cruz

@bruno-f-cruz I'm going to assign this to @dbirman to see if this is a still an issue.

jtyoung84 avatar Oct 07 '24 17:10 jtyoung84

Okay so revisiting this, I think the codegen -> enum classes is the solution here, yes? If so, let's close this issue and I'll look into getting that done this sprint.

dbirman avatar Oct 08 '24 02:10 dbirman

I dont think sgen alone will fix it since the pattern would still be the same. The problem rests with using classes as enum values unfort. As I mentioned in the sgen pr:

I should preface this PR by saying that the underlying pattern of these enum-like classes is not ideal and smells a bit like an anti-pattern to me. Mainly because of https://github.com/AllenNeuralDynamics/aind-data-schema/issues/960

Here's some more material relevant to the discussion

https://github.com/AllenNeuralDynamics/aind-data-schema-models/pull/64 https://github.com/AllenNeuralDynamics/aind-data-schema-models/issues/73

bruno-f-cruz avatar Oct 08 '24 13:10 bruno-f-cruz

Okay so I take it your other PRs although they reference this issue are not in fact fixing this problem they are just allowing you to use Union[], but you still can't use a straight dictionary anywhere because pydantic classes can't be used as keys and they break model_dump (since it dumps to a dictionary).

So the only real fix here would be to generate e.g. the abbreviations of the models as enums and store their data in a separate class?

dbirman avatar Oct 08 '24 14:10 dbirman

Coming back to this now that the codegen is in place and thinking about future upgrades. I think having duplicate classes (one enum and one the data) is not a great solution, it will inevitably cause confusion. Is it an issue to use model.abbreviation (or name, etc) as the dictionary keys? What's the reason to want to use the real model classes other than simplicity?

dbirman avatar Nov 19 '24 05:11 dbirman

I haven't look at the current way you are using the Modalities and their abbreviations so I may be missing something:

  • I think you summed it up pretty well with the simplicity argument. I feel using modalities as an abbreviation is a very non intuitive thing to have as a user. Especially given the type hint provided. You either use string, Modality with custom validator or an StrEnum that duplicates the current class. At this point I think it would be better to flip the strategy around. Go with value type Enums that have methods that create/reference the full literal classes instead. This may be a suitable alternative as long as the literal class information does not need to live in the json-schema;

  • using an abbreviation, without generating a duplicated enum, does not affect the compiled json-schema since the static type never exists, it is simply a string without json-schema validation.

  • Validation is also not supported from pydantic unless you create a custom type or validator. Eg checking if the corresponding class exists;

  • operationally, it also creates a situation that boils down to the same thing you just listed as wanting to avoid: duplicating references to the same concept. You will end up with people typing modalities as strings (when using abbreviations) and others using it as the static type.

Once again, I may be missing a simple alternative, but either way would be nice to document it for future use! Thanks for looking into this.

bruno-f-cruz avatar Nov 19 '24 06:11 bruno-f-cruz

Ok here's another argument to use simple enumerators. I used the codegen to replace all literal models with something that uses a string instead:

class MouseAnatomicalStructure:
    """MouseAnatomicalStructure"""

    ANATOMICAL_STRUCTURE = "_Anatomical_Structure"
    FIRST_POLAR_BODY = "_First_Polar_Body"
.... (Enter many lines here)

I was curious what the overhead to cold generate all these models actually is. After running both versions I got:

Enum of strings: Time taken to import: 0.39705514907836914 seconds
Enum of literal models: Time taken to import: 23.41841435432434 seconds

This would be a huge quality of life improvement since, currently, the sole reason our pipeline launcher takes 30secs to bootup is the aind-data-schema import that we need for a single step of the processing (i.e. create the session.json).

I also tested if the construction of the unions was part of the problem, but it appears that skipping that process barely shaves any time off the import. I believe the overhead is the class creation.

bruno-f-cruz avatar Nov 21 '24 17:11 bruno-f-cruz

The load times are addressed by this draft PR: https://github.com/AllenNeuralDynamics/aind-data-schema-models/issues/80. This doesn't provide the enum behavior you want though because the strings aren't real they are just there to support the external validator which constructs and returns the real object.

I'm leaning towards encouraging developers to use abbreviations where this is needed and/or having developers build the enum class downstream when they need it. So we would never use this pattern in the core aind-data-schema, but we can encourage people to use it in their own codebases.

dbirman avatar Nov 21 '24 17:11 dbirman

As an alternative, to make things backwards compatible, we can try something like:

from pydantic import BaseModel, Field, ConfigDict
from enum import Enum
from typing import Union, Optional

class BaseName(BaseModel):
    model_config = ConfigDict(extra="forbid", use_enum_values=True)
    name: str = Field(..., title="Name")
    abbreviation: Optional[str] = Field(default=None, title="Abbreviation")

class _Modality(BaseName):
    model_config = ConfigDict(frozen=True)
    name: str
    abbreviation: str

class Modality(str, Enum):
    BEHAVIOR="behavior"
    ECEPHYS="ecephys"
    @classmethod
    def from_abbreviation(cls, abbr: str):
       return cls(abbr)
    @classmethod
    def from_name(cls, name: str):
       return cls[name]

class Session(BaseModel):
    modality: Union[Modality, _Modality] = Field(...)

my_session = Session(modality=Modality.ECEPHYS)
print(my_session.model_dump_json())
{"modality":"ecephys"}

# We can keep the _Modality models in a db.
# We'll lose the ability for local validation. 
# Users can hack their own _Modality objects, and we won't be able to check until later.
# I think most users just treat things as enums though.

my_session2 = Session(modality={"name":"Extracellular electrophysiology", "abbreviation":"ecephys"})
print(my_session2.model_dump_json())
{"modality":{"name":"Extracellular electrophysiology","abbreviation":"ecephys"}}

jtyoung84 avatar Nov 21 '24 18:11 jtyoung84

Both of these solutions are inline with:

3- Keep a service somewhere that takes a string and returns these complex objects. This is potentially very interesting as it would allow the schema to be decoupled from the specifics of this database. For instance, the schema would only have to keep track of all possible manufacturers without having to keep track of abbreviations, numeric IDs and other properties that might otherwise change with time.

I think this is a good way to go about it. My online issue with both of these solutions is the json-schema that gets generated.

@dbirman I am not being able to generate a valid json-schema from your branch using pydantic @jtyoung84 The Union you are using is not discrminated. If this is not a problem for you, that's potentially ok, but I have had a fair amount of problems automatically generating code and UIs with these kind of annontations.

bruno-f-cruz avatar Nov 21 '24 18:11 bruno-f-cruz

Both of these solutions are inline with:

3- Keep a service somewhere that takes a string and returns these complex objects. This is potentially very interesting as it would allow the schema to be decoupled from the specifics of this database. For instance, the schema would only have to keep track of all possible manufacturers without having to keep track of abbreviations, numeric IDs and other properties that might otherwise change with time.

I think this is a good way to go about it. My online issue with both of these solutions is the json-schema that gets generated.

@dbirman I am not being able to generate a valid json-schema from your branch using pydantic @jtyoung84 The Union you are using is not discrminated. If this is not a problem for you, that's potentially ok, but I have had a fair amount of problems automatically generating code and UIs with these kind of annontations.

The Union is comparing the str Enum type and a generic model. It would only be an issue if the _Modality model was a string or a string Enum.

jtyoung84 avatar Nov 21 '24 18:11 jtyoung84

{
   "$defs": {
      "Modality": {
         "enum": [
            "behavior",
            "ecephys"
         ],
         "title": "Modality",
         "type": "string"
      },
      "_Modality": {
         "additionalProperties": false,
         "properties": {
            "name": {
               "title": "Name",
               "type": "string"
            },
            "abbreviation": {
               "title": "Abbreviation",
               "type": "string"
            }
         },
         "required": [
            "name",
            "abbreviation"
         ],
         "title": "_Modality",
         "type": "object"
      }
   },
   "properties": {
      "modality": {
         "anyOf": [
            {
               "$ref": "#/$defs/Modality"
            },
            {
               "$ref": "#/$defs/_Modality"
            }
         ],
         "title": "Modality"
      }
   },
   "required": [
      "modality"
   ],
   "title": "Session",
   "type": "object"
}

jtyoung84 avatar Nov 21 '24 18:11 jtyoung84

Both of these solutions are inline with:

3- Keep a service somewhere that takes a string and returns these complex objects. This is potentially very interesting as it would allow the schema to be decoupled from the specifics of this database. For instance, the schema would only have to keep track of all possible manufacturers without having to keep track of abbreviations, numeric IDs and other properties that might otherwise change with time.

I think this is a good way to go about it. My online issue with both of these solutions is the json-schema that gets generated.

@dbirman I am not being able to generate a valid json-schema from your branch using pydantic @jtyoung84 The Union you are using is not discrminated. If this is not a problem for you, that's potentially ok, but I have had a fair amount of problems automatically generating code and UIs with these kind of annontations.

In what context are you generating the json-schema for the models?

dbirman avatar Nov 21 '24 19:11 dbirman

via https://github.com/AllenNeuralDynamics/aind-data-schema-models/tree/82cfc0a2b7748d6c0a239a01e97318d77f40c189

from aind_data_schema.base import AindCoreModel
from aind_data_schema_models import mouse_anatomy
from typing import Literal


class Foo(AindCoreModel):
    body_part: mouse_anatomy.MouseEmgMuscles

print(Foo.model_json_schema())

bruno-f-cruz avatar Nov 21 '24 19:11 bruno-f-cruz

via https://github.com/AllenNeuralDynamics/aind-data-schema-models/tree/82cfc0a2b7748d6c0a239a01e97318d77f40c189

from aind_data_schema.base import AindCoreModel
from aind_data_schema_models import mouse_anatomy
from typing import Literal


class Foo(AindCoreModel):
    body_part: mouse_anatomy.MouseEmgMuscles

print(Foo.model_json_schema())

Yeah no I understand the code that would do this, I'm more asking why in general do you need the schema file for this? Like where is this going to be used downstream?

The *Model will generate a schema without issues and that seems more important. The MouseAnatomy class is in some sense just a wrapper for convenience.

dbirman avatar Nov 21 '24 19:11 dbirman

You mean in aind-data-schemas? https://github.com/AllenNeuralDynamics/aind-data-schema/blob/95e3882a0b185bcb2e97a98b04f0eb4af0eaf8e0/src/aind_data_schema/core/procedures.py#L564

bruno-f-cruz avatar Nov 21 '24 20:11 bruno-f-cruz

So in those situations it would be replaced by:

muscle: MouseAnatomyModel

That's just a regular pydantic model so the schemas should get generated without issue

dbirman avatar Nov 21 '24 20:11 dbirman

Oh I see but then you are dropping the support for autocompletion and type hinting that you just added with the generators?

bruno-f-cruz avatar Nov 21 '24 21:11 bruno-f-cruz

You get the autocompletion from the classes that derive from the MouseAnatomyMeta class, so e.g. you can do:

class Model(BaseModel):
  structure: MouseAnatomyModel = Field(default=MouseAnatomy.ANATOMICAL_STRUCTURE)

Or when generating models: Model(structure = MouseEmgMuscles.DELTOID)

dbirman avatar Nov 21 '24 22:11 dbirman

I see, you are ok with having a mismatch between the type of the field and the type of the instance. Ok I was missing that detail. Thanks for clarifying! image

I thought you were trying to have something that could use the syntax:

prop: EnumType = EnumType.Value

bruno-f-cruz avatar Nov 21 '24 22:11 bruno-f-cruz

As an alternative, to make things backwards compatible, we can try something like:

from pydantic import BaseModel, Field, ConfigDict
from enum import Enum
from typing import Union, Optional

class BaseName(BaseModel):
    model_config = ConfigDict(extra="forbid", use_enum_values=True)
    name: str = Field(..., title="Name")
    abbreviation: Optional[str] = Field(default=None, title="Abbreviation")

class _Modality(BaseName):
    model_config = ConfigDict(frozen=True)
    name: str
    abbreviation: str

class Modality(str, Enum):
    BEHAVIOR="behavior"
    ECEPHYS="ecephys"
    @classmethod
    def from_abbreviation(cls, abbr: str):
       return cls(abbr)
    @classmethod
    def from_name(cls, name: str):
       return cls[name]

class Session(BaseModel):
    modality: Union[Modality, _Modality] = Field(...)

my_session = Session(modality=Modality.ECEPHYS)
print(my_session.model_dump_json())
{"modality":"ecephys"}

# We can keep the _Modality models in a db.
# We'll lose the ability for local validation. 
# Users can hack their own _Modality objects, and we won't be able to check until later.
# I think most users just treat things as enums though.

my_session2 = Session(modality={"name":"Extracellular electrophysiology", "abbreviation":"ecephys"})
print(my_session2.model_dump_json())
{"modality":{"name":"Extracellular electrophysiology","abbreviation":"ecephys"}}

I don't see value in complicating the schema when we can encourage people to use .abbreviation. I would support:

  • In aind-data-schema as developers we ensure that modality fields are typed to ModalityModel (or whatever it's called) and in general in code we maintain we avoid using model types as keys
  • In user code they are welcome to use dictionaries wherever they want, and we encourage them to use the .abbreviation field for that purpose

dbirman avatar Nov 21 '24 22:11 dbirman