Create table format version constants
Feature Request / Improvement
Define constants for the table format version, for V1, V2, and future V3 table versions.
Many places refer to the constant 1 and 2 as table versions. This will become harder to track and maintain as we add more features that are specific to a table version.
Example: https://github.com/search?q=repo%3Aapache%2Ficeberg-python+path%3Atable%2Fmetadata.py+%22%3D+1%22&type=code
Hey @kevinjqliu, Can I pick up this issue.
Sure thing, please let me know if you have any questions
Could define a BaseVersion class or something that downstream classes could inherit from in order to dish out versions of functions as expected/store constants? Might be too much to start with and using constants is simpler and enables a bigger refactor later
@harsha-mangena are you still working on this ? If not, can I work on this ?
@tanmayrauth this seem stale, please feel free to work on it
Thanks @kevinjqliu. What I understood that I shall create an enum declaration for (v1, v2, v3.. ) and get it replaced everywhere in code where we use 1, 2, 3... for table version. Is this understanding correct ?
@tanmayrauth Yes! The enum will be easier to work with. There are a lot of raw string comparisons like this one https://github.com/search?q=repo%3Aapache%2Ficeberg-python+path%3Atable%2Fmetadata.py+%22%3D+1%22&type=code
@kevinjqliu I found this TableVersion declaration already present.
Shall we convert this to Enum ? The only challenge will be to make sure that at each place where number (1,2) is passed we will have to change it to TableVersion.V1/v2 else there will be type error. Even at places where someone is calling pyiceberg library from outside.
If not the above approach then I can create an another Enum like TableFormatVersion and handle the original ask.
Even at places where someone is calling pyiceberg library from outside.
Looks like TableVersion is just a type declaration.
https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20TableVersion&type=code
I think it would be ok to change it to an Enum.
cc @sungwy do you have an opinion here?
So, I shall convert it to an Enum, right?
Yep I think that should be fine
Hi Kevin, I was hoping I could take up this issue.
@willcollins10 assigned to you
@kevinjqliu can i work on this issue ? I have already noted that we need an Enum and just replace format version everywhere with it?
@iyad-f sure, i dont think anyone else is working on it.
just replace format version everywhere with it?
yea i think its more readable (and maintainable) to replace these constants https://grep.app/search?f.path=pyiceberg%2Ftable%2F&f.repo=apache%2Ficeberg-python&f.repo.pattern=iceberg-python&q=%3D%3D+3
@kevinjqliu I would have to replace it in classes of TableMetadata too right?, for example for TableMetadataV1 format_version would now be something like format_version: TableVersion = Field(alias="format-version", default=TableVersion.V1)
also wherever format_version is used as a value would have to do format_version.value, or would have to go the other way around for example for DATA_FILE_TYPE: Dict[int, StructType] it would be now DATA_FILE_TYPE: Dict[TableVersion, StructType]
hm, good point.
https://github.com/apache/iceberg-python/blob/f195dadeb85b1c59f2cd85f80802344cc0842386/pyiceberg/table/metadata.py#L500
would be something like
class FormatVersion(int, Enum):
V1 = 1
V2 = 2
V3 = 3
class MyModel(BaseModel):
format_version: FormatVersion = Field(alias="format-version", default=FormatVersion.V2)
@Fokko WDYT? Do you think this is something we'd want to do?
naming the class TableVersion can be a bit consistent with what it was already, but at the same time error prone too if not noticed properly, since type checking is also turned off, i think naming it FormatVersion will be a better choice
just a quick search for format_version in the library tells me it would need to be changed almost everywhere from tests to actual library code. If you have decided on whether to move forward with that design or not let me know, so that i can start working on it
@kevinjqliu So are we going with this approach, should i start working on this or wait for @Fokko's recommendation?
@iyad-f I think we can start it and see where it takes us. its a good exercise to see where we use version numbers. esp now that we need to add V3
@kevinjqliu ok i will be starting this now, and will make a PR soon for it
Sorry for being late here, my mailbox is overflowing a bit.
The earlier example of:
DATA_FILE_TYPE: Dict[int, StructType]
is wrong, and should be:
DATA_FILE_TYPE: Dict[TableVersion, StructType]
What's the benefit we get from turning this into an Enum? It seems more verbose to me. Also, things like:
from pyiceberg.typedef import FormatVersion
tbl.upgrade_table_version(FormatVersion.V3)
Which is not very intuitive to the end users.
Hello @Fokko,
The original PR statement addresses your question. It was noted that in numerous locations, we refer to the constants 1 and 2 as table versions, which will become increasingly difficult to track and maintain in the future. I concur with this assessment, and therefore, I propose that we transition this to enums only. @kevinjqliu would appreciate your input on this matter as well.
I'll go ahead and close this issue since we have a TypeAlias today:
https://github.com/apache/iceberg-python/blob/ccaa15cf8ac1eb1202256d17efc9cf9d20723d20/pyiceberg/typedef.py#L210
Sorry for the misunderstanding, but I think this is more pythonic than an Enum.