iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Create table format version constants

Open kevinjqliu opened this issue 1 year ago • 25 comments

Feature Request / Improvement

Define constants for the table format version, for V1, V2, and future V3 table versions.

Many places refer to the constant 1 and 2 as table versions. This will become harder to track and maintain as we add more features that are specific to a table version.

Example: https://github.com/search?q=repo%3Aapache%2Ficeberg-python+path%3Atable%2Fmetadata.py+%22%3D+1%22&type=code

kevinjqliu avatar Jun 24 '24 18:06 kevinjqliu

Hey @kevinjqliu, Can I pick up this issue.

harsha-mangena avatar Jun 24 '24 19:06 harsha-mangena

Sure thing, please let me know if you have any questions

kevinjqliu avatar Jun 24 '24 19:06 kevinjqliu

Could define a BaseVersion class or something that downstream classes could inherit from in order to dish out versions of functions as expected/store constants? Might be too much to start with and using constants is simpler and enables a bigger refactor later

jayceslesar avatar Jul 12 '24 01:07 jayceslesar

@harsha-mangena are you still working on this ? If not, can I work on this ?

tanmayrauth avatar Sep 15 '24 22:09 tanmayrauth

@tanmayrauth this seem stale, please feel free to work on it

kevinjqliu avatar Sep 16 '24 00:09 kevinjqliu

Thanks @kevinjqliu. What I understood that I shall create an enum declaration for (v1, v2, v3.. ) and get it replaced everywhere in code where we use 1, 2, 3... for table version. Is this understanding correct ?

tanmayrauth avatar Sep 16 '24 05:09 tanmayrauth

@tanmayrauth Yes! The enum will be easier to work with. There are a lot of raw string comparisons like this one https://github.com/search?q=repo%3Aapache%2Ficeberg-python+path%3Atable%2Fmetadata.py+%22%3D+1%22&type=code

kevinjqliu avatar Sep 16 '24 16:09 kevinjqliu

@kevinjqliu I found this TableVersion declaration already present.

Shall we convert this to Enum ? The only challenge will be to make sure that at each place where number (1,2) is passed we will have to change it to TableVersion.V1/v2 else there will be type error. Even at places where someone is calling pyiceberg library from outside.

If not the above approach then I can create an another Enum like TableFormatVersion and handle the original ask.

tanmayrauth avatar Sep 18 '24 05:09 tanmayrauth

Even at places where someone is calling pyiceberg library from outside.

Looks like TableVersion is just a type declaration. https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20TableVersion&type=code

I think it would be ok to change it to an Enum.

cc @sungwy do you have an opinion here?

kevinjqliu avatar Sep 18 '24 16:09 kevinjqliu

So, I shall convert it to an Enum, right?

tanmayrauth avatar Sep 20 '24 05:09 tanmayrauth

Yep I think that should be fine

kevinjqliu avatar Sep 20 '24 05:09 kevinjqliu

Hi Kevin, I was hoping I could take up this issue.

willcollins10 avatar Nov 25 '24 22:11 willcollins10

@willcollins10 assigned to you

kevinjqliu avatar Nov 26 '24 18:11 kevinjqliu

@kevinjqliu can i work on this issue ? I have already noted that we need an Enum and just replace format version everywhere with it?

iyad-f avatar Feb 03 '25 16:02 iyad-f

@iyad-f sure, i dont think anyone else is working on it.

just replace format version everywhere with it?

yea i think its more readable (and maintainable) to replace these constants https://grep.app/search?f.path=pyiceberg%2Ftable%2F&f.repo=apache%2Ficeberg-python&f.repo.pattern=iceberg-python&q=%3D%3D+3

kevinjqliu avatar Feb 03 '25 21:02 kevinjqliu

@kevinjqliu I would have to replace it in classes of TableMetadata too right?, for example for TableMetadataV1 format_version would now be something like format_version: TableVersion = Field(alias="format-version", default=TableVersion.V1)

iyad-f avatar Feb 03 '25 21:02 iyad-f

also wherever format_version is used as a value would have to do format_version.value, or would have to go the other way around for example for DATA_FILE_TYPE: Dict[int, StructType] it would be now DATA_FILE_TYPE: Dict[TableVersion, StructType]

iyad-f avatar Feb 03 '25 21:02 iyad-f

hm, good point.

https://github.com/apache/iceberg-python/blob/f195dadeb85b1c59f2cd85f80802344cc0842386/pyiceberg/table/metadata.py#L500

would be something like

class FormatVersion(int, Enum):
    V1 = 1
    V2 = 2
    V3 = 3

class MyModel(BaseModel):
    format_version: FormatVersion = Field(alias="format-version", default=FormatVersion.V2)

@Fokko WDYT? Do you think this is something we'd want to do?

kevinjqliu avatar Feb 03 '25 21:02 kevinjqliu

naming the class TableVersion can be a bit consistent with what it was already, but at the same time error prone too if not noticed properly, since type checking is also turned off, i think naming it FormatVersion will be a better choice

iyad-f avatar Feb 03 '25 22:02 iyad-f

just a quick search for format_version in the library tells me it would need to be changed almost everywhere from tests to actual library code. If you have decided on whether to move forward with that design or not let me know, so that i can start working on it

iyad-f avatar Feb 03 '25 22:02 iyad-f

@kevinjqliu So are we going with this approach, should i start working on this or wait for @Fokko's recommendation?

iyad-f avatar Feb 04 '25 14:02 iyad-f

@iyad-f I think we can start it and see where it takes us. its a good exercise to see where we use version numbers. esp now that we need to add V3

kevinjqliu avatar Feb 06 '25 04:02 kevinjqliu

@kevinjqliu ok i will be starting this now, and will make a PR soon for it

iyad-f avatar Feb 11 '25 16:02 iyad-f

Sorry for being late here, my mailbox is overflowing a bit.

The earlier example of:

DATA_FILE_TYPE: Dict[int, StructType]

is wrong, and should be:

DATA_FILE_TYPE: Dict[TableVersion, StructType]

What's the benefit we get from turning this into an Enum? It seems more verbose to me. Also, things like:

from pyiceberg.typedef import FormatVersion

tbl.upgrade_table_version(FormatVersion.V3)

Which is not very intuitive to the end users.

Fokko avatar Feb 11 '25 19:02 Fokko

Hello @Fokko,

The original PR statement addresses your question. It was noted that in numerous locations, we refer to the constants 1 and 2 as table versions, which will become increasingly difficult to track and maintain in the future. I concur with this assessment, and therefore, I propose that we transition this to enums only. @kevinjqliu would appreciate your input on this matter as well.

iyad-f avatar Mar 15 '25 02:03 iyad-f

I'll go ahead and close this issue since we have a TypeAlias today:

https://github.com/apache/iceberg-python/blob/ccaa15cf8ac1eb1202256d17efc9cf9d20723d20/pyiceberg/typedef.py#L210

Sorry for the misunderstanding, but I think this is more pythonic than an Enum.

Fokko avatar Aug 04 '25 11:08 Fokko