
feat: delete orphaned files

Open jayceslesar opened this issue 9 months ago • 9 comments

Closes #1200

Rationale for this change

Ability to do more table maintenance from pyiceberg (iceberg-python?)

Are these changes tested?

Added a test!

Are there any user-facing changes?

Yes, this is a new method on the Table class.

jayceslesar avatar Apr 29 '25 22:04 jayceslesar

a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how we use .inspect?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

kevinjqliu avatar May 04 '25 01:05 kevinjqliu

> a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how we use .inspect?
>
> i like the idea of having all the table maintenance functions together, similar to delta table's optimize

I think that makes sense -- would https://github.com/apache/iceberg-python/pull/1880 end up there too?

Also, ideally there is a CLI that exposes all the maintenance actions too, right?

I think moving things to a new OptimizeTable class in a new namespace, optimize.py, makes a lot of sense. It can be modeled very similarly to InspectTable and generally makes things cleaner. I think it still makes sense to keep all_known_files inside of inspect though, and the new OptimizeTable can still use it.
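A minimal sketch of how such a namespace could hang off the table, mirroring the `.inspect` accessor pattern. Everything here is a toy stand-in written for illustration: `Table`, `InspectTable`, `OptimizeTable`, `referenced_files`, and `orphaned_files` are hypothetical and are not pyiceberg's actual classes or signatures.

```python
from dataclasses import dataclass, field


class InspectTable:
    """Stand-in for a read-only inspection namespace (not pyiceberg's real class)."""

    def __init__(self, tbl: "Table") -> None:
        self.tbl = tbl

    def all_known_files(self) -> set[str]:
        # Hypothetical: every file path the table metadata still references.
        return set(self.tbl.referenced_files)


class OptimizeTable:
    """Hypothetical maintenance namespace, exposed the same way as `inspect`."""

    def __init__(self, tbl: "Table") -> None:
        self.tbl = tbl

    def orphaned_files(self, listed_files: set[str]) -> set[str]:
        # Reuse the inspect namespace rather than duplicating its logic.
        return listed_files - self.tbl.inspect.all_known_files()


@dataclass
class Table:
    """Toy table that only tracks the file paths it references."""

    referenced_files: set[str] = field(default_factory=set)

    @property
    def inspect(self) -> InspectTable:
        return InspectTable(self)

    @property
    def optimize(self) -> OptimizeTable:
        return OptimizeTable(self)


tbl = Table(referenced_files={"s3://bucket/data/a.parquet"})
listed = {"s3://bucket/data/a.parquet", "s3://bucket/data/stale.parquet"}
orphans = tbl.optimize.orphaned_files(listed)
```

Keeping `all_known_files` on the inspect side means the maintenance namespace only adds the destructive operations, while the read-only view stays usable on its own.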

jayceslesar avatar May 04 '25 16:05 jayceslesar

> i like the idea of having all the table maintenance functions together, similar to delta table's optimize

That's a good point. However, I think we should also be able to run them separately. For example, deleting orphan files won't affect the speed of the table, so it is more of a maintenance feature to reduce object storage costs. Deleting orphan files can also be pretty costly because of the list operation; ideally you would delegate this to a catalog that uses, for example, S3 Inventory.
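The cost comes from having to list every object under the table location and compare the result against what the metadata references. A rough sketch of that comparison, using a local directory walk as a stand-in for an object-store listing. `find_orphans`, its parameters, and the age cutoff are illustrative assumptions, not this PR's implementation:

```python
import os
import time
from pathlib import Path


def find_orphans(location: Path, referenced: set[str], older_than_seconds: float) -> set[str]:
    """Return files under `location` that are unreferenced and older than the cutoff."""
    cutoff = time.time() - older_than_seconds
    orphans: set[str] = set()
    # The full walk is the expensive part: it touches every file under the location.
    for root, _dirs, names in os.walk(location):
        for name in names:
            path = Path(root) / name
            # Assumption: skip recent files so in-flight writes are not treated as orphans.
            if str(path) not in referenced and path.stat().st_mtime < cutoff:
                orphans.add(str(path))
    return orphans
```

An inventory-style report would replace the `os.walk` loop with a precomputed listing, which is why delegating that step to the catalog is attractive.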

Fokko avatar May 13 '25 14:05 Fokko

@Fokko we probably also want pyiceberg to have some idea about https://iceberg.apache.org/spec/#delete-formats right? Is it currently aware of those files?

jayceslesar avatar Jun 24 '25 12:06 jayceslesar

@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

Fokko avatar Jun 24 '25 14:06 Fokko

> @jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

Sounds good, I will add the partition statistics files when that is merged!

jayceslesar avatar Jun 24 '25 15:06 jayceslesar

One issue I've found with this PR is that the catalog properties need to propagate to PyArrowFileIO(properties=...), otherwise endpoint, authentication, etc. for things like S3 simply fail ...

aammar5 avatar Jul 10 '25 15:07 aammar5

Going to get around to adding tests for both types of FileIO... @Fokko @kevinjqliu anything else you think we need here?

jayceslesar avatar Sep 22 '25 21:09 jayceslesar

@jayceslesar how's this coming? Let me know if I can help with anything. I'd like to use this in prod as well!

ForeverAngry avatar Nov 10 '25 15:11 ForeverAngry