
Repeated rows (grid cells) in metadata / metadata-image mismatches

Open LeungTsang opened this issue 1 year ago • 2 comments

I downloaded the Core-S2L2A data and found 7,197 repeated grid cells in the metadata. Some rows that share a grid cell are completely identical, while others differ in fields such as product id and cloud cover. However, in the parquet files the corresponding rows always contain exactly the same image with the same product id, regardless of whether the metadata rows differ or are identical. Also, because the product id from the parquet is used to name the image files, some files were overwritten: after downloading all images with the provided script, I ended up with only 2,245,886 - 7,197 = 2,238,689 images.

This suggests that some mismatches probably occurred when the dataset was generated. I am fine with ignoring these images, but I would like to confirm that the rest of the metadata and images match exactly.
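One way such a check could look, as a rough sketch rather than a tested tool: for a single shard, compare each metadata row with the row stored at its `parquet_row` position. This assumes that the shard exposes `grid_cell` and `product_id` columns and that `parquet_row` is a zero-based position inside the shard; neither is confirmed here. The function name `find_mismatches` and the toy values are made up for illustration.

```python
import pandas as pd

def find_mismatches(meta: pd.DataFrame, shard: pd.DataFrame) -> pd.DataFrame:
    """Return metadata rows whose grid_cell/product_id differ from the shard row.

    `meta` holds the metadata rows of ONE shard; `shard` holds that shard's
    rows. Assumption: `parquet_row` is a zero-based position in the shard.
    """
    rows = meta["parquet_row"].to_numpy()
    stored = shard.reset_index(drop=True).iloc[rows][["grid_cell", "product_id"]]
    stored.index = meta.index  # align the stored values with the metadata rows
    differs = (stored["grid_cell"] != meta["grid_cell"]) | (
        stored["product_id"] != meta["product_id"]
    )
    out = meta.loc[differs, ["parquet_row", "grid_cell", "product_id"]].copy()
    out["stored_product_id"] = stored.loc[differs, "product_id"]
    return out

# Toy data (illustrative only): metadata row 1 names P2, but the shard holds P1
meta = pd.DataFrame({
    "grid_cell": ["a", "a", "b"],
    "product_id": ["P1", "P2", "P3"],
    "parquet_row": [0, 1, 2],
})
shard = pd.DataFrame({
    "grid_cell": ["a", "a", "b"],
    "product_id": ["P1", "P1", "P3"],
})
mismatches = find_mismatches(meta, shard)
print(mismatches)
```

On the real data, `shard` could be loaded with `pd.read_parquet(path, columns=["grid_cell", "product_id"])` so that no image bytes are read, and `meta` would be the metadata rows filtered to that shard's `parquet_url`.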

The first 30 rows (15 pairs) with repeated grid cells in the metadata are shown below. Most of these paired rows are not identical, but some are; for example, rows 6174 and 6175 are the same apart from parquet_row. However, in every pair the corresponding rows in the parquet have the same image content and the same image name (product id).

grid_cell  grid_row_u  grid_col_r  
667   917D_239R        -917         239   
668   917D_239R        -917         239   
1819   907D_75L        -907         -75   
1820   907D_75L        -907         -75   
1976   906D_58L        -906         -58   
1977   906D_58L        -906         -58   
2678   902D_38L        -902         -38   
2679   902D_38L        -902         -38   
2927  901D_224R        -901         224   
2928  901D_224R        -901         224   
3126  900D_227R        -900         227   
3127  900D_227R        -900         227   
3178  900D_305R        -900         305   
3179  900D_305R        -900         305   
3388  899D_315R        -899         315   
3389  899D_315R        -899         315   
3756  897D_278R        -897         278   
3757  897D_278R        -897         278   
5056  890D_332L        -890        -332   
5057  890D_332L        -890        -332   
5437  888D_325L        -888        -325   
5438  888D_325L        -888        -325   
5490  888D_156L        -888        -156   
5491  888D_156L        -888        -156   
5640  887D_317L        -887        -317   
5641  887D_317L        -887        -317   
5918   886D_61L        -886         -61   
5919   886D_61L        -886         -61   
6174  884D_339L        -884        -339   
6175  884D_339L        -884        -339   

product_id  
667   S2A_MSIL2A_20160116T193712_N0201_R113_T57CWJ_20160116T193710   
668   S2B_MSIL2A_20201023T174439_N0500_R069_T57CWJ_20230307T175341   
1819  S2B_MSIL2A_20211016T083939_N0301_R035_T23CMK_20211016T115409   
1820  S2B_MSIL2A_20221204T081929_N0400_R092_T23CMK_20221204T101225   
1976  S2B_MSIL2A_20220124T083939_N0301_R035_T25CDK_20220124T113510   
1977  S2B_MSIL2A_20200221T082939_N0500_R135_T25CDK_20230427T224239   
2678  S2B_MSIL2A_20221130T070229_N0400_R034_T27CVL_20221130T081600   
2679  S2B_MSIL2A_20191117T080929_N0500_R049_T27CVL_20230612T124130   
2927  S2A_MSIL2A_20200103T221841_N0500_R086_T52CDR_20230424T192507   
2928  S2B_MSIL2A_20191226T220859_N0500_R043_T52CDR_20230601T164232   
3126  S2B_MSIL2A_20210201T221839_N0500_R086_T52CDR_20230516T155514   
3127  S2B_MSIL2A_20200121T222829_N0500_R129_T52CDR_20230426T100350   
3178  S2B_MSIL2A_20200203T191459_N9999_R027_T59CNL_20230904T043145   
3179  S2B_MSIL2A_20181217T190459_N9999_R127_T59CNL_20230421T175343   
3388  S2B_MSIL2A_20200205T181509_N9999_R055_T60CVR_20230905T125206   
3389  S2B_MSIL2A_20201127T183459_N0500_R141_T60CVR_20230321T114502   
3756  S2B_MSIL2A_20221104T182459_N0400_R098_T56CMR_20221104T213737   
3757  S2B_MSIL2A_20210127T194529_N0500_R013_T56CMR_20230603T074316   
5056  S2B_MSIL2A_20181221T184459_N9999_R041_T02CNS_20230421T232504   
5057  S2A_MSIL2A_20160209T155812_N0201_R025_T02CNS_20160209T155809   
5437  S2B_MSIL2A_20221206T172409_N0509_R126_T03CWM_20221206T190829   
5438  S2B_MSIL2A_20230220T174429_N0509_R069_T03CWM_20230220T223723   
5490  S2B_MSIL2A_20210215T115259_N0500_R137_T17CNM_20230518T001413   
5491  S2B_MSIL2A_20201012T113309_N0500_R051_T17CNM_20230325T024812   
5640  S2B_MSIL2A_20211209T164349_N0301_R097_T04CES_20211209T201608   
5641  S2B_MSIL2A_20210201T171409_N0500_R083_T04CES_20230530T090343   
5918  S2B_MSIL2A_20211012T085959_N0301_R121_T25CEM_20211012T121006   
5919  S2B_MSIL2A_20220119T093009_N0301_R107_T25CEM_20220119T123235   
6174  S2B_MSIL2A_20201203T171359_N0500_R083_T03CVM_20230303T124834   
6175  S2B_MSIL2A_20201203T171359_N0500_R083_T03CVM_20230303T124834   

timestamp  cloud_cover  nodata  centre_lat  centre_lon  
667  2016-01-16 19:37:12     0.000000     1.0  -82.318521  161.746594   
668  2020-10-23 17:44:39     0.431957     0.0  -82.318521  161.746594   
1819 2021-10-16 08:39:39    15.424540     0.0  -81.422448  -45.075915   
1820 2022-12-04 08:19:29    23.754629     0.0  -81.422448  -45.075915   
1976 2022-01-24 08:39:39     0.000000     0.0  -81.333697  -34.436068   
1977 2020-02-21 08:29:39     0.000000     0.0  -81.333697  -34.436068   
2678 2022-11-30 07:02:29    22.641203     0.0  -80.973720  -21.563397   
2679 2019-11-17 08:09:29    15.894633     0.0  -80.973720  -21.563397   
2927 2020-01-03 22:18:41    20.046308     0.0  -80.884333  127.884393   
2928 2019-12-26 22:08:59     0.000000     0.0  -80.884333  127.884393   
3126 2021-02-01 22:18:39     0.000000     0.0  -80.794284  128.172593   
3127 2020-01-21 22:28:29    19.668094     0.0  -80.794284  128.172593   
3178 2020-02-03 19:14:59     2.530369     0.0  -80.792787  172.106859   
3179 2018-12-17 19:04:59     0.000000     0.0  -80.792787  172.106859   
3388 2020-02-05 18:15:09    22.548710     0.0  -80.704521  176.096834   
3389 2020-11-27 18:34:59     6.223085     0.0  -80.704521  176.096834   
3756 2022-11-04 18:24:59    12.573381     0.0  -80.524480  152.603947   
3757 2021-01-27 19:45:29    22.404491     0.0  -80.524480  152.603947   
5056 2018-12-21 18:44:59     0.000000     0.0  -79.894843 -170.246226   
5057 2016-02-09 15:58:12     0.000000     1.0  -79.894843 -170.246226   
5437 2022-12-06 17:24:09     9.481915     0.0  -79.714898 -163.848452   
5438 2023-02-20 17:44:29     7.681234     0.0  -79.714898 -163.848452   
5490 2021-02-15 11:52:59    20.495623     0.0  -79.713908  -78.524767   
5491 2020-10-12 11:33:09     0.003244     0.0  -79.713908  -78.524767   
5640 2021-12-09 16:43:49     2.377208     0.0  -79.625552 -158.472930   
5641 2021-02-01 17:14:09    14.692396     0.0  -79.625552 -158.472930   
5918 2021-10-12 08:59:59    19.077803     0.0  -79.533921  -30.054839   
5919 2022-01-19 09:30:09    21.544348     0.0  -79.533921  -30.054839   
6174 2020-12-03 17:13:59     6.117003     0.0  -79.356587 -165.121839   
6175 2020-12-03 17:13:59     6.117003     0.0  -79.356587 -165.121839   

crs  
667   EPSG:32757   
668   EPSG:32757   
1819  EPSG:32723   
1820  EPSG:32723   
1976  EPSG:32725   
1977  EPSG:32725   
2678  EPSG:32727   
2679  EPSG:32727   
2927  EPSG:32752   
2928  EPSG:32752   
3126  EPSG:32752   
3127  EPSG:32752   
3178  EPSG:32759   
3179  EPSG:32759   
3388  EPSG:32760   
3389  EPSG:32760   
3756  EPSG:32756   
3757  EPSG:32756   
5056  EPSG:32702   
5057  EPSG:32702   
5437  EPSG:32703   
5438  EPSG:32703   
5490  EPSG:32717   
5491  EPSG:32717   
5640  EPSG:32704   
5641  EPSG:32704   
5918  EPSG:32725   
5919  EPSG:32725   
6174  EPSG:32703   
6175  EPSG:32703   

parquet_url  
667   https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00002.parquet   
668   https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00002.parquet   
1819  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet   
1820  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet   
1976  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet   
1977  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet   
2678  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet   
2679  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet   
2927  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet   
2928  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet   
3126  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet   
3127  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet   
3178  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet   
3179  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet   
3388  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet   
3389  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet   
3756  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00008.parquet   
3757  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00008.parquet   
5056  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet   
5057  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet   
5437  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet   
5438  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet   
5490  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet   
5491  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet   
5640  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet   
5641  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet   
5918  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet   
5919  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet   
6174  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00013.parquet   
6175  https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00013.parquet   

parquet_row                  geometry  
667           167   POINT (161.747 -82.319)  
668           168   POINT (161.747 -82.319)  
1819          319   POINT (-45.076 -81.422)  
1820          320   POINT (-45.076 -81.422)  
1976          476   POINT (-34.436 -81.334)  
1977          477   POINT (-34.436 -81.334)  
2678          178   POINT (-21.563 -80.974)  
2679          179   POINT (-21.563 -80.974)  
2927          427   POINT (127.884 -80.884)  
2928          428   POINT (127.884 -80.884)  
3126          126   POINT (128.173 -80.794)  
3127          127   POINT (128.173 -80.794)  
3178          178   POINT (172.107 -80.793)  
3179          179   POINT (172.107 -80.793)  
3388          388   POINT (176.097 -80.705)  
3389          389   POINT (176.097 -80.705)  
3756          256   POINT (152.604 -80.524)  
3757          257   POINT (152.604 -80.524)  
5056           56  POINT (-170.246 -79.895)  
5057           57  POINT (-170.246 -79.895)  
5437          437  POINT (-163.848 -79.715)  
5438          438  POINT (-163.848 -79.715)  
5490          490   POINT (-78.525 -79.714)  
5491          491   POINT (-78.525 -79.714)  
5640          140  POINT (-158.473 -79.626)  
5641          141  POINT (-158.473 -79.626)  
5918          418   POINT (-30.055 -79.534)  
5919          419   POINT (-30.055 -79.534)  
6174          174  POINT (-165.122 -79.357)  
6175          175  POINT (-165.122 -79.357)

LeungTsang avatar Feb 01 '25 13:02 LeungTsang

A script to locate these rows.

import pyarrow.parquet as pq
import pandas as pd
import geopandas as gpd

# Load the metadata table from a local copy of metadata.parquet
local_url = 'metadata.parquet'
df = pq.read_table(local_url).to_pandas()
df['timestamp'] = pd.to_datetime(df.timestamp)
# centre_lon / centre_lat are geographic coordinates, so label the points as EPSG:4326
# rather than with the UTM crs of the first row
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.centre_lon, df.centre_lat), crs='EPSG:4326')
print(len(gdf))

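# Scan for runs of consecutive rows that share a grid_cell.
# Assumes the default 0..n-1 index, so `index` doubles as the row position.
this_repeat = 1  # length of the current run
max_repeat = 1  # longest run seen so far
repeat = []  # positions of all rows that belong to a run longer than 1
last_metadata = None
for index, metadata in gdf.iterrows():
    if last_metadata is not None and last_metadata['grid_cell'] == metadata['grid_cell']:
        this_repeat = this_repeat + 1
    else:
        # The previous run ended at index-1; record it if it contained repeats
        if this_repeat > 1:
            repeat = repeat + [index-this_repeat+i for i in range(this_repeat)]
        if this_repeat > max_repeat:
            max_repeat = this_repeat
        last_metadata = metadata
        this_repeat = 1
# Handle a run that extends to the last row
if this_repeat > 1:
    repeat = repeat + [len(gdf)-this_repeat+i for i in range(this_repeat)]
if this_repeat > max_repeat:
    max_repeat = this_repeat
print(len(repeat), max_repeat)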

repeated_gdf = gdf.iloc[repeat]

with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'max_colwidth', None):
    print(repeated_gdf.head(30))
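
The loop above only catches repeats that sit on consecutive rows. A shorter alternative, sketched here with pandas' `DataFrame.duplicated`, flags repeated grid cells wherever they occur. The helper name `find_repeated_cells` and the toy values are made up for illustration; on the real data, the plain `df` loaded from `metadata.parquet` would take the place of `toy`.

```python
import pandas as pd

def find_repeated_cells(meta: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose grid_cell occurs more than once, grouped by cell."""
    mask = meta.duplicated(subset="grid_cell", keep=False)
    return meta[mask].sort_values("grid_cell", kind="stable")

# Toy metadata standing in for metadata.parquet (values are illustrative only)
toy = pd.DataFrame({
    "grid_cell": ["917D_239R", "917D_239R", "907D_75L", "1U_1R", "907D_75L"],
    "product_id": ["A", "B", "C", "D", "C"],
})

repeated = find_repeated_cells(toy)
# Rows beyond the first occurrence of each grid cell
n_extra = int(toy.duplicated(subset="grid_cell").sum())
# Rows that are exact copies of another row in every column
n_identical = int(toy.duplicated(keep=False).sum())
print(repeated)
print(n_extra, n_identical)
```

Comparing `n_extra` with the number of missing image files is a quick way to see whether the repeated cells account for the whole gap.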

LeungTsang avatar Feb 01 '25 13:02 LeungTsang

Hi @LeungTsang - apologies for the lag with the replies here, Major TOM is now a bit of a side project since the two main authors have been busy launching a new lab (https://asterisk.coop/).

I will try my best to investigate these files soon and maybe update the corresponding files. It's not clear to me why this error happened during creation of the dataset. This will also mean that some parquets will have fewer than 500 files, but I think in most cases that shouldn't cause errors in users' scripts (as long as this number isn't hardcoded anywhere).

mikonvergence avatar Apr 08 '25 15:04 mikonvergence