Repeated rows (grid cells) in metadata / metadata-image mismatches
I downloaded the Core-S2L2A data and found 7,197 repeated grid cells in the metadata. Some rows that share a grid cell are identical in every field, while others differ in fields such as product id and cloud cover. However, in the parquet files the corresponding rows always contain exactly the same image with the same product id, whether the metadata rows are identical or not. Also, because the product id from the parquet is used to name the image files, some files were overwritten: after downloading all images with the provided script, I only got 2,245,886 - 7,197 = 2,238,689 images.
This suggests that some mismatches were probably introduced when the dataset was generated. I am fine with ignoring these images, but I would like to confirm that the rest of the metadata and images match perfectly.
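One way the remaining rows could be checked, as a rough sketch: for each parquet shard, compare the grid cell and product id stored at each row position with what the metadata expects at that `parquet_row`. The column names `grid_cell`, `product_id`, and `parquet_row` come from the metadata; that the image parquets carry `grid_cell` and `product_id` columns of their own is an assumption here, and `find_mismatches` is a hypothetical helper, not part of MajorTOM. The toy frames stand in for real data.

```python
import pandas as pd

def find_mismatches(meta: pd.DataFrame, shard: pd.DataFrame) -> pd.DataFrame:
    """Return metadata rows that disagree with the shard row they point to.

    meta  : metadata rows of ONE shard (columns parquet_row, grid_cell, product_id)
    shard : rows of that shard in file order (columns grid_cell, product_id; assumed)
    """
    # Number the shard rows by position so they can be joined on parquet_row
    img = shard[["grid_cell", "product_id"]].reset_index(drop=True)
    img["parquet_row"] = img.index
    merged = meta.merge(img, on="parquet_row", how="left", suffixes=("_meta", "_img"))
    # A missing shard row shows up as NaN and is therefore also reported
    differs = (merged["grid_cell_meta"] != merged["grid_cell_img"]) | (
        merged["product_id_meta"] != merged["product_id_img"]
    )
    return merged[differs]

# Toy example: the metadata expects P1 at row 1, but the shard holds P2 twice
meta = pd.DataFrame({
    "parquet_row": [0, 1, 2],
    "grid_cell": ["884D_339L", "917D_239R", "917D_239R"],
    "product_id": ["P0", "P1", "P2"],
})
shard = pd.DataFrame({
    "grid_cell": ["884D_339L", "917D_239R", "917D_239R"],
    "product_id": ["P0", "P2", "P2"],
})
bad = find_mismatches(meta, shard)
print(bad[["parquet_row", "product_id_meta", "product_id_img"]])
```

On the real dataset this would be run once per `parquet_url`, reading only the `grid_cell` and `product_id` columns of each shard so that no image bytes are loaded.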
The first 30 rows with repeated grid cells (15 pairs) are shown below. Most of these paired rows are not identical, but some are; for example, rows 6174 and 6175 are exactly the same. However, in the parquet, the two rows of every pair have the same image content and the same image name (product id).
grid_cell grid_row_u grid_col_r
667 917D_239R -917 239
668 917D_239R -917 239
1819 907D_75L -907 -75
1820 907D_75L -907 -75
1976 906D_58L -906 -58
1977 906D_58L -906 -58
2678 902D_38L -902 -38
2679 902D_38L -902 -38
2927 901D_224R -901 224
2928 901D_224R -901 224
3126 900D_227R -900 227
3127 900D_227R -900 227
3178 900D_305R -900 305
3179 900D_305R -900 305
3388 899D_315R -899 315
3389 899D_315R -899 315
3756 897D_278R -897 278
3757 897D_278R -897 278
5056 890D_332L -890 -332
5057 890D_332L -890 -332
5437 888D_325L -888 -325
5438 888D_325L -888 -325
5490 888D_156L -888 -156
5491 888D_156L -888 -156
5640 887D_317L -887 -317
5641 887D_317L -887 -317
5918 886D_61L -886 -61
5919 886D_61L -886 -61
6174 884D_339L -884 -339
6175 884D_339L -884 -339
product_id
667 S2A_MSIL2A_20160116T193712_N0201_R113_T57CWJ_20160116T193710
668 S2B_MSIL2A_20201023T174439_N0500_R069_T57CWJ_20230307T175341
1819 S2B_MSIL2A_20211016T083939_N0301_R035_T23CMK_20211016T115409
1820 S2B_MSIL2A_20221204T081929_N0400_R092_T23CMK_20221204T101225
1976 S2B_MSIL2A_20220124T083939_N0301_R035_T25CDK_20220124T113510
1977 S2B_MSIL2A_20200221T082939_N0500_R135_T25CDK_20230427T224239
2678 S2B_MSIL2A_20221130T070229_N0400_R034_T27CVL_20221130T081600
2679 S2B_MSIL2A_20191117T080929_N0500_R049_T27CVL_20230612T124130
2927 S2A_MSIL2A_20200103T221841_N0500_R086_T52CDR_20230424T192507
2928 S2B_MSIL2A_20191226T220859_N0500_R043_T52CDR_20230601T164232
3126 S2B_MSIL2A_20210201T221839_N0500_R086_T52CDR_20230516T155514
3127 S2B_MSIL2A_20200121T222829_N0500_R129_T52CDR_20230426T100350
3178 S2B_MSIL2A_20200203T191459_N9999_R027_T59CNL_20230904T043145
3179 S2B_MSIL2A_20181217T190459_N9999_R127_T59CNL_20230421T175343
3388 S2B_MSIL2A_20200205T181509_N9999_R055_T60CVR_20230905T125206
3389 S2B_MSIL2A_20201127T183459_N0500_R141_T60CVR_20230321T114502
3756 S2B_MSIL2A_20221104T182459_N0400_R098_T56CMR_20221104T213737
3757 S2B_MSIL2A_20210127T194529_N0500_R013_T56CMR_20230603T074316
5056 S2B_MSIL2A_20181221T184459_N9999_R041_T02CNS_20230421T232504
5057 S2A_MSIL2A_20160209T155812_N0201_R025_T02CNS_20160209T155809
5437 S2B_MSIL2A_20221206T172409_N0509_R126_T03CWM_20221206T190829
5438 S2B_MSIL2A_20230220T174429_N0509_R069_T03CWM_20230220T223723
5490 S2B_MSIL2A_20210215T115259_N0500_R137_T17CNM_20230518T001413
5491 S2B_MSIL2A_20201012T113309_N0500_R051_T17CNM_20230325T024812
5640 S2B_MSIL2A_20211209T164349_N0301_R097_T04CES_20211209T201608
5641 S2B_MSIL2A_20210201T171409_N0500_R083_T04CES_20230530T090343
5918 S2B_MSIL2A_20211012T085959_N0301_R121_T25CEM_20211012T121006
5919 S2B_MSIL2A_20220119T093009_N0301_R107_T25CEM_20220119T123235
6174 S2B_MSIL2A_20201203T171359_N0500_R083_T03CVM_20230303T124834
6175 S2B_MSIL2A_20201203T171359_N0500_R083_T03CVM_20230303T124834
timestamp cloud_cover nodata centre_lat centre_lon
667 2016-01-16 19:37:12 0.000000 1.0 -82.318521 161.746594
668 2020-10-23 17:44:39 0.431957 0.0 -82.318521 161.746594
1819 2021-10-16 08:39:39 15.424540 0.0 -81.422448 -45.075915
1820 2022-12-04 08:19:29 23.754629 0.0 -81.422448 -45.075915
1976 2022-01-24 08:39:39 0.000000 0.0 -81.333697 -34.436068
1977 2020-02-21 08:29:39 0.000000 0.0 -81.333697 -34.436068
2678 2022-11-30 07:02:29 22.641203 0.0 -80.973720 -21.563397
2679 2019-11-17 08:09:29 15.894633 0.0 -80.973720 -21.563397
2927 2020-01-03 22:18:41 20.046308 0.0 -80.884333 127.884393
2928 2019-12-26 22:08:59 0.000000 0.0 -80.884333 127.884393
3126 2021-02-01 22:18:39 0.000000 0.0 -80.794284 128.172593
3127 2020-01-21 22:28:29 19.668094 0.0 -80.794284 128.172593
3178 2020-02-03 19:14:59 2.530369 0.0 -80.792787 172.106859
3179 2018-12-17 19:04:59 0.000000 0.0 -80.792787 172.106859
3388 2020-02-05 18:15:09 22.548710 0.0 -80.704521 176.096834
3389 2020-11-27 18:34:59 6.223085 0.0 -80.704521 176.096834
3756 2022-11-04 18:24:59 12.573381 0.0 -80.524480 152.603947
3757 2021-01-27 19:45:29 22.404491 0.0 -80.524480 152.603947
5056 2018-12-21 18:44:59 0.000000 0.0 -79.894843 -170.246226
5057 2016-02-09 15:58:12 0.000000 1.0 -79.894843 -170.246226
5437 2022-12-06 17:24:09 9.481915 0.0 -79.714898 -163.848452
5438 2023-02-20 17:44:29 7.681234 0.0 -79.714898 -163.848452
5490 2021-02-15 11:52:59 20.495623 0.0 -79.713908 -78.524767
5491 2020-10-12 11:33:09 0.003244 0.0 -79.713908 -78.524767
5640 2021-12-09 16:43:49 2.377208 0.0 -79.625552 -158.472930
5641 2021-02-01 17:14:09 14.692396 0.0 -79.625552 -158.472930
5918 2021-10-12 08:59:59 19.077803 0.0 -79.533921 -30.054839
5919 2022-01-19 09:30:09 21.544348 0.0 -79.533921 -30.054839
6174 2020-12-03 17:13:59 6.117003 0.0 -79.356587 -165.121839
6175 2020-12-03 17:13:59 6.117003 0.0 -79.356587 -165.121839
crs
667 EPSG:32757
668 EPSG:32757
1819 EPSG:32723
1820 EPSG:32723
1976 EPSG:32725
1977 EPSG:32725
2678 EPSG:32727
2679 EPSG:32727
2927 EPSG:32752
2928 EPSG:32752
3126 EPSG:32752
3127 EPSG:32752
3178 EPSG:32759
3179 EPSG:32759
3388 EPSG:32760
3389 EPSG:32760
3756 EPSG:32756
3757 EPSG:32756
5056 EPSG:32702
5057 EPSG:32702
5437 EPSG:32703
5438 EPSG:32703
5490 EPSG:32717
5491 EPSG:32717
5640 EPSG:32704
5641 EPSG:32704
5918 EPSG:32725
5919 EPSG:32725
6174 EPSG:32703
6175 EPSG:32703
parquet_url
667 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00002.parquet
668 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00002.parquet
1819 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet
1820 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet
1976 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet
1977 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00004.parquet
2678 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet
2679 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet
2927 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet
2928 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00006.parquet
3126 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet
3127 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet
3178 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet
3179 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet
3388 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet
3389 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00007.parquet
3756 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00008.parquet
3757 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00008.parquet
5056 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet
5057 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet
5437 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet
5438 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet
5490 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet
5491 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00011.parquet
5640 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet
5641 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet
5918 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet
5919 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00012.parquet
6174 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00013.parquet
6175 https://huggingface.co/datasets/Major-TOM/Core-S2L2A/resolve/main/images/part_00013.parquet
parquet_row geometry
667 167 POINT (161.747 -82.319)
668 168 POINT (161.747 -82.319)
1819 319 POINT (-45.076 -81.422)
1820 320 POINT (-45.076 -81.422)
1976 476 POINT (-34.436 -81.334)
1977 477 POINT (-34.436 -81.334)
2678 178 POINT (-21.563 -80.974)
2679 179 POINT (-21.563 -80.974)
2927 427 POINT (127.884 -80.884)
2928 428 POINT (127.884 -80.884)
3126 126 POINT (128.173 -80.794)
3127 127 POINT (128.173 -80.794)
3178 178 POINT (172.107 -80.793)
3179 179 POINT (172.107 -80.793)
3388 388 POINT (176.097 -80.705)
3389 389 POINT (176.097 -80.705)
3756 256 POINT (152.604 -80.524)
3757 257 POINT (152.604 -80.524)
5056 56 POINT (-170.246 -79.895)
5057 57 POINT (-170.246 -79.895)
5437 437 POINT (-163.848 -79.715)
5438 438 POINT (-163.848 -79.715)
5490 490 POINT (-78.525 -79.714)
5491 491 POINT (-78.525 -79.714)
5640 140 POINT (-158.473 -79.626)
5641 141 POINT (-158.473 -79.626)
5918 418 POINT (-30.055 -79.534)
5919 419 POINT (-30.055 -79.534)
6174 174 POINT (-165.122 -79.357)
6175 175 POINT (-165.122 -79.357)
A script to locate these rows:
import pyarrow.parquet as pq
import pandas as pd
import geopandas as gpd
from MajorTOM.metadata_helpers import metadata_from_url, filter_metadata, read_row, filter_download

# Load the metadata and build a GeoDataFrame from the cell centres
local_url = 'metadata.parquet'
df = pq.read_table(local_url).to_pandas()
df['timestamp'] = pd.to_datetime(df.timestamp)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.centre_lon, df.centre_lat), crs=df.crs.iloc[0])
print(len(gdf))

# Scan for runs of consecutive rows that share a grid_cell
# (assumes the default 0..N-1 integer index)
this_repeat = 1   # length of the current run
max_repeat = 1    # longest run seen so far
repeat = []       # positions of all rows that belong to a run
last_metadata = None
for index, metadata in gdf.iterrows():
    if last_metadata is not None and last_metadata['grid_cell'] == metadata['grid_cell']:
        this_repeat = this_repeat + 1
    else:
        # The previous run ended just before this row; record it if it had repeats
        if this_repeat > 1:
            repeat = repeat + [index-this_repeat+i for i in range(this_repeat)]
        if this_repeat > max_repeat:
            max_repeat = this_repeat
        last_metadata = metadata
        this_repeat = 1
# Handle a run that reaches the last row
if this_repeat > 1:
    repeat = repeat + [len(gdf)-this_repeat+i for i in range(this_repeat)]
if this_repeat > max_repeat:
    max_repeat = this_repeat
print(len(repeat), max_repeat)

# Print the first 30 repeated rows with all columns visible
repeated_gdf = gdf.iloc[repeat]
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'max_colwidth', None):
    print(repeated_gdf.head(30))
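For anyone who wants to reproduce this faster, the same rows can probably be found without the Python loop. This is a minimal sketch using pandas `duplicated(keep=False)`, shown on a toy frame rather than the real metadata; `repeated_grid_cells` is a hypothetical helper name. Note one difference: the loop only catches repeats in consecutive rows, while this version flags a repeated grid cell wherever it appears.

```python
import pandas as pd

def repeated_grid_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose grid_cell occurs more than once, in original order."""
    return df[df["grid_cell"].duplicated(keep=False)]

# Toy stand-in for the metadata table (the real one has many more columns)
toy = pd.DataFrame({
    "grid_cell": ["917D_239R", "917D_239R", "907D_75L", "884D_339L", "884D_339L"],
    "product_id": ["A", "B", "C", "D", "D"],
})

rep = repeated_grid_cells(toy)
# Rows that are identical in every column (the situation of rows 6174 and 6175)
identical = rep[rep.duplicated(keep=False)]
# Largest number of rows sharing a single grid cell
max_repeat = int(rep["grid_cell"].value_counts().max()) if len(rep) else 1
print(len(rep), max_repeat, len(identical))  # prints: 4 2 2
```

On the real table, `repeated_grid_cells(df)` would take the plain DataFrame read from `metadata.parquet`; comparing whole rows is simplest before the geometry column is added.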
Hi @LeungTsang - apologies for the lag with the replies here, Major TOM is now a bit of a side project since the two main authors have been busy launching a new lab (https://asterisk.coop/).
I will try my best to investigate these files soon and maybe update the corresponding files. It's not clear to me why this error happened during creation of the dataset. This will also mean that some parquets will have fewer than 500 files, but I think in most cases that shouldn't cause errors in users' scripts (as long as this number isn't hardcoded anywhere).