`compact_files` raises `OSError: LanceError(IO): Execution error: Row ids did not arrive in sorted order: integers are ordered up to the 0th element`
In [57]: dataset.optimize.compact_files(max_bytes_per_file=1024*1024*256, batch_size=1024, num_threads=100)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[57], line 1
----> 1 dataset.optimize.compact_files(max_bytes_per_file=1024*1024*256, batch_size=1024, num_threads=100)
File ~/anaconda3/lib/python3.10/site-packages/lance/dataset.py:2624, in DatasetOptimizer.compact_files(self, target_rows_per_fragment, max_rows_per_group, max_bytes_per_file, materialize_deletions, materialize_deletions_threshold, num_threads, batch_size)
2562 """Compacts small files in the dataset, reducing total number of files.
2563
2564 This does a few things:
(...)
2613 lance.optimize.Compaction
2614 """
2615 opts = dict(
2616 target_rows_per_fragment=target_rows_per_fragment,
2617 max_rows_per_group=max_rows_per_group,
(...)
2622 batch_size=batch_size,
2623 )
-> 2624 return Compaction.execute(self._dataset, opts)
OSError: LanceError(IO): Execution error: Row ids did not arrive in sorted order: integers are ordered up to the 0th element, /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/task/poll.rs:288:44
Let me know what other debug information I can provide
I haven't seen this one in a long while. @wjones127 do you think stable row ids changed anything here?
Can you dump your fragment info?
import lance.debug
print(lance.debug.format_manifest(dataset))
Also, can you run validate?
dataset.validate()
Manifest: https://gist.github.com/tonyf/45c36a6c20b8b633964ca5295bf7be10
In [7]: dataset.validate()
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[7], line 1
----> 1 dataset.validate()
File ~/workspace/datasets/.venv/lib/python3.10/site-packages/lance/dataset.py:1820, in LanceDataset.validate(self)
1813 def validate(self):
1814 """
1815 Validate the dataset.
1816
1817 This checks the integrity of the dataset and will raise an exception if
1818 the dataset is corrupted.
1819 """
-> 1820 self._ds.validate()
OSError: Encountered corrupt file lance/images.lance: Duplicate fragment id 302279 found in dataset Path { raw: "lance/images.lance" }, /home/runner/work/lance/lance/rust/lance/src/dataset.rs:1227:21
Hmm this sounds like an earlier thread where someone else got duplicate fragment ids: https://discord.com/channels/1030247538198061086/1273015298970095708
I provided a snippet that can undo the duplication of fragment ids: https://discord.com/channels/1030247538198061086/1273015298970095708/1275557151049257070
I tried in that thread to reproduce getting duplicate ids, but was unsuccessful. If you have additional info about how to reproduce, that would be helpful. We would like to prevent datasets from getting into this state in the first place.
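For anyone hitting this, duplicate fragment ids can also be checked for directly, without waiting for validate() to fail. A minimal sketch using only the public get_fragments() / fragment_id API (the uri is a placeholder):

import lance
from collections import Counter

ds = lance.dataset("lance/images.lance")  # placeholder uri
counts = Counter(frag.fragment_id for frag in ds.get_fragments())
dupes = {fid: n for fid, n in counts.items() if n > 1}
print("duplicate fragment ids:", dupes or "none")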
Repro is a little tricky because a number of operations have been run on the dataset, mainly compaction & cleanup.
However, I just got a duplicate fragment id on a clean dataset again after adding a scalar BTREE index.
Okay just reproduced it. Dataset with columns id (string), image (bytes), caption (string), source (string), split (string).
Successfully added an index with
dataset.create_scalar_index("id", "BTREE")
However, running
In [158]: dataset.validate()
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[158], line 1
----> 1 dataset.validate()
File ~/anaconda3/lib/python3.10/site-packages/lance/dataset.py:1832, in LanceDataset.validate(self)
1825 def validate(self):
1826 """
1827 Validate the dataset.
1828
1829 This checks the integrity of the dataset and will raise an exception if
1830 the dataset is corrupted.
1831 """
-> 1832 self._ds.validate()
OSError: Encountered corrupt file lance/images_v2.lance: Duplicate fragment id 151871 found in dataset Path { raw: "lance/images_v2.lance" }, /home/runner/work/lance/lance/rust/lance/src/dataset.rs:1241:21
In [159]: lance.__version__
Out[159]: '0.17.0-beta.10'
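Pulling the repro together as a self-contained sketch (the schema matches the description above, but the data, uri, and row counts are made-up stand-ins, and synthetic data won't necessarily trip the bug):

import lance
import pyarrow as pa

# Hypothetical stand-in data; the real dataset has id/image/caption/source/split
n = 10_000
tbl = pa.table({
    "id": [f"row-{i}" for i in range(n)],
    "image": [b"\x00" * 16] * n,
    "caption": ["a caption"] * n,
    "source": ["web"] * n,
    "split": ["train"] * n,
})
lance.write_dataset(tbl, "images_repro.lance", max_rows_per_file=1_000)

ds = lance.dataset("images_repro.lance")
ds.create_scalar_index("id", "BTREE")

ds = lance.dataset("images_repro.lance")  # re-open to pick up the new version
ds.validate()  # raised "Duplicate fragment id" on the real dataset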
I'd run the script, but since it deletes all indices it would be moot here.
Stuck in a weird state now where it has duplicate fragments even without the index, and the script isn't able to fix it.
> the script isn't able to fix it
Are you saying you run the script and it still has duplicates? Or running the script raises an error? If it's an error, what is the error?
The dataset still has duplicates after running the script
Oh that's probably just because we forgot to increment current_id. This should work:
import lance
import json

uri = "vectors"
ds = lance.dataset(uri)
frags = ds.get_fragments()

# Start renumbering just above the current maximum fragment id
max_frag_id = max(frag.fragment_id for frag in frags)
current_id = max_frag_id + 1

new_frags = []
for frag in frags:
    frag_json = frag.metadata.to_json()
    frag_json["id"] = current_id
    new_frag = lance.FragmentMetadata.from_json(json.dumps(frag_json))
    new_frags.append(new_frag)
    current_id += 1

# Commit the renumbered fragment list as an Overwrite of the current version
operation = lance.LanceOperation.Overwrite(ds.schema, new_frags)
ds_new = lance.LanceDataset.commit(uri, operation, ds.version)
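For context on what the script does: it walks the existing fragments, assigns each one a fresh id starting just above the current maximum, and commits the renumbered list back as an Overwrite against the version it was read from, so the ids are unique again. As noted above, the Overwrite replaces the fragment list wholesale and drops existing indices, so any indices need to be rebuilt afterwards.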
I also encountered this problem, and my latest data version has about 284,500,000 rows. The first compaction was successful; however, I get the error when I run it a second time using dataset.optimize.compact_files(max_bytes_per_file=1024*1024*32).
The error is below:
OSError: Query Execution error: Execution error: Row ids did not arrive in sorted order: integers are ordered up to the 0th element, /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/task/poll.rs:290:44
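If this comes up again, one way to narrow down which group of fragments triggers the row-id ordering error is to split compaction into separate plan/execute steps via lance.optimize.Compaction, which is the API that compact_files wraps (see the traceback above). A rough sketch, assuming the plan/tasks/commit interface and a placeholder uri; the option dict mirrors the compact_files arguments:

import lance
from lance.optimize import Compaction

ds = lance.dataset("images.lance")  # placeholder uri
plan = Compaction.plan(ds, {"max_bytes_per_file": 1024 * 1024 * 32})

rewrites = []
for i, task in enumerate(plan.tasks):
    try:
        rewrites.append(task.execute(ds))  # each task rewrites one group of fragments
    except OSError as e:
        print(f"compaction task {i} failed: {e}")

# Optionally commit only the tasks that succeeded
if rewrites:
    Compaction.commit(ds, rewrites)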