
`compact_files` raises `OSError: LanceError(IO): Execution error: Row ids did not arrive in sorted order: integers are ordered up to the 0th element`

tonyf opened this issue 1 year ago • 9 comments

In [57]: dataset.optimize.compact_files(max_bytes_per_file=1024*1024*256, batch_size=1024, num_threads=100)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[57], line 1
----> 1 dataset.optimize.compact_files(max_bytes_per_file=1024*1024*256, batch_size=1024, num_threads=100)

File ~/anaconda3/lib/python3.10/site-packages/lance/dataset.py:2624, in DatasetOptimizer.compact_files(self, target_rows_per_fragment, max_rows_per_group, max_bytes_per_file, materialize_deletions, materialize_deletions_threshold, num_threads, batch_size)
   2562 """Compacts small files in the dataset, reducing total number of files.
   2563
   2564 This does a few things:
   (...)
   2613 lance.optimize.Compaction
   2614 """
   2615 opts = dict(
   2616     target_rows_per_fragment=target_rows_per_fragment,
   2617     max_rows_per_group=max_rows_per_group,
   (...)
   2622     batch_size=batch_size,
   2623 )
-> 2624 return Compaction.execute(self._dataset, opts)

OSError: LanceError(IO): Execution error: Row ids did not arrive in sorted order: integers are ordered up to the 0th element, /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/task/poll.rs:288:44

Let me know what other debug information I can provide.

tonyf · Aug 29 '24 14:08

I haven't seen this one in a long while. @wjones127, do you think stable row ids changed anything here?

Can you dump your fragment info?

import lance.debug
print(lance.debug.format_manifest(dataset))

Also, can you run validate?

dataset.validate()

westonpace · Aug 29 '24 17:08

Manifest: https://gist.github.com/tonyf/45c36a6c20b8b633964ca5295bf7be10

In [7]: dataset.validate()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[7], line 1
----> 1 dataset.validate()

File ~/workspace/datasets/.venv/lib/python3.10/site-packages/lance/dataset.py:1820, in LanceDataset.validate(self)
   1813 def validate(self):
   1814     """
   1815     Validate the dataset.
   1816
   1817     This checks the integrity of the dataset and will raise an exception if
   1818     the dataset is corrupted.
   1819     """
-> 1820     self._ds.validate()

OSError: Encountered corrupt file lance/images.lance: Duplicate fragment id 302279 found in dataset Path { raw: "lance/images.lance" }, /home/runner/work/lance/lance/rust/lance/src/dataset.rs:1227:21

tonyf · Aug 29 '24 19:08
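If validate() reports duplicate fragment ids, one way to enumerate the offending ids is to count them over the public fragment API. A minimal sketch, assuming dataset is an open LanceDataset and that get_fragments() surfaces the manifest's fragment list as-is:

import lance
from collections import Counter

dataset = lance.dataset("lance/images.lance")

# Count how often each fragment id appears in the current manifest
counts = Counter(frag.fragment_id for frag in dataset.get_fragments())
duplicates = {fid: n for fid, n in counts.items() if n > 1}
print(duplicates)  # e.g. {302279: 2} given the error above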

Hmm this sounds like an earlier thread where someone else got duplicate fragment ids: https://discord.com/channels/1030247538198061086/1273015298970095708

I provided a snippet that can undo the duplication of fragment ids: https://discord.com/channels/1030247538198061086/1273015298970095708/1275557151049257070 (a corrected version is reproduced in full later in this thread)

I tried in that thread to reproduce getting duplicate ids, but was unsuccessful. If you have additional info about how to reproduce, that would be helpful. We would like to prevent datasets from getting into this state in the first place.

wjones127 · Aug 29 '24 20:08

Repro is a little tricky because a number of operations have been run on the dataset, mainly compaction and cleanup.

However, I just got a duplicate fragment id on a clean dataset again after adding a scalar BTree index.

tonyf · Aug 29 '24 21:08

Okay, just reproduced it. The dataset has columns id (string), image (bytes), caption (string), source (string), and split (string).

Successfully added an index with

dataset.create_scalar_index("id", "BTREE")

However, validation then fails:

In [158]: dataset.validate()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[158], line 1
----> 1 dataset.validate()

File ~/anaconda3/lib/python3.10/site-packages/lance/dataset.py:1832, in LanceDataset.validate(self)
   1825 def validate(self):
   1826     """
   1827     Validate the dataset.
   1828
   1829     This checks the integrity of the dataset and will raise an exception if
   1830     the dataset is corrupted.
   1831     """
-> 1832     self._ds.validate()

OSError: Encountered corrupt file lance/images_v2.lance: Duplicate fragment id 151871 found in dataset Path { raw: "lance/images_v2.lance" }, /home/runner/work/lance/lance/rust/lance/src/dataset.rs:1241:21

In [159]: lance.__version__
Out[159]: '0.17.0-beta.10'

I'd run the script, but since it deletes all indices, it would be moot here.

tonyf · Aug 29 '24 21:08
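For reference, the repro described above amounts to roughly the following sketch. The table here is hypothetical and far smaller than the real dataset, so it may well validate cleanly; the exact scale or contents needed to trigger the duplicate id are not known:

import lance
import pyarrow as pa

# Hypothetical small table mirroring the reported schema
table = pa.table({
    "id": [str(i) for i in range(10_000)],
    "image": [b"\x00"] * 10_000,
    "caption": ["caption"] * 10_000,
    "source": ["source"] * 10_000,
    "split": ["train"] * 10_000,
})

ds = lance.write_dataset(table, "images_v2.lance", mode="overwrite")
ds.create_scalar_index("id", "BTREE")

ds = lance.dataset("images_v2.lance")  # reopen at the latest version
ds.validate()  # reported above to raise "Duplicate fragment id ..." on 0.17.0-beta.10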

I'm now stuck in a weird state where the dataset has duplicate fragment ids even without the index, and the script isn't able to fix it.

tonyf · Aug 29 '24 22:08

"the script isn't able to fix it"

Are you saying you ran the script and it still has duplicates? Or does running the script raise an error? If it's an error, what is it?

wjones127 · Aug 29 '24 22:08

The dataset still has duplicates after running the script.

tonyf · Aug 29 '24 23:08

Oh, that's probably just because we forgot to increment current_id. This should work:

import lance
import json

uri = "vectors"
ds = lance.dataset(uri)

frags = ds.get_fragments()
max_frag_id = max(frag.fragment_id for frag in frags)

# Reassign every fragment a fresh, unique id starting above the current maximum
current_id = max_frag_id + 1
new_frags = []
for frag in frags:
    frag_json = frag.metadata.to_json()
    frag_json["id"] = current_id
    new_frag = lance.FragmentMetadata.from_json(json.dumps(frag_json))
    new_frags.append(new_frag)
    current_id += 1

# Commit the renumbered fragment list as a new Overwrite version
operation = lance.LanceOperation.Overwrite(ds.schema, new_frags)

ds_new = lance.LanceDataset.commit(uri, operation, ds.version)

wjones127 · Aug 29 '24 23:08
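After committing the renumbered fragments, a quick sanity check that the duplication is gone, using only calls shown earlier in the thread:

import lance

ds_new = lance.dataset("vectors")
ds_new.validate()  # raises OSError again if duplicate fragment ids remain

ids = [frag.fragment_id for frag in ds_new.get_fragments()]
assert len(ids) == len(set(ids)), "fragment ids are still duplicated"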

I also encountered this problem. My latest dataset version has about 284,500,000 rows. The first compaction succeeded; however, I get the error when I run it a second time with dataset.optimize.compact_files(max_bytes_per_file=1024*1024*32). The error is:

OSError: Query Execution error: Execution error: Row ids did not arrive in sorted order: integers are ordered up to the 0th element, /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/task/poll.rs:290:44

Joseph1314 · Apr 18 '25 12:04