
Overflow issue with Fairseq Preprocess for large datasets

Open henrycharlesworth opened this issue 1 year ago • 3 comments

🐛 Bug

I realise no one is maintaining this anymore, but I'm posting this for anyone who might come across a similar issue, since it was hard to debug:

With the default binarized dataset implementation in fairseq-preprocess (mmap), it is possible to get integer overflow errors when processing big datasets. The key snippet of code is in fairseq/data/indexed_dataset.py:

@staticmethod
def _get_pointers(sizes):
    # `dtype` is the index dtype captured from the enclosing writer scope;
    # its itemsize is the number of bytes per token
    dtype_size = dtype().itemsize
    address = 0
    pointers = []

    # each pointer is the byte offset at which a sequence starts in the data file
    for size in sizes:
        pointers.append(address)
        address += size * dtype_size

    return pointers

For some reason, when using multiple workers, it is possible for some of the values in sizes to be np.int32 rather than plain int. I have not worked out why this is. However, for large enough datasets this can lead to integer overflow (as address becomes type np.int32 rather than int after the first addition).
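To make the failure mode concrete, here is a minimal sketch of the wrap-around (the sizes and dtype here are made up for illustration, not taken from a real dataset):

import numpy as np

# hypothetical per-sentence token counts; in the bug they come back from the
# workers as np.int32 instead of plain int
sizes = [np.int32(600_000_000)] * 4
dtype_size = 2  # e.g. tokens stored as np.int16

address = 0  # starts life as a Python int
pointers = []
for size in sizes:
    pointers.append(address)
    # np.int32 * int -> np.int32, so after the first iteration address is np.int32
    address += size * dtype_size

print(type(address))  # <class 'numpy.int32'>
print(address)        # wrapped value (NumPy may also warn), not the true 4_800_000_000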

The fix is just to change the accumulation line to:

address += int(size * dtype_size)
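Applied in context, the loop from the snippet above would then read (a sketch, with only the cast changed):

    for size in sizes:
        pointers.append(address)
        address += int(size * dtype_size)  # the cast keeps address a plain Python int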

henrycharlesworth · Aug 07 '24

However, for large enough datasets this can lead to integer overflow (as address becomes type np.int32 rather than int).

Aren't the ranges of np.int32 and int the same (from -2,147,483,648 to 2,147,483,647)?

abdr17 · Feb 04 '25

No, Python integers are arbitrary precision (constrained only by available memory), so plain int won't overflow; np.int32 is fixed at 32 bits and wraps around.
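For anyone else wondering, the difference is easy to check directly (arbitrary values, chosen only so the sum exceeds 2**31 - 1):

import numpy as np

a = 2_000_000_000 + 2_000_000_000                      # plain Python ints: exact, 4000000000
b = np.int32(2_000_000_000) + np.int32(2_000_000_000)  # fixed 32-bit: wraps to -294967296 (with a RuntimeWarning)

print(a, b)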

henrycharlesworth · Feb 04 '25

Hi, I encountered a CPU memory OOM issue while training HuBERT. The problem is that load_label_offset and load_audio in fairseq/data/audio/hubert_dataset.py load all the data into a list at once. Are there any good solutions for this? I have roughly 200 million data entries.

VJJJJJJ1 · Nov 05 '25