Field name silently cropped at 16 chars

Open alcinos opened this issue 3 years ago • 1 comments

Hi,

Consider the following example:

from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField, FloatField, StringField, IntField

class Dummy:
    def __len__(self):
        return 10
    def __getitem__(self, id):
        return (42,)
ds = Dummy()
writer = DatasetWriter("/tmp/test.beton", {
    'this_is_a_very_long_field': IntField(),
    }, num_workers=2)
writer.from_indexed_dataset(ds)

loader = Loader('/tmp/test.beton', batch_size=2, num_workers=2, order=OrderOption.SEQUENTIAL)
print(list(loader.pipelines.keys())[0]) # prints 'this_is_a_very_l'

It's unclear to me if the 16 chars limit is intended (I haven't found any mention of that in the doc, but I may have missed it). If it's not, well it's a bug, and if it is, perhaps it would be appropriate to add an assert in the writer?

As it stands it's very confusing because no warning is issued, and I spent quite some time debugging an issue where my loading pipeline was not applied. I had something like pipelines = {"this_is_a_very_long_field": my_pipeline} and nothing was happening. I think this is the result of two fixable issues:

As mentioned above, since the field name was actually shortened to "this_is_a_very_l" the pipeline didn't work. Throwing an error when trying to create a field name too long (or supporting it) fixes this issue.
No warning was issued for creating a pipeline for a non-existent field. Adding a check to verify if each field used exists and issuing an error/warning otherwise would fix this issue.

Best

Feb 26 '22 06:02 alcinos

Hello,

This is indeed the case. FFCV file formats has a limit on the length of the field. We could definitely warn users when they use a name that is too long. Are you interested in submitting a pull request ?

Mar 02 '22 22:03 GuillaumeLeclerc