
Does not work with structured arrays

Open jeffpeck10x opened this issue 2 years ago • 4 comments

.npy files can contain structured arrays, as described here, but these fail to open with this error:

Uncaught SyntaxError: Unexpected token ( in JSON at position 28

Example, in python create the structured array:

np.save('test/out.npy', np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]))

Serve the out.npy file:

cd test
npx serve

Try to load that out.npy file:

> np.load("http://localhost:3000/out.npy")
Promise {
  <pending>,
  [Symbol(async_id_symbol)]: 4454,
  [Symbol(trigger_async_id_symbol)]: 5
}
> Uncaught SyntaxError: Unexpected token ( in JSON at position 28
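For context, the header's descr for a structured array is a list of Python tuples, and the tuple parentheses are not valid JSON, which is what trips up the header parsing. A rough standalone illustration (the rewrite below only mimics the general quote/literal substitution approach, not npyjs's actual code):

```javascript
// Naively rewrite the Python header dict into JSON, then try to parse it.
// This is only a sketch of the failure mode, not npyjs's real header parser.
const header =
  "{'descr': [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')], 'fortran_order': False, 'shape': (2,), }";
const jsonish = header
  .replace(/'/g, '"')
  .replace(/False/g, "false")
  .replace(/True/g, "true")
  .replace(/, }/g, "}")
  .replace(/\((\d+),\)/g, "[$1]"); // shape tuple -> array
let failed = false;
try {
  JSON.parse(jsonish);
} catch (e) {
  failed = e instanceof SyntaxError; // the '(' in the descr tuples survives
}
console.log(failed); // true
```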

jeffpeck10x avatar Jan 08 '24 23:01 jeffpeck10x

Here is a rough idea of how to parse a structured array. It changes things a bit with the library, but this stand-alone example worked with a structured array that I was working with. Maybe this will be helpful for others.


// Maps NPY numeric dtype codes to DataView getter names; string columns
// (like the 'U10' name column in the example above) are not handled here.
const dtypes = {
  u1: {
    bytesPerElement: Uint8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint8'
  },
  u2: {
    bytesPerElement: Uint16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint16'
  },
  i1: {
    bytesPerElement: Int8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt8'
  },
  i2: {
    bytesPerElement: Int16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt16'
  },
  u4: {
    bytesPerElement: Uint32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint32'
  },
  i4: {
    bytesPerElement: Int32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt32'
  },
  u8: {
    bytesPerElement: BigUint64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigUint64'
  },
  i8: {
    bytesPerElement: BigInt64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigInt64'
  },
  f4: {
    bytesPerElement: Float32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat32'
  },
  f8: {
    bytesPerElement: Float64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat64'
  },
};

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianness, dtype]) => ({
      columnName,
      littleEndian: endianness === "<",
      bytesPerElement: dtypes[dtype].bytesPerElement,
      dvFn: (...args) => dv[dtypes[dtype].dvFnName](...args),
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  const stride = columns
    .map((c) => c.bytesPerElement)
    .reduce((sum, numBytes) => sum + numBytes, 0);

  const data = [];
  let i, c, offset, dataIdx, row;
  const numColumns = columns.length;
  for (i = 0; i < numRows; i++) {
    offset = 0;
    row = {};
    dataIdx = offsetBytes + i * stride;
    for (c = 0; c < numColumns; c++) {
      row[columns[c].columnName] = columns[c].dvFn(dataIdx + offset, columns[c].littleEndian);
      offset += columns[c].bytesPerElement;
    }
    data.push(row);
  }

  return data;
}
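As a sanity check of the header and offset arithmetic above, here is a self-contained sketch (assuming NPY format v1.0 and little-endian fields, with a hypothetical dtype of [('age', '<i4'), ('weight', '<f4')]) that hand-builds a one-row structured .npy buffer and reads it back with a DataView:

```javascript
// Hand-build a minimal NPY v1.0 buffer for one structured row, then read it
// back using the same stride/offset arithmetic as the parse sketch above.
const header =
  "{'descr': [('age', '<i4'), ('weight', '<f4')], 'fortran_order': False, 'shape': (1,), }";
// Pad with spaces so 10 + headerLength is a multiple of 16, ending in '\n'.
const pad = 16 - ((10 + header.length + 1) % 16);
const headerBytes = new TextEncoder().encode(header + " ".repeat(pad) + "\n");

const buf = new ArrayBuffer(10 + headerBytes.length + 8); // 8 data bytes: i4 + f4
const bytes = new Uint8Array(buf);
bytes.set([0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59, 1, 0]); // "\x93NUMPY", version 1.0
const dv = new DataView(buf);
dv.setUint16(8, headerBytes.length, true); // HEADER_LEN, little-endian
bytes.set(headerBytes, 10);

const dataStart = 10 + headerBytes.length;
dv.setInt32(dataStart, 9, true);          // age column, offset 0
dv.setFloat32(dataStart + 4, 81.0, true); // weight column, offset 4 (stride = 8)

// Reading a row: each column sits at the running sum of prior element sizes.
const age = dv.getInt32(dataStart + 0, true);
const weight = dv.getFloat32(dataStart + 4, true);
console.log(age, weight); // 9 81
```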

jeffpeck10x avatar Jan 09 '24 07:01 jeffpeck10x

Wow thank you for the thoughtful comments @jeffpeck10x!! Would you be interested in turning this into a PR? Otherwise I'm happy to incorporate these suggestions in my next dev push :)

j6k4m8 avatar Feb 13 '24 16:02 j6k4m8

Thanks @j6k4m8 . I don't think I have the time to generalize this into a more complete approach. I did end up taking this a little further, although I ultimately went with Apache Arrow for my use case. That said, here is some helpful code to get this started.

The following function should reliably extract columns and numRows from the header:

function parseHeaderContents(hcontents) {
  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  // Note: this assumes the dtypes table above is extended so that each entry
  // also carries a TypedArray constructor (e.g. Float32Array for 'f4').
  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianness, dtype]) => ({
      columnName,
      littleEndian: endianness === "<",
      dvFnName: dtypes[dtype].dvFnName,
      TypedArray: dtypes[dtype].TypedArray,
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  return { columns, numRows };
}
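A quick standalone check of the descr and shape regexes above, using the dtype from the original report (the dtypes lookup is stubbed out here, so only the captured names and codes are shown):

```javascript
// Exercise the header regexes in isolation against the example dtype.
const hcontents =
  "{'descr': [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')], 'fortran_order': False, 'shape': (2,), }";
const [, descr, , shape] = hcontents.match(
  /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
);
const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
  ([, columnName, endianness, dtype]) => ({ columnName, endianness, dtype })
);
const numRows = Number(shape.match(/\((\d+),\)/)[1]);
console.log(columns.map((c) => c.columnName)); // [ 'name', 'age', 'weight' ]
console.log(numRows); // 2
```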

I ended up writing the following parse function to get at the data when I was experimenting:

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const { columns, numRows } = parseHeaderContents(hcontents);
  const columnNames = columns.map((c) => c.columnName);
  const offsets = columns
    .map((c) => c.TypedArray.BYTES_PER_ELEMENT)
    .reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
  const stride = offsets.pop();
  const dvFnNames = columns.map((c) => c.dvFnName);
  const dataViews = offsets.map(
    (offset) => new DataView(arrayBufferContents, offsetBytes + offset)
  );
  const dataViewGetters = dataViews.map((dv, i) => dv[dvFnNames[i]].bind(dv));
  const data = columnNames.reduce(
    (obj, columnName, i) => ({
      ...obj,
      [columnName]: new columns[i].TypedArray(numRows),
    }),
    {}
  );

  for (let j = 0; j < columnNames.length; j++) {
    const columnName = columnNames[j];
    const getter = dataViewGetters[j];
    const column = data[columnName];

    for (let i = 0; i < numRows; i++) {
      // little-endian data is assumed for every column here
      column[i] = getter(i * stride, true);
    }
  }

  return data;
}
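The running-offset computation in that parse function is just a prefix sum over the per-column byte sizes, with the final sum being the row stride; in isolation:

```javascript
// Prefix-sum the per-column byte sizes to get each column's byte offset
// within a row; the total (popped off the end) is the row stride.
const sizes = [4, 4, 8]; // e.g. i4, f4, f8 columns
const offsets = sizes.reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
const stride = offsets.pop();
console.log(offsets, stride); // [ 0, 4, 8 ] 16
```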

I think there are ideas from that which can be used to generalize the approach.

jeffpeck10x avatar Feb 13 '24 18:02 jeffpeck10x

I know this all goes in a slightly different direction than your library, but using your library as the base gave me a really good head start, if only to realize that I could use DataViews and offsets, and to find where the metadata lives in the headers. So if any of this helps improve the versatility of your library, great! And full circle. Don't feel any pressure to do this, though. For my particular use case, as mentioned, I am all set with Apache Arrow IPC (I ultimately needed to send this data between two processes, so it is more suited anyway).

jeffpeck10x avatar Feb 13 '24 18:02 jeffpeck10x