Cannot create a string longer than 0x1fffffe8 characters when using data-persistence in server
Describe the bug
When trying to persist a large amount of data using the persistToFile function, Node.js throws an error: Cannot create a string longer than 0x1fffffe8 characters. This error is due to the V8 engine's limitation on string size.
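For reference, the limit is V8's maximum string length (0x1fffffe8 characters, roughly 512 MB); any attempt to turn a larger Buffer into a single string hits it. A minimal sketch that reproduces the error, assuming a 64-bit Node build with enough free memory:

import { Buffer } from 'buffer';

// 0x1fffffe8 + 1 bytes; decoding as latin1 yields one character per
// byte, which exceeds V8's maximum string length and throws
// "Cannot create a string longer than 0x1fffffe8 characters".
const tooBig = Buffer.alloc(0x1fffffe9);
tooBig.toString('latin1');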
To Reproduce
- Create a large dataset (larger than the V8 string size limit).
- Try to persist this data using the persistToFile function.
Expected behavior
The data should be successfully persisted to the file without any errors.
Environment Info
OS: macOS
Node: 20.7.0
Orama: 2.0.0 beta7
Affected areas
Data Insertion
Additional context
Possible Solution:
Consider implementing a streaming approach to write the data to the file, which would avoid having to convert the entire Buffer to a string at once.
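A minimal sketch of that idea: since write streams accept Buffers directly, the string conversion can be skipped entirely. The helper name writeBufferToFile is illustrative, not part of the plugin:

import fs from 'fs';
import { finished } from 'stream/promises';

// Hypothetical sketch: stream an already-encoded Buffer to disk
// without ever converting it to one big string.
async function writeBufferToFile(serialized: Buffer, outputFile: string) {
  const out = fs.createWriteStream(outputFile);
  out.write(serialized); // Buffers bypass the V8 string limit entirely
  out.end();
  await finished(out); // wait until the bytes are flushed to disk
}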
Thanks for opening this. @allevo we should rework the persistence plugin if we can reproduce this
Hi @imertz ! Have you tried with a different format? For instance https://docs.oramasearch.com/open-source/plugins/plugin-data-persistence#persisting-the-database-to-disk-server-usage . Will that fit your case?
I'll try it out and come back to you.
IIRC dpack worked for persisting the file to disk, but if the file is larger than 512 MB, restoreFromFile won't work. This is mainly because all the restore implementations rely on the toString() method at some point, which means they try to create a string over 512 MB. So the file should be read with fs.createReadStream and written with fs.createWriteStream.
Here's a naive streaming implementation for Node.js using @msgpack/msgpack (basically the current binary-format solution with streaming support added):
import type { AnyOrama, RawData } from '@orama/orama';
import { create, load, save } from '@orama/orama';
import fs from 'fs';
import { decode, encode } from '@msgpack/msgpack';

export const persistToFile = async (
  db: AnyOrama,
  outputFile: string,
): Promise<void> => {
  // Export the database and encode the export as msgpack binary.
  const dbExport = await save(db);
  const msgpack = encode(dbExport);
  const serialized = Buffer.from(
    msgpack.buffer,
    msgpack.byteOffset,
    msgpack.byteLength,
  );

  // Write the buffer in small hex-encoded chunks so no single
  // string ever approaches the V8 limit.
  const writeStream = fs.createWriteStream(outputFile);
  const chunkSize = 1024;
  for (let i = 0; i < serialized.length; i += chunkSize) {
    const end = Math.min(i + chunkSize, serialized.length);
    const chunk = serialized.subarray(i, end);
    writeStream.write(chunk.toString('hex'));
  }
  writeStream.end();

  // Resolve only after the data has actually been flushed to disk.
  await new Promise<void>((resolve, reject) => {
    writeStream.on('finish', () => {
      console.log('File has been written as', outputFile);
      resolve();
    });
    writeStream.on('error', reject);
  });
};
const deserialize = async (inputFile: string): Promise<RawData> => {
  return new Promise<RawData>((resolve, reject) => {
    // Read the hex text back in chunks. If you set a custom
    // highWaterMark, keep it even so a hex byte pair is never
    // split across two chunks.
    const readStream = fs.createReadStream(inputFile, {
      encoding: 'utf8',
      // highWaterMark: 1024,
    });
    const chunks: Buffer[] = [];
    readStream.on('data', (chunk: string) => {
      chunks.push(Buffer.from(chunk, 'hex'));
    });
    readStream.on('end', () => {
      // Reassemble the binary payload and decode the msgpack export.
      const combinedBuffer = Buffer.concat(chunks);
      resolve(decode(combinedBuffer) as RawData);
    });
    readStream.on('error', (err) => {
      reject(err);
    });
  });
};
export const restoreFromFile = async (inputFile: string) => {
  const deserialized = await deserialize(inputFile);
  // The schema here is just a placeholder: load() replaces the
  // database's internal data with the deserialized export.
  const db = await create({
    schema: {
      __placeholder: 'string',
    },
  });
  await load(db, deserialized);
  return db;
};
Disclaimer: I extracted these functions from a larger codebase, so I haven't actually run this exact piece of code, but hopefully it helps. Also, I'm not sure the chunking in persistToFile is the way to go, but it worked for me.
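For reference, usage of the two functions would look something like this (the module path and file name are illustrative):

import { create, insert } from '@orama/orama';
import { persistToFile, restoreFromFile } from './persistence';

const db = await create({
  schema: { title: 'string' },
});
await insert(db, { title: 'Hello, Orama!' });

await persistToFile(db, './backup.msp');
const restored = await restoreFromFile('./backup.msp');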
We also noticed that you can write the msgpack-encoded binary directly to the file instead of turning it into hex before writing. This makes the .msp file half the size.
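A sketch of that variant, assuming the serialized buffer and the deserialize promise wrapper from the snippet above: the write stream accepts Buffers directly, and the read side just drops the encoding option and collects raw chunks.

// Persist: write the msgpack Buffer as-is instead of hex text.
const writeStream = fs.createWriteStream(outputFile);
writeStream.write(serialized);
writeStream.end();

// Restore (inside the deserialize promise): with no encoding option,
// data events emit raw Buffer chunks.
const chunks: Buffer[] = [];
fs.createReadStream(inputFile)
  .on('data', (chunk) => chunks.push(chunk as Buffer))
  .on('end', () => resolve(decode(Buffer.concat(chunks)) as RawData))
  .on('error', reject);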