Cannot create a string longer than 0x1fffffe8 characters when using data-persistence in server
Describe the bug
When trying to persist a large amount of data using the persistToFile function, Node.js throws an error: Cannot create a string longer than 0x1fffffe8 characters. This error is due to the V8 engine's limitation on string size.
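For reference, the limit is V8's maximum string length (0x1fffffe8 characters, roughly 512 MB); any attempt to turn a larger Buffer into a single string hits it. A minimal sketch that reproduces the error, assuming a 64-bit Node build with enough free memory:

import { Buffer } from 'buffer';

// 0x1fffffe8 + 1 bytes; decoding as latin1 yields one character per
// byte, which exceeds V8's maximum string length and throws
// "Cannot create a string longer than 0x1fffffe8 characters".
const tooBig = Buffer.alloc(0x1fffffe9);
tooBig.toString('latin1');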
To Reproduce
- Create a large dataset (larger than the V8 string size limit).
- Try to persist this data using the persistToFile function.
Expected behavior
The data should be successfully persisted to the file without any errors.
Environment Info
OS: macOS
Node: 20.7.0
Orama: 2.0.0 beta7
Affected areas
Data Insertion
Additional context
Possible Solution:
Consider implementing a streaming approach to write the data to the file, which would avoid having to convert the entire Buffer to a string at once.
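A minimal sketch of that idea: since write streams accept Buffers directly, the string conversion can be skipped entirely. The helper name writeBufferToFile is illustrative, not part of the plugin:

import fs from 'fs';
import { finished } from 'stream/promises';

// Hypothetical sketch: stream an already-encoded Buffer to disk
// without ever converting it to one big string.
async function writeBufferToFile(serialized: Buffer, outputFile: string) {
  const out = fs.createWriteStream(outputFile);
  out.write(serialized); // Buffers bypass the V8 string limit entirely
  out.end();
  await finished(out); // wait until the bytes are flushed to disk
}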
Thanks for opening this. @allevo we should rework the persistence plugin if we can reproduce this
Hi @imertz ! Have you tried with a different format? For instance https://docs.oramasearch.com/open-source/plugins/plugin-data-persistence#persisting-the-database-to-disk-server-usage . Will that fit your case?
I'll try it out and come back to you.
IIRC dpack worked for persisting the file to disk, but if the file is larger than 512 MB, restoreFromFile won't work. This is mainly because all the restore implementations rely on the toString() method at some point, which means they try to create a string over 512 MB. So the file should be read with fs.createReadStream and written with fs.createWriteStream.
Here's a naive streaming implementation for Node.js using @msgpack/msgpack (basically the current binary-format solution with streaming support added):
import type { AnyOrama, RawData } from '@orama/orama';
import { create, load, save } from '@orama/orama';
import fs from 'fs';
import { decode, encode } from '@msgpack/msgpack';

export const persistToFile = async (
  db: AnyOrama,
  outputFile: string,
): Promise<void> => {
  // Export the database and encode the export as msgpack binary.
  const dbExport = await save(db);
  const msgpack = encode(dbExport);
  const serialized = Buffer.from(
    msgpack.buffer,
    msgpack.byteOffset,
    msgpack.byteLength,
  );

  // Write the buffer in small hex-encoded chunks so no single
  // string ever approaches the V8 limit.
  const writeStream = fs.createWriteStream(outputFile);
  const chunkSize = 1024;
  for (let i = 0; i < serialized.length; i += chunkSize) {
    const end = Math.min(i + chunkSize, serialized.length);
    const chunk = serialized.subarray(i, end);
    writeStream.write(chunk.toString('hex'));
  }
  writeStream.end();

  // Resolve only after the data has actually been flushed to disk.
  await new Promise<void>((resolve, reject) => {
    writeStream.on('finish', () => {
      console.log('File has been written as', outputFile);
      resolve();
    });
    writeStream.on('error', reject);
  });
};
const deserialize = async (inputFile: string): Promise<RawData> => {
  return new Promise<RawData>((resolve, reject) => {
    // Read the hex text back in chunks. If you set a custom
    // highWaterMark, keep it even so a hex byte pair is never
    // split across two chunks.
    const readStream = fs.createReadStream(inputFile, {
      encoding: 'utf8',
      // highWaterMark: 1024,
    });
    const chunks: Buffer[] = [];
    readStream.on('data', (chunk: string) => {
      chunks.push(Buffer.from(chunk, 'hex'));
    });
    readStream.on('end', () => {
      // Reassemble the binary payload and decode the msgpack export.
      const combinedBuffer = Buffer.concat(chunks);
      resolve(decode(combinedBuffer) as RawData);
    });
    readStream.on('error', (err) => {
      reject(err);
    });
  });
};
export const restoreFromFile = async (inputFile: string) => {
  const deserialized = await deserialize(inputFile);
  // The schema here is just a placeholder: load() replaces the
  // database's internal data with the deserialized export.
  const db = await create({
    schema: {
      __placeholder: 'string',
    },
  });
  await load(db, deserialized);
  return db;
};
Disclaimer: I extracted these functions from a larger codebase, so I haven't actually run this exact piece of code, but hopefully it helps. Also, I'm not sure the chunking in persistToFile is the way to go, but it worked for me.
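For reference, usage of the two functions would look something like this (the module path and file name are illustrative):

import { create, insert } from '@orama/orama';
import { persistToFile, restoreFromFile } from './persistence';

const db = await create({
  schema: { title: 'string' },
});
await insert(db, { title: 'Hello, Orama!' });

await persistToFile(db, './backup.msp');
const restored = await restoreFromFile('./backup.msp');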
We also noticed that you can write the msgpack-encoded binary directly to the file instead of turning it into hex before writing. This makes the .msp file half the size.
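A sketch of that variant, assuming the serialized buffer and the deserialize promise wrapper from the snippet above: the write stream accepts Buffers directly, and the read side just drops the encoding option and collects raw chunks.

// Persist: write the msgpack Buffer as-is instead of hex text.
const writeStream = fs.createWriteStream(outputFile);
writeStream.write(serialized);
writeStream.end();

// Restore (inside the deserialize promise): with no encoding option,
// data events emit raw Buffer chunks.
const chunks: Buffer[] = [];
fs.createReadStream(inputFile)
  .on('data', (chunk) => chunks.push(chunk as Buffer))
  .on('end', () => resolve(decode(Buffer.concat(chunks)) as RawData))
  .on('error', reject);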