feat: add "returning" search option to select only specified fields from a document
Implements https://github.com/askorama/orama/issues/769.
This PR introduces the ability to pass an array of document fields to be returned via the returning option.
Initially, I considered using the existing getDocumentProperties() function. However, this function does not preserve the original structure of objects. Moreover, when dealing with nested objects, it only returns the deepest fields. This behavior forces users to specify all properties if they want to return the entire object, which can be cumbersome.
Since getDocumentProperties() is widely used throughout the codebase, I created a new function, pickDocumentProperties(), that preserves the original structure of objects. It also lets users specify a top-level key to return an entire nested object, which simplifies selecting the fields to return.
It's important to note that the includeVectors option is ignored when the returning option is also provided. Prioritizing the user's explicit selection seemed more logical to me, though this behavior is open to further discussion.
If everything looks good, I’ll proceed with adding the corresponding tests.
Hi @fasenderos, thanks for your PR!
This solution creates a lot of temporary objects, which slows the application down.
Returning to the original question (https://github.com/askorama/orama/issues/769): the application wants to reduce network traffic by specifying which properties to return.
Could this be achieved with some special serialization method? I'm thinking of how Fastify allows an output schema to speed up the serialization process. Could you use a library like fast-json-stringify to address this?
Hi @allevo thanks for the reply.
> Could this be achieved with some special serialization method? I'm thinking of how Fastify allows an output schema to speed up the serialization process. Could you use a library like fast-json-stringify to address this?
Yes, you can also achieve the same result with fast-json-stringify.
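For context, the idea behind fast-json-stringify (and Fastify's output schemas) is to compile the schema into a specialized stringify function ahead of time, so the per-document work is just string concatenation. A hand-rolled toy version of the same idea, for flat schemas only (the real library handles escaping, nesting, and validation far more thoroughly; `compileSerializer` below is a hypothetical helper, not the library's API):

```typescript
// Toy schema-compiled serializer: given a flat schema, build a function once
// that emits JSON for exactly those properties and drops everything else.
type FlatSchema = Record<string, 'string' | 'number' | 'boolean'>

function compileSerializer(schema: FlatSchema): (doc: any) => string {
  // All structural work (which keys, in which order) happens here, once.
  const keys = Object.keys(schema)
  return (doc) =>
    '{' +
    keys.map((k) => `${JSON.stringify(k)}:${JSON.stringify(doc[k])}`).join(',') +
    '}'
}

const serializeHit = compileSerializer({ string1: 'string', number1: 'number' })
serializeHit({ string1: 'foo', number1: 99.99, extra: 'dropped' })
// → '{"string1":"foo","number1":99.99}'
```

Note that the output is a string, which is exactly the trade-off discussed below: the hits would either stay serialized or need an extra parse step.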
Before starting the implementation, I wanted to run some tests to check performance (time and memory usage). Based on the results, fast-json-stringify is always the fastest, while pickDocumentProperties() consistently uses the least memory. The "problem" with fast-json-stringify is that the hits would contain an array of strings that need to be parsed. I therefore also added tests with JSON.parse() and fast-json-parse; these variants are consistently the slowest.
If you think I should add more use cases or modify the tests, let me know. I can still proceed with using fast-json-stringify, but I’d like to know if the data returned in the hits needs to be parsed or not.
Below are the results, and at the bottom, the script used.
1000 Runs on 10/100/1000 documents and serializing 1 property:

```
RESULTS SUMMARY FOR 1000 Runs - Serializing 1 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.001 (fastest)
2° pick-document-properties in 0.0017
3° fast-json-stringify-and-normal-parse in 0.0042
4° fast-json-stringify-and-fast-parse in 0.0048

Memory used in byte (50% Percentile)
1° pick-document-properties in 1160 (least consuming)
2° fast-json-stringify in 1480
3° fast-json-stringify-and-normal-parse in 2200
4° fast-json-stringify-and-fast-parse in 2600

RESULTS SUMMARY FOR 1000 Runs - Serializing 1 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0047 (fastest)
2° pick-document-properties in 0.0089
3° fast-json-stringify-and-normal-parse in 0.0286
4° fast-json-stringify-and-fast-parse in 0.0341

Memory used in byte (50% Percentile)
1° pick-document-properties in 9800 (least consuming)
2° fast-json-stringify in 13000
3° fast-json-stringify-and-normal-parse in 20200
4° fast-json-stringify-and-fast-parse in 24200

RESULTS SUMMARY FOR 1000 Runs - Serializing 1 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0429 (fastest)
2° pick-document-properties in 0.0813
3° fast-json-stringify-and-normal-parse in 0.286
4° fast-json-stringify-and-fast-parse in 0.3431

Memory used in byte (50% Percentile)
1° pick-document-properties in 96208 (least consuming)
2° fast-json-stringify in 128208
3° fast-json-stringify-and-normal-parse in 200216
4° fast-json-stringify-and-fast-parse in 240224
```
1000 Runs on 10/100/1000 documents and serializing 3 properties:

```
RESULTS SUMMARY FOR 1000 Runs - Serializing 3 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0018 (fastest)
2° pick-document-properties in 0.0039
3° fast-json-stringify-and-normal-parse in 0.0062
4° fast-json-stringify-and-fast-parse in 0.0068

Memory used in byte (50% Percentile)
1° pick-document-properties in 1960 (least consuming)
2° fast-json-stringify in 3400
3° fast-json-stringify-and-normal-parse in 4760
4° fast-json-stringify-and-fast-parse in 5160

RESULTS SUMMARY FOR 1000 Runs - Serializing 3 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0071 (fastest)
2° pick-document-properties in 0.029
3° fast-json-stringify-and-normal-parse in 0.0442
4° fast-json-stringify-and-fast-parse in 0.0501

Memory used in byte (50% Percentile)
1° pick-document-properties in 17800 (least consuming)
2° fast-json-stringify in 32200
3° fast-json-stringify-and-normal-parse in 45800
4° fast-json-stringify-and-fast-parse in 49800

RESULTS SUMMARY FOR 1000 Runs - Serializing 3 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0669 (fastest)
2° pick-document-properties in 0.2888
3° fast-json-stringify-and-normal-parse in 0.4408
4° fast-json-stringify-and-fast-parse in 0.4989

Memory used in byte (50% Percentile)
1° pick-document-properties in 176224 (least consuming)
2° fast-json-stringify in 320224
3° fast-json-stringify-and-normal-parse in 456248
4° fast-json-stringify-and-fast-parse in 496248
```
1000 Runs on 10/100/1000 documents and serializing 6 properties:

```
RESULTS SUMMARY FOR 1000 Runs - Serializing 6 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0023 (fastest)
2° pick-document-properties in 0.0068
3° fast-json-stringify-and-normal-parse in 0.0093
4° fast-json-stringify-and-fast-parse in 0.0098

Memory used in byte (50% Percentile)
1° pick-document-properties in 3480 (least consuming)
2° fast-json-stringify in 6840
3° fast-json-stringify-and-normal-parse in 9080
4° fast-json-stringify-and-fast-parse in 9480

RESULTS SUMMARY FOR 1000 Runs - Serializing 6 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0142 (fastest)
2° pick-document-properties in 0.0581
3° fast-json-stringify-and-normal-parse in 0.0732
4° fast-json-stringify-and-fast-parse in 0.0784

Memory used in byte (50% Percentile)
1° pick-document-properties in 33000 (least consuming)
2° fast-json-stringify in 66600
3° fast-json-stringify-and-normal-parse in 89000
4° fast-json-stringify-and-fast-parse in 93000

RESULTS SUMMARY FOR 1000 Runs - Serializing 6 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.1385 (fastest)
2° pick-document-properties in 0.5851
3° fast-json-stringify-and-normal-parse in 0.7507
4° fast-json-stringify-and-fast-parse in 0.8038

Memory used in byte (50% Percentile)
1° pick-document-properties in 328224 (least consuming)
2° fast-json-stringify in 664232
3° fast-json-stringify-and-normal-parse in 888256
4° fast-json-stringify-and-fast-parse in 928288
```
1000 Runs on 10/100/1000 documents and serializing 8 properties, where one property is an entire object and another is a single nested property of an object:

```
RESULTS SUMMARY FOR 1000 Runs - Serializing 8 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.005 (fastest)
2° pick-document-properties in 0.0097
3° fast-json-stringify-and-normal-parse in 0.0243
4° fast-json-stringify-and-fast-parse in 0.0252

Memory used in byte (50% Percentile)
1° pick-document-properties in 5240 (least consuming)
2° fast-json-stringify in 21240
3° fast-json-stringify-and-normal-parse in 28200
4° fast-json-stringify-and-fast-parse in 28600

RESULTS SUMMARY FOR 1000 Runs - Serializing 8 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0426 (fastest)
2° pick-document-properties in 0.0895
3° fast-json-stringify-and-normal-parse in 0.2381
4° fast-json-stringify-and-fast-parse in 0.2459

Memory used in byte (50% Percentile)
1° pick-document-properties in 50600 (least consuming)
2° fast-json-stringify in 210608
3° fast-json-stringify-and-normal-parse in 280216
4° fast-json-stringify-and-fast-parse in 284208

RESULTS SUMMARY FOR 1000 Runs - Serializing 8 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.4319 (fastest)
2° pick-document-properties in 0.9126
3° fast-json-stringify-and-normal-parse in 2.5254
4° fast-json-stringify-and-fast-parse in 2.5811

Memory used in byte (50% Percentile)
1° pick-document-properties in 504248 (least consuming)
2° fast-json-stringify in 2104264
3° fast-json-stringify-and-normal-parse in 2800488
4° fast-json-stringify-and-fast-parse in 2840296
```
Here is the script used for testing (run with `npx tsx ./packages/orama/test-orama.ts`):
```ts
import fastJson from "fast-json-stringify"
import fastParse from "fast-json-parse"
import { pickDocumentProperties } from "./src/utils"

const serialize = {
  // Return 1 prop for each document
  'props-1': {
    fastJson: fastJson({ type: 'object', properties: { string1: { type: 'string' } } }),
    pick: ['string1']
  },
  // Return 3 props for each document
  'props-3': {
    fastJson: fastJson({
      type: 'object',
      properties: {
        string1: { type: 'string' },
        number1: { type: 'number' },
        bool1: { type: 'boolean' }
      }
    }),
    pick: ['string1', 'number1', 'bool1']
  },
  // Return 6 props for each document
  'props-6': {
    fastJson: fastJson({
      type: 'object',
      properties: {
        string1: { type: 'string' },
        string2: { type: 'string' },
        number1: { type: 'number' },
        number2: { type: 'number' },
        bool1: { type: 'boolean' },
        bool2: { type: 'boolean' }
      }
    }),
    pick: ['string1', 'string2', 'number1', 'number2', 'bool1', 'bool2']
  },
  // Return 8 props for each document where 1 is an entire object and 1 is a single nested prop of another object
  'props-8': {
    fastJson: fastJson({
      type: 'object',
      properties: {
        string1: { type: 'string' },
        string2: { type: 'string' },
        number1: { type: 'number' },
        number2: { type: 'number' },
        bool1: { type: 'boolean' },
        bool2: { type: 'boolean' },
        // entire object
        object1: {
          type: 'object',
          properties: {
            string1: { type: 'string' },
            string2: { type: 'string' },
            number1: { type: 'number' },
            number2: { type: 'number' },
            bool1: { type: 'boolean' },
            bool2: { type: 'boolean' },
            nested: {
              type: 'object',
              properties: {
                string1: { type: 'string' },
                number1: { type: 'number' },
                bool1: { type: 'boolean' }
              }
            }
          }
        },
        // single nested field object2.nested.string1
        object2: {
          type: 'object',
          properties: {
            nested: {
              type: 'object',
              properties: {
                string1: { type: 'string' }
              }
            }
          }
        }
      }
    }),
    pick: ['string1', 'string2', 'number1', 'number2', 'bool1', 'bool2', 'object1', 'object2.nested.string1']
  }
}

function getNDocuments(n: number) {
  const response: any = []
  for (let index = 0; index < n; index++) {
    response.push({
      string1: 'foo bar',
      string2: 'foo bar',
      number1: 99.99,
      number2: 99.99,
      bool1: false,
      bool2: true,
      object1: {
        string1: 'foo bar',
        string2: 'foo bar',
        number1: 99.99,
        number2: 99.99,
        bool1: false,
        bool2: true,
        nested: {
          string1: 'foo bar',
          number1: 99.99,
          bool1: false
        }
      },
      object2: {
        string1: 'foo bar',
        string2: 'foo bar',
        number1: 99.99,
        number2: 99.99,
        bool1: false,
        bool2: true,
        nested: {
          string1: 'foo bar',
          number1: 99.99,
          bool1: false
        }
      }
    })
  }
  return response
}

function profiling(fn: (docs, props) => void, label: string, docs: any[], props: number) {
  const memoryBefore = process.memoryUsage().heapUsed
  const start = performance.now()
  fn(docs, props)
  const end = performance.now()
  const memoryAfter = process.memoryUsage().heapUsed
  const time = end - start
  const memory = memoryAfter - memoryBefore
  return { label, time, memory, count: docs.length, props }
}

const groupBy = (array, key) => {
  return array.reduce((result, currentValue) => {
    const groupKey = currentValue[key]
    if (!result[groupKey]) result[groupKey] = []
    result[groupKey].push(currentValue)
    return result
  }, {})
}

const percentile = (arr, p) => {
  const index = Math.ceil(arr.length * (p / 100)) - 1
  return arr[index]
}

const mean = (arr, prop) => {
  return arr.reduce((sum, item) => sum + item[prop], 0) / arr.length
}

const roundTo = (num, decimals = 2) => {
  const factor = Math.pow(10, decimals)
  return Math.round(num * factor) / factor
}

function printResults(results) {
  // { 10: [], 100: [] }
  const groupedByRuns = groupBy(results, 'count')
  for (const docs in groupedByRuns) {
    const groupedByLabel = groupBy(groupedByRuns[docs], 'label')
    const summary: any = {
      runs: 0,
      time: [],
      memory: []
    }
    // { 'fast-json-stringify': [], 'pick-document-properties': [] }
    for (const label in groupedByLabel) {
      const timeOrdered = [...groupedByLabel[label]]
      timeOrdered.sort((a, b) => a.time - b.time)
      const memoryOrdered = [...groupedByLabel[label]]
      memoryOrdered.sort((a, b) => a.memory - b.memory)
      const bestTime = timeOrdered[0]
      const worstTime = timeOrdered[timeOrdered.length - 1]
      const bestMemory = memoryOrdered[0]
      const worstMemory = memoryOrdered[memoryOrdered.length - 1]
      const avgTime = mean(timeOrdered, 'time')
      const avgMemory = mean(memoryOrdered, 'memory')
      const timePercentile25 = percentile(timeOrdered, 25)
      const timePercentile50 = percentile(timeOrdered, 50)
      const timePercentile75 = percentile(timeOrdered, 75)
      const timePercentile95 = percentile(timeOrdered, 95)
      const memoryPercentile25 = percentile(memoryOrdered, 25)
      const memoryPercentile50 = percentile(memoryOrdered, 50)
      const memoryPercentile75 = percentile(memoryOrdered, 75)
      const memoryPercentile95 = percentile(memoryOrdered, 95)
      summary.time.push(timePercentile50)
      summary.memory.push(memoryPercentile50)
      summary.runs = timeOrdered.length
      console.log(label)
      console.table([
        {
          "Stats": 'Time ms',
          "25%": roundTo(timePercentile25.time, 4),
          "50%": roundTo(timePercentile50.time, 4),
          "75%": roundTo(timePercentile75.time, 4),
          "95%": roundTo(timePercentile95.time, 4),
          "Average (Mean)": roundTo(avgTime, 4),
          "Best (Min)": roundTo(bestTime.time, 4),
          "Worst (Max)": roundTo(worstTime.time, 4)
        },
        {
          "Stats": 'Memory byte',
          "25%": roundTo(memoryPercentile25.memory, 4),
          "50%": roundTo(memoryPercentile50.memory, 4),
          "75%": roundTo(memoryPercentile75.memory, 4),
          "95%": roundTo(memoryPercentile95.memory, 4),
          "Average (Mean)": roundTo(avgMemory, 4),
          "Best (Min)": roundTo(bestMemory.memory, 4),
          "Worst (Max)": roundTo(worstMemory.memory, 4)
        }
      ])
    }
    summary.time.sort((a, b) => a.time - b.time)
    summary.memory.sort((a, b) => a.memory - b.memory)
    const fastest = summary.time[0]
    console.log(`\n\nRESULTS SUMMARY FOR ${summary.runs} Runs - Serializing ${fastest.props} properties in ${fastest.count} docs`)
    console.log(`\nTime elapsed in ms (50% Percentile)`)
    summary.time.forEach((item, index) => {
      console.log(`${index + 1}° ${item.label} in ${roundTo(item.time, 4)}${index === 0 ? ' (fastest)' : ''}`)
    })
    console.log(`\nMemory used in byte (50% Percentile)`)
    summary.memory.forEach((item, index) => {
      console.log(`${index + 1}° ${item.label} in ${roundTo(item.memory, 4)}${index === 0 ? ' (least consuming)' : ''}`)
    })
  }
}

function useFastJson(docs, props) {
  const serializer = serialize[`props-${props}`].fastJson
  return docs.map((doc) => serializer(doc))
}

function useFastJsonAndNormalParse(docs, props) {
  const serializer = serialize[`props-${props}`].fastJson
  return docs.map((doc) => JSON.parse(serializer(doc)))
}

function useFastJsonAndFastParse(docs, props) {
  const serializer = serialize[`props-${props}`].fastJson
  return docs.map((doc) => fastParse(serializer(doc)).value)
}

function usePickDocumentProperties(docs, props) {
  const properties = serialize[`props-${props}`].pick
  return docs.map((doc) => pickDocumentProperties(doc, properties))
}

const runs = (docs, props) => {
  const results: any = []
  for (let i = 0; i < 1000; i++) {
    results.push(profiling(useFastJson, 'fast-json-stringify', docs, props))
    results.push(profiling(useFastJsonAndNormalParse, 'fast-json-stringify-and-normal-parse', docs, props))
    results.push(profiling(useFastJsonAndFastParse, 'fast-json-stringify-and-fast-parse', docs, props))
    results.push(profiling(usePickDocumentProperties, 'pick-document-properties', docs, props))
  }
  printResults(results)
}

const init = (props: 1 | 3 | 6 | 8) => {
  const docs = getNDocuments(10000)
  const docs_10 = docs.slice(0, 10)
  const docs_100 = docs.slice(0, 100)
  const docs_1000 = docs.slice(0, 1000)
  const docs_10000 = docs.slice(0, 10000)
  // Execute 1000 runs on 10 docs
  runs(docs_10, props)
  // Execute 1000 runs on 100 docs
  runs(docs_100, props)
  // Execute 1000 runs on 1,000 docs
  runs(docs_1000, props)
  // Execute 1000 runs on 10,000 docs
  runs(docs_10000, props)
}

init(1) // Serialize 1 prop for each document
init(3) // Serialize 3 props for each document
init(6) // Serialize 6 props for each document
init(8) // Serialize 8 props for each document where 1 is an entire object and 1 is a single nested prop of another object
```
One other thing to consider: @orama/orama must remain dependency-free.
Ok, is there anything I need to do on this PR (besides the tests)?
Closed due to inactivity.