Shortcut for df.join() and select all columns?
Hi, not sure if there is a better place to ask this, but I've been enjoying using dataforge, except that joining together multiple dataframes and selecting all columns becomes very verbose. Is there some way to just select all columns? I'm also not concerned with column collisions, as in my scenario, if the column name is the same between the tables, then it's a join column, and the values would be the same either way, so values could come from either the right or lefthand dataframe-
Do you have some sample code that you can show? We might be able to figure out another way to do it.
@ashleydavis Hi, sorry for delay, just getting back to this project. Yes, here's some sample code, and a sandbox it's running in:
import { DataFrame } from "data-forge";
//typescript util for getting an array of keys with correct typing
interface UnionArray<T> extends Omit<Array<T>, "concat"> {
concat<U>(...items: (U | ConcatArray<U>)[]): (T | U)[];
}
let dataOne = [
{ A: 1, B: 10 },
{ A: 2, B: 20 },
{ A: 3, B: 30 }
];
let dataTwo = [
{ A: 1, C: 4, D: 40 },
{ A: 2, C: 5, D: 50 },
{ A: 3, C: 6, D: 60 }
];
type DataOne = typeof dataOne[number];
type DataTwo = typeof dataTwo[number];
//The types here are of course inferred in this example, but this is added for clarity
let dfOne = new DataFrame<number, DataOne>({
values: dataOne
});
let dfTwo = new DataFrame<number, DataTwo>({
values: dataTwo
});
//This is the standard join approach
const dfJoined = dfOne.join(
dfTwo,
(left) => left.A,
(right) => right.A,
(left, right) => {
return {
A: left.A,
B: left.B,
C: right.C,
D: right.D
};
}
);
//Here's a way to do this automatically, but I'd like to not have to do all this
let dataOneColumns = dfOne.getColumnNames() as UnionArray<keyof DataOne>;
let dataTwoColumns = dfTwo.getColumnNames() as UnionArray<keyof DataTwo>;
const genericAllColumnSelectorFn = (left: DataOne, right: DataTwo) => {
return {
...dataOneColumns.reduce((acc, cur) => {
acc[cur] = left[cur]; //returns the "A: left.A" selector format
return acc;
}, {} as Record<keyof DataOne, string | number>),
...dataTwoColumns.reduce((acc, cur) => {
acc[cur] = right[cur]; //returns the "C: right.C" selector format
return acc;
}, {} as Record<keyof DataTwo, string | number>)
} as DataOne & DataTwo;
};
const dfJoinedSelectAll = dfOne.join(
dfTwo,
(left) => left.A,
(right) => right.A,
genericAllColumnSelectorFn
);
https://codesandbox.io/s/crazy-meadow-324fsi?file=/src/index.ts
So basically I'd like to select all the columns without having to explicitly define them all. I can use the dynamic approach above, but that also is verbose. However, it's better than writing out all the columns from a join with tables that each have 10-20 columns...
Thanks so much for the example code @chriszrc chriszrc
I do believe I have simpler way of doing this using the function DataFrame.merge, here's an updated example showing both complex and simple versions:
import { DataFrame } from "data-forge";
//typescript util for getting an array of keys with correct typing
interface UnionArray<T> extends Omit<Array<T>, "concat"> {
concat<U>(...items: (U | ConcatArray<U>)[]): (T | U)[];
}
let dataOne = [
{ A: 1, B: 10 },
{ A: 2, B: 20 },
{ A: 3, B: 30 }
];
let dataTwo = [
{ A: 1, C: 4, D: 40 },
{ A: 2, C: 5, D: 50 },
{ A: 3, C: 6, D: 60 }
];
type DataOne = typeof dataOne[number];
type DataTwo = typeof dataTwo[number];
//The types here are of course inferred in this example, but this is added for clarity
let dfOne = new DataFrame<number, DataOne>({
values: dataOne
});
let dfTwo = new DataFrame<number, DataTwo>({
values: dataTwo
});
//
// COMPLEX WAY OF DOING IT.
//
//This is the standard join approach
const dfJoined = dfOne.join(
dfTwo,
(left) => left.A,
(right) => right.A,
(left, right) => {
return {
A: left.A,
B: left.B,
C: right.C,
D: right.D
};
}
);
//Here's a way to do this automatically, but I'd like to not have to do all this
let dataOneColumns = dfOne.getColumnNames() as UnionArray<keyof DataOne>;
let dataTwoColumns = dfTwo.getColumnNames() as UnionArray<keyof DataTwo>;
const genericAllColumnSelectorFn = (left: DataOne, right: DataTwo) => {
return {
...dataOneColumns.reduce((acc, cur) => {
acc[cur] = left[cur]; //returns the "A: left.A" selector format
return acc;
}, {} as Record<keyof DataOne, string | number>),
...dataTwoColumns.reduce((acc, cur) => {
acc[cur] = right[cur]; //returns the "C: right.C" selector format
return acc;
}, {} as Record<keyof DataTwo, string | number>)
} as DataOne & DataTwo;
};
const dfJoinedSelectAll = dfOne.join(
dfTwo,
(left) => left.A,
(right) => right.A,
genericAllColumnSelectorFn
);
console.log(dfJoinedSelectAll.toString());
//
// SIMPLE WAY OF DOING IT.
//
const merged = DataFrame.merge([dfOne, dfTwo]);
console.log(merged.toString());
Output:
> npx ts-node example.ts
__index__ A B C D
--------- - -- - --
0 1 10 4 40
1 2 20 5 50
2 3 30 6 60
__index__ A B C D
--------- - -- - --
0 1 10 4 40
1 2 20 5 50
2 3 30 6 60
Please let me know if this works for you.
@ashleydavis Hi, hmm, merge appears to rely on the internal numeric index order of the rows? Is that right? What would that mean if I want to join on column B, or if the A values weren't in order?
I think merge relies on the value in the index, but not the particular order (even though in this example those are the same).