Frame.from_concat should preserve the column order of the first frame, rather than sorting.

Open ForeverWintr opened this issue 5 years ago • 2 comments

When (vertically) stacking frames that have the same columns but different orders, StaticFrame sorts the columns of the resulting frame:

import static_frame as sf

f1 = sf.Frame.from_element(1, index=[0], columns=['c', 'a', 'b'])
f2 = sf.Frame.from_element(1, index=[0], columns=['c', 'b', 'a'])
sf.Frame.from_concat((f1, f2), index=sf.IndexAutoFactory)
<Frame>
<Index> a       b       c       <<U1>
<Index>
0       1       1       1
1       1       1       1
<int64> <int64> <int64> <int64>

I think this is undesirable, as it introduces a third column ordering. In my case with real world data, a handful of my ~50 columns having different order between the stacked frames results in wildly different final column order, which initially appeared random.

Instead, I think it's more consistent (and perhaps simpler?) to just re-apply the order of the first frame's columns in this situation. In the cases where I've encountered this so far, I've ended up using a helper function like this:

import typing as tp

import static_frame as sf

def stack_frames(frames: tp.Iterable[sf.Frame]) -> sf.Frame:
    '''Stack the given frames (vertically), assuming they have the same columns.
    Re-apply the same column order as the first frame.
    '''
    frames = tuple(frames)
    initial_order = frames[0].columns
    result = sf.Frame.from_concat(frames, index=sf.IndexAutoFactory)
    return result.reindex(columns=initial_order)

Thoughts?

ForeverWintr, Jul 14 '20 20:07

Thanks for this suggestion. The sorting is an artifact of performing set operations (union, generally) on the columns, which is necessary to determine final column constituents. An additional ordering could be applied in the special case you have identified (identical labels in a different ordering), but I suspect it would be no simpler or more efficient than your helper function. I will examine the implementation and continue to think about what options might be available.
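
As a rough illustration of why a set-based union leads to sorting, here is a plain-Python sketch. It uses built-in sets and lists only and is not StaticFrame's actual implementation; the variable names are made up for the example.

```python
# Plain-Python illustration; not StaticFrame's actual implementation.
first = ['c', 'a', 'b']
second = ['c', 'b', 'a']

# A set union finds the right constituents but carries no defined order,
# so neither input ordering survives.
labels = set(first) | set(second)

# Sorting is one way to get a deterministic result from the unordered set.
sorted_labels = sorted(labels)
print(sorted_labels)  # ['a', 'b', 'c']
```

Re-applying the first frame's order afterwards is an extra step on top of this, which is roughly what the helper function in the original post does.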

flexatone, Jul 15 '20 13:07

I forgot that concatenating means taking a union index! I see that my suggestion doesn't work in cases where the indexes don't have the same constituents.

Another solution might be to apply some form of cascading order priority, where column order of the first frame is preserved, then new columns introduced by the second frame have their order preserved, and so on. Although this is more complex, my intuition is that it's probably less expensive than a full sort, as well as being more intuitive. This seems to be the approach pandas takes, for what it's worth:

f1 = sf.Frame.from_element(1, index=[0], columns=['c', 'a', 'b', 'y'])
f2 = sf.Frame.from_element(1, index=[0], columns=['c', 'b', 'x', 'd', 'a'])

import pandas as pd

df1, df2 = (x.to_pandas() for x in (f1, f2))
pd.concat((df1, df2), sort=False, ignore_index=True)
   c  a  b    y    x    d
0  1  1  1  1.0  NaN  NaN
1  1  1  1  NaN  1.0  1.0
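
The cascading order described above could be sketched in plain Python roughly as follows. `cascading_union` is a hypothetical helper written for this example, not part of StaticFrame or pandas, and it works on plain lists of labels rather than on real index objects. It relies on dicts preserving insertion order (Python 3.7+).

```python
import typing as tp


def cascading_union(
        label_groups: tp.Iterable[tp.Iterable[tp.Hashable]],
        ) -> tp.List[tp.Hashable]:
    '''Union of labels, keeping first-seen order across groups.

    Hypothetical helper for illustration only.
    '''
    seen: tp.Dict[tp.Hashable, None] = {}
    for labels in label_groups:
        for label in labels:
            # setdefault only inserts labels not seen before, so earlier
            # groups take priority in the final ordering.
            seen.setdefault(label, None)
    return list(seen)


columns = cascading_union([
    ['c', 'a', 'b', 'y'],
    ['c', 'b', 'x', 'd', 'a'],
])
print(columns)  # ['c', 'a', 'b', 'y', 'x', 'd']
```

This is a single pass over all labels with no sort, and for these two inputs it gives the same column order as the pandas output above.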

ForeverWintr, Jul 16 '20 17:07