static-frame Frame.from_concat should preserve the column order of the first frame, rather than sorting.

When (vertically) stacking frames that have the same columns but different orders, StaticFrame sorts the columns of the resulting frame:

f1 = sf.Frame.from_element(1, index=[0], columns=['c', 'a', 'b'])
f2 = sf.Frame.from_element(1, index=[0], columns=['c', 'b', 'a'])
sf.Frame.from_concat((f1, f2), index=sf.IndexAutoFactory)
<Frame>
<Index> a       b       c       <<U1>
<Index>
0       1       1       1
1       1       1       1
<int64> <int64> <int64> <int64>

I think this is undesirable, as it introduces a third column ordering that matches neither input frame. In my case, with real-world data, a handful of my ~50 columns being in a different order between the stacked frames produces a wildly different final column order, which initially appeared random.

Instead, I think it's more consistent (and perhaps simpler?) to just re-apply the order of the first frame's columns in this situation. In the cases where I've encountered this so far, I've ended up using a helper function like this:

import typing as tp

import static_frame as sf

def stack_frames(frames: tp.Iterable[sf.Frame]) -> sf.Frame:
    '''Stack the given frames (vertically), assuming they have the same columns.
    Re-apply the same column order as the first frame.
    '''
    # Materialize the iterable so it can be both indexed and concatenated.
    frames = tuple(frames)
    initial_order = frames[0].columns
    result = sf.Frame.from_concat(frames, index=sf.IndexAutoFactory)
    return result.reindex(columns=initial_order)

Thoughts?

Jul 14 '20 20:07 ForeverWintr

Thanks for this suggestion. The sorting is an artifact of performing set operations (union, generally) on the columns, which is necessary to determine final column constituents. An additional ordering could be applied in the special case you have identified (identical labels in a different ordering), but I suspect it would be no simpler or more efficient than your helper function. I will examine the implementation and continue to think about what options might be available.
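
As a rough illustration in plain Python (this is not StaticFrame's actual implementation, which isn't shown in this thread, and `union_sorted` is a hypothetical name): a set union discards the order of its inputs, so some ordering has to be imposed afterwards, and sorting is one deterministic choice.

```python
def union_sorted(*label_sequences):
    '''Union the labels as sets, then sort to get a deterministic order.'''
    labels = set()
    for seq in label_sequences:
        labels |= set(seq)  # set union: the input order is not retained
    return sorted(labels)

# Same labels, two different orders, as in the example above.
columns = union_sorted(['c', 'a', 'b'], ['c', 'b', 'a'])
print(columns)  # ['a', 'b', 'c']
```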

Jul 15 '20 13:07 flexatone

I forgot that concatenating means taking a union index! I see that my suggestion doesn't work in cases where the indexes don't have the same constituents.

Another solution might be to apply some form of cascading order priority, where the column order of the first frame is preserved, then new columns introduced by the second frame have their order preserved, and so on. Although this is more complex, my guess is that it's probably less expensive than a full sort, as well as being more intuitive. This seems to be the approach pandas takes, for what it's worth:

import pandas as pd
import static_frame as sf

f1 = sf.Frame.from_element(1, index=[0], columns=['c', 'a', 'b', 'y'])
f2 = sf.Frame.from_element(1, index=[0], columns=['c', 'b', 'x', 'd', 'a'])

# Convert to pandas and concatenate without sorting the columns.
df1, df2 = (x.to_pandas() for x in (f1, f2))
pd.concat((df1, df2), sort=False, ignore_index=True)
   c  a  b    y    x    d
0  1  1  1  1.0  NaN  NaN
1  1  1  1  NaN  1.0  1.0
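
To make the cascading idea concrete, here is a minimal plain-Python sketch of an order-preserving union. The name `ordered_union` is hypothetical (it is not part of StaticFrame or pandas), and it relies on `dict` preserving insertion order (Python 3.7+).

```python
import typing as tp

def ordered_union(
        label_sequences: tp.Iterable[tp.Iterable[tp.Hashable]],
        ) -> tp.List[tp.Hashable]:
    '''Return all labels in order of first appearance across the sequences.'''
    seen: tp.Dict[tp.Hashable, None] = {}
    for labels in label_sequences:
        for label in labels:
            # Only the first occurrence of a label fixes its position.
            seen.setdefault(label, None)
    return list(seen)

order = ordered_union([
    ['c', 'a', 'b', 'y'],
    ['c', 'b', 'x', 'd', 'a'],
])
print(order)  # ['c', 'a', 'b', 'y', 'x', 'd']
```

This is a single pass over the labels with no sort, and it gives the same column order as the pandas output above. The result could presumably be passed to reindex(columns=...) on the concatenated frame, as in my helper function earlier, though I haven't measured whether that is actually cheaper than sorting.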

Jul 16 '20 17:07 ForeverWintr