Data frame: Typed frame vs. type erasure in practice?
I am looking at #174 with regard to the suggested typed frame. I assume this would essentially be something like
template <class... Ts>
using xframe = std::map<std::string, std::variant<Ts...>>;
I think I really like the idea of having a std::variant here. In combination with std::visit, implementing algorithms for this looks very convenient. :+1:
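To make the std::variant plus std::visit idea concrete, here is a minimal sketch. The alias frame_sketch and the function column_sum are hypothetical names used only for illustration; they are not taken from #174 or from xframe itself, and the columns are assumed to be plain std::vector containers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for the typed frame discussed above (not xframe's definition).
template <class... Ts>
using frame_sketch = std::map<std::string, std::variant<Ts...>>;

// Sums one named column, whatever its element type is.
// The function is templated on Ts..., so it is instantiated once per frame type.
template <class... Ts>
double column_sum(const frame_sketch<Ts...>& frame, const std::string& name)
{
    return std::visit(
        [](const auto& column) {
            double total = 0.0;
            for (const auto& value : column)
            {
                total += static_cast<double>(value);
            }
            return total;
        },
        frame.at(name));
}

// Usage: a frame whose columns are either vector<double> or vector<int64_t>.
inline double column_sum_example()
{
    using frame_t = frame_sketch<std::vector<double>, std::vector<std::int64_t>>;
    frame_t frame;
    frame["temperature"] = std::vector<double>{1.5, 2.5};
    frame["counts"] = std::vector<std::int64_t>{1, 2, 3};
    return column_sum(frame, "temperature") + column_sum(frame, "counts");
}
```

The generic lambda is instantiated for every alternative of the variant, which is what makes the algorithm convenient to write and also what ties it to the full type list.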
I have a couple of concerns however, which may be rooted in my lack of understanding of what you are proposing, i.e., do not see this as a criticism of the approach per se:
- Does every custom algorithm or function that takes an xframe as an argument have to be templated on Ts...?
  - In general, a frame may contain tens of variables, not all of which are just plain double or int64_t, i.e., we have a combinatorial explosion of the number of xframe types.
  - Many algorithms will not care about most of the other types present in the frame. Do they nevertheless need to depend on (template on) the types of those?
  - Doesn't this lead to absolutely massive code size? ... and compilation times?
  - Does adding a new type to a frame require recompiling the whole codebase, including all libraries built on xframe?
  - Can binaries of a library be shipped (or rather, would it be useful at all)? If some custom code adds a custom type to the frame that is not known to any of the algorithms in a library, it implies that the algorithm cannot be used with such a frame (even if the custom type is irrelevant to the algorithm).
- Considering a Python interface:
  - Do we need to instantiate all possible combinations of types and have separate Python exports for all of them?
  - Do we furthermore need to instantiate and export all functions/algorithms for all possible combinations?
- How is compatibility between two xframe objects handled? We need to support, e.g., xframe::operator+=(const xframe &other).
  - We definitely must support an other that actually has different content than *this (within certain limits). For example, *this might contain some additional variables or coordinates that are not present in other. If the remaining coordinates and variable names match, an operation is still possible and should be supported.
  - For implementing a custom algorithm that takes, e.g., two input frames, we would need to pass two parameter packs to that algorithm, so we would have something like
    template<class... As, template <class...> class A, class... Bs, template <class...> class B>
    void myAlg(A<As...> &frameA, const B<Bs...> &frameB) { /*...*/ }
    Am I missing something? This looks quite complicated. While certainly doable for library internals, it looks quite complicated when targeting an average C++ developer. Is there a way to avoid this (except for having a way too generic template <class F1, class F2> void myAlg;)? Again, I am also concerned about the number of instantiations. I assume it could quickly reach hundreds or even thousands in practice when supporting two input frames (at least when doing explicit instantiations such that xframe is usable from Python)? Supporting three or more input frames seems totally impossible, unless maybe we just have a single xframe type for a variant with all possible supported types (not sure this would still be possible with expressions then, since adding a set of expressions to the list of supported types seems unrealistic?).
- Is it possible to provide an intuitive API for the frame type, given that adding/removing variables may change the type?
  - #174 suggests operator|, which is very different from Python, where we would have something like frame["data1"] = variable. Is such an asymmetry between C++ and Python a problem?
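The first concern in the list above can be illustrated with a small sketch. All names here (frame_sketch, frame_a, frame_b, count_doubles) are hypothetical and only model the std::map of std::variant alias assumed at the top of this issue, with scalar alternatives to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-in for the typed frame (not xframe's definition).
template <class... Ts>
using frame_sketch = std::map<std::string, std::variant<Ts...>>;

using frame_a = frame_sketch<double, std::int64_t>;
using frame_b = frame_sketch<double, std::int64_t, float>;

// Adding a single alternative produces a distinct, unrelated frame type.
static_assert(!std::is_same<frame_a, frame_b>::value,
              "frames with different type lists are different types");

// This algorithm only looks at double variables, yet it is instantiated
// separately for frame_a and frame_b because F carries the whole type list.
template <class F>
std::size_t count_doubles(const F& frame)
{
    std::size_t n = 0;
    for (const auto& entry : frame)
    {
        if (std::holds_alternative<double>(entry.second))
        {
            ++n;
        }
    }
    return n;
}

inline frame_a make_frame_a()
{
    frame_a f;
    f["x"] = 1.0;
    f["n"] = std::int64_t{2};
    return f;
}

inline frame_b make_frame_b()
{
    frame_b f;
    f["x"] = 1.0;
    f["y"] = 2.0;
    f["w"] = 1.0f;
    return f;
}
```

Calling count_doubles on both frames compiles two instantiations of the same body, which is the code-size and recompilation effect asked about.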
Just to add another complication into the mix: For handling physical units, something like boost::units would fit well into the picture (compile-time unit checking based on the type system). If every xf::variable has a unit (in addition to data, dimensions, and coordinates), the number of distinct types to support grows even more.
1. Variant vs type-erasure
It depends on the level of genericity / type control you want to give to your API; the function can accept a very generic type, a frame type (in that case it has to be templated), or be enabled depending on some requirement on its generic template parameter:
template <class F>
auto do_something(const F& frame);
template <class... Ts>
auto do_something(const xframe<Ts...>& frame);
template <class F, XTL_REQUIRES(is_xframe_expression<F>)>
auto do_something(const F& frame);
The combinatorial explosion of the number of types is the drawback of the variant. That's why the code for dynamic variables has not been removed, and I think that at some point xframe will provide several frame types, one of them supporting dynamic variables and type erasure. This will also avoid the need to recompile all the code depending on a frame type that has changed.
I see variant and type erasure as complementary approaches rather than competitors. The first one gives you more static checking and performance, the second one more flexibility. The first one is easier and faster to implement though, since it does not require the implementation of a dynamic expression system.
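For contrast with the variant-based frame, a type-erased variable could look roughly like the sketch below. This is not the dynamic variable code that exists in xframe; any_column, dynamic_frame and total_sum are hypothetical names, and the erased interface is reduced to two operations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type-erased column: the element type is hidden behind a
// small virtual interface, so the column type itself is not a template.
class any_column
{
private:
    struct concept_t
    {
        virtual ~concept_t() = default;
        virtual std::size_t size() const = 0;
        virtual double sum() const = 0;
    };

    template <class T>
    struct model final : concept_t
    {
        explicit model(std::vector<T> d) : data(std::move(d)) {}
        std::size_t size() const override { return data.size(); }
        double sum() const override
        {
            double total = 0.0;
            for (const auto& v : data)
            {
                total += static_cast<double>(v);
            }
            return total;
        }
        std::vector<T> data;
    };

    std::shared_ptr<const concept_t> m_impl;

public:
    template <class T>
    explicit any_column(std::vector<T> data)
        : m_impl(std::make_shared<model<T>>(std::move(data)))
    {
    }

    std::size_t size() const { return m_impl->size(); }
    double sum() const { return m_impl->sum(); }
};

// A single, non-template frame type.
using dynamic_frame = std::map<std::string, any_column>;

// Non-template algorithm: compiled once, independent of the element types.
inline double total_sum(const dynamic_frame& frame)
{
    double total = 0.0;
    for (const auto& entry : frame)
    {
        total += entry.second.sum();
    }
    return total;
}

inline double total_sum_example()
{
    dynamic_frame frame;
    frame.emplace("a", any_column(std::vector<double>{1.0, 2.0}));
    frame.emplace("b", any_column(std::vector<std::int64_t>{1, 2, 3}));
    return total_sum(frame);
}
```

The trade-off is visible here: total_sum can live in a shipped binary and accepts columns of types it has never seen, but every access goes through a virtual call and the element type is no longer checked at compile time.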
2. Python Interface
I'm afraid that we don't have a choice here. So the idea is to limit the number of types we expose (only frames on variable containers, not expressions, with a limited number of scalar types). Dynamic variables can also be of help here, although we probably don't want to completely erase the scalar types of the data. A dynamic expression system would allow us to expose arithmetic operators and so on, and keep the lazy evaluation; without such a system, we need to expose operators that evaluate and return their result.
3. Compatibility between frame objects
I agree that xframes involved in an operation could have different types; however, in practice you will align the frames before the operations (for performance considerations), so there should not be a lot of different frame types (the difference should be in the scalar types embedded in the different frames).
All expressions in xframe (as in xtensor) provide an expression_tag type member, so it is possible to enable / disable an operator with an expressive syntax (thanks to some utilities provided in xtl):
template <class FT1, class FT2, XTL_REQUIRES(is_frame_expression<FT1>, is_frame_expression<FT2>)>
inline auto my_operation(const FT1& frame1, const FT2& frame2)
{
}
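For readers without xtl at hand, the same kind of constraint can be sketched with the standard library only. The sketch below uses std::enable_if_t and std::void_t as stand-ins; it does not claim to show how XTL_REQUIRES or is_frame_expression are implemented, and frame_tag_sketch, is_frame_sketch and the two example types are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical tag, playing the role of the expression_tag mentioned above.
struct frame_tag_sketch {};

// A type that advertises itself as a frame expression through its tag.
struct my_frame
{
    using expression_tag = frame_tag_sketch;
};

// A type without any tag.
struct not_a_frame {};

// Trait: true only if T has an expression_tag equal to frame_tag_sketch.
template <class T, class = void>
struct is_frame_sketch : std::false_type {};

template <class T>
struct is_frame_sketch<T, std::void_t<typename T::expression_tag>>
    : std::is_same<typename T::expression_tag, frame_tag_sketch> {};

// Enabled only for frame-like types.
template <class F, std::enable_if_t<is_frame_sketch<F>::value, int> = 0>
int do_something(const F&)
{
    return 1;
}

// Fallback for everything else, present only to make the dispatch observable.
template <class F, std::enable_if_t<!is_frame_sketch<F>::value, int> = 0>
int do_something(const F&)
{
    return 0;
}
```

The user-facing signature stays a single template parameter per frame, so the parameter packs of the frame type never appear in user code.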
Regarding operator+=, I don't see any problem, since the types of the frames are the same. Nothing prevents you from inserting a new variable in an existing frame as long as its type is already included in the frame type.
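A minimal sketch of that point, again on the hypothetical std::map of std::variant alias rather than on the real xframe types: two frames of the same type may hold different sets of variables, and a compound addition can simply skip what does not match. The semantics chosen here (ignore variables missing on either side, ignore mismatched alternatives) are an assumption for illustration, not a decided design.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-in for the typed frame (not xframe's definition).
template <class... Ts>
using frame_sketch = std::map<std::string, std::variant<Ts...>>;

// In-place addition for two frames of the same type but different content.
template <class... Ts>
void add_in_place(frame_sketch<Ts...>& lhs, const frame_sketch<Ts...>& rhs)
{
    for (const auto& entry : rhs)
    {
        auto it = lhs.find(entry.first);
        if (it == lhs.end())
        {
            continue;  // variable only present in rhs: ignored in this sketch
        }
        std::visit(
            [](auto& a, const auto& b) {
                using lhs_t = std::decay_t<decltype(a)>;
                using rhs_t = std::decay_t<decltype(b)>;
                // Only add when both sides hold the same alternative.
                if constexpr (std::is_same_v<lhs_t, rhs_t>)
                {
                    a += b;
                }
            },
            it->second, entry.second);
    }
}

inline frame_sketch<double, std::int64_t> add_in_place_example()
{
    using frame_t = frame_sketch<double, std::int64_t>;
    frame_t lhs;
    lhs["x"] = 1.0;
    lhs["n"] = std::int64_t{2};
    lhs["extra"] = 5.0;  // not present in rhs, left untouched

    frame_t rhs;
    rhs["x"] = 0.5;
    rhs["n"] = std::int64_t{3};

    add_in_place(lhs, rhs);

    // Inserting a new variable works because int64_t is already in the type list.
    lhs["late"] = std::int64_t{7};
    return lhs;
}
```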
4. Intuitive API
The operator| won't directly return a new frame, but rather a frame builder. A "terminal" object will be required to trigger the build of the frame (this avoids computing the union of coordinates each time you add a variable).
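The builder idea could be sketched as follows. Everything here is hypothetical (variable_sketch, built_frame_sketch, frame_builder_sketch, and the terminal object build); the real variables carry coordinates and dimensions, and the expensive step being deferred would be the union of coordinates, which this sketch replaces with a plain map construction.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified variable: a name plus some data.
struct variable_sketch
{
    std::string name;
    std::vector<double> data;
};

// Hypothetical frame produced by the builder.
struct built_frame_sketch
{
    std::map<std::string, std::vector<double>> variables;
};

// The builder only collects variables; nothing is computed yet.
struct frame_builder_sketch
{
    std::vector<variable_sketch> pending;
};

// Terminal object that triggers the actual build.
struct build_tag {};
constexpr build_tag build{};

// variable | variable starts a builder.
inline frame_builder_sketch operator|(variable_sketch lhs, variable_sketch rhs)
{
    frame_builder_sketch builder;
    builder.pending.push_back(std::move(lhs));
    builder.pending.push_back(std::move(rhs));
    return builder;
}

// builder | variable appends to the builder.
inline frame_builder_sketch operator|(frame_builder_sketch builder, variable_sketch v)
{
    builder.pending.push_back(std::move(v));
    return builder;
}

// builder | build performs the (potentially expensive) construction once.
inline built_frame_sketch operator|(frame_builder_sketch builder, build_tag)
{
    built_frame_sketch frame;
    for (auto& v : builder.pending)
    {
        frame.variables.emplace(std::move(v.name), std::move(v.data));
    }
    return frame;
}

inline built_frame_sketch builder_example()
{
    return variable_sketch{"a", {1.0}}
         | variable_sketch{"b", {2.0, 3.0}}
         | variable_sketch{"c", {}}
         | build;
}
```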
Regarding the addition / removal of a variable to / from an already existing frame, I think it's totally possible to provide an intuitive API for a C++14 user:
auto new_frame = existing_frame.add("name", var);
Whether variables are moved from one frame to the other, making existing_frame invalid after the operation, has to be discussed. We can also provide both options with some policy / flag.
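As a rough illustration of the copying option, here is a sketch of an add operation that returns a frame with an extended type list. It is written as a free function on the hypothetical std::map of std::variant alias, not as the member function shown above, and it assumes the new type U is not already one of Ts... (otherwise the resulting variant would contain a duplicate alternative and the assignments below would not compile).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical stand-in for the typed frame (not xframe's definition).
template <class... Ts>
using frame_sketch = std::map<std::string, std::variant<Ts...>>;

// Returns a new frame of type frame_sketch<Ts..., U>; the input frame is
// copied and stays valid (the "copy" policy, chosen here for simplicity).
template <class U, class... Ts>
frame_sketch<Ts..., U> add_variable(const frame_sketch<Ts...>& frame,
                                    const std::string& name,
                                    U value)
{
    frame_sketch<Ts..., U> result;
    for (const auto& entry : frame)
    {
        // Re-wrap each stored value in the wider variant.
        std::visit([&](const auto& v) { result[entry.first] = v; }, entry.second);
    }
    result[name] = std::move(value);
    return result;
}

inline frame_sketch<double, std::int64_t, float> add_variable_example()
{
    frame_sketch<double, std::int64_t> existing_frame;
    existing_frame["x"] = 1.0;

    auto new_frame = add_variable(existing_frame, "flag", 2.5f);
    static_assert(
        std::is_same<decltype(new_frame),
                     frame_sketch<double, std::int64_t, float>>::value,
        "adding a float variable extends the type list");
    return new_frame;
}
```

A moving variant of the same function would take the frame by rvalue reference and move each stored value out, which is where the policy / flag question comes in.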
I think we won't be able to totally mimic a Python API on the C++ side (except for the frame with dynamic variables); however, if we restrict the number of types exposed to Python, building such an API on top of the more generic one should be quite straightforward.