APIs for both building pipelines and data analysis
From my own experience, there are (at least) two very different use cases for dataframes:
- Doing some real-time data analysis in notebooks
- Building production pipelines
While I think pandas (and later Vaex, Dask, Modin...) did a very reasonable job of building a single tool that serves both use cases, there are trade-offs that IMO will bias any API towards one or the other.
Some specific examples:
- Eager/lazy modes: In case 1 (data analysis) eager mode is probably preferable, while in case 2 (pipelines), lazy mode has more advantages
- Automatic type inference/casting: I see the advantage of having all sorts of magic for inferring types and guessing, when the user is able to check the results of the operations at every step. But when, instead of executing cell by cell, there is a big pipeline that is executed as a batch process, I find this problematic. I think it's worth having to be explicit and avoiding any magic: it helps prevent bugs, and errors are not propagated to later stages in the pipeline, where they are difficult to identify.
I wrote a post about this that describes this point of view in more detail.
I think it can help the discussions to keep in mind that there are at least two main use cases, and that there will be trade offs among them.
Feedback here very welcome.
This is a great point @datapythonista, and type casting behaviour is probably the best example of where you want different semantics for the same functionality.
Eager vs lazy really is a separate topic I think, one that's independent of API (it's about performance, not semantics). Also, for developing production pipelines you want to develop using eager mode: it's simply the more intuitive way to develop code. Then you want to flip a switch to get lazy behaviour for possible performance optimizations in production. This is exactly what the PyTorch/TensorFlow experience showed (TensorFlow 2 switched to eager by default, and both libraries have a graph-compiled mode for production use cases). It's completely analogous for dataframes, except the libraries are a little less mature. At this point I'd suggest keeping eager/lazy out of this topic. @teoliphant volunteered to write up a summary of eager/lazy as a standalone topic.
I mostly agree, but I think the API may be affected. With an example:
```python
import sql  # hypothetical library, for illustration only

query = sql.connect('foo.db')
query = query.select('*')
query = query.from_('table1')  # "from" is a reserved word in Python
query.execute()
```
This is obviously a lazy API. In eager mode the line `query = query.select('*')` is meaningless. So, if you want to make this eager, the API needs to be changed.
So, IMO, a decision needs to be made before defining an API among:
- Lazy
- Eager
- Hybrid (the same syntax can be executed both in lazy and eager modes)
- Mixed (some things are executed in eager mode, some in lazy mode, see @maartenbreddels comment below)
Maybe we're already assuming hybrid is the way to go, but I don't think it's the only option (not saying it's not the best).
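To make the "hybrid" option concrete, here is a minimal sketch (all names invented, not a proposal) of an API where the same method-chaining syntax runs in either eager or lazy mode, with only a mode flag deciding when work actually happens:

```python
# Toy "hybrid" dataframe: identical syntax, eager or lazy execution.
class Frame:
    def __init__(self, rows, lazy=False):
        self._rows = rows   # concrete data (list of dicts)
        self._ops = []      # deferred operations, used only in lazy mode
        self._lazy = lazy

    def select(self, *cols):
        return self._apply(lambda rows: [{c: r[c] for c in cols} for r in rows])

    def filter(self, pred):
        return self._apply(lambda rows: [r for r in rows if pred(r)])

    def _apply(self, op):
        if self._lazy:
            out = Frame(self._rows, lazy=True)
            out._ops = self._ops + [op]   # just extend the plan
            return out
        return Frame(op(self._rows))      # run immediately

    def collect(self):
        # No-op in eager mode; executes the recorded plan in lazy mode.
        rows = self._rows
        for op in self._ops:
            rows = op(rows)
        return rows

rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
eager = Frame(rows).filter(lambda r: r["x"] > 1).select("y").collect()
lazy = Frame(rows, lazy=True).filter(lambda r: r["x"] > 1).select("y").collect()
assert eager == lazy == [{"y": 4}]
```

The point of the sketch is only that hybrid is implementable; whether the ergonomics are acceptable is exactly the trade-off under discussion.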
I would like to add that vaex sits somewhere in between:
```python
import vaex

df = vaex.example()

# lazy element-wise operations
df['y'] = df.x**2  # just the expression gets added
display(df['y'])   # will compute/show only the first and last 5 rows

# lazy filtering
dff = df[df['x'] > 1]  # no explicit compute() needed

# half-lazy join
df.join(df, 'x', rsuffix='r', allow_duplication=True)  # eagerly calculates the lookup, lazily fetches the values
```
So in usage, it seems eager, but a lot is done lazily behind the scene.
I think my point is that even if you have an eager-acting system, it can do a lot of things lazily, and it could build computational graphs/DAGs/plans behind the scenes to automatically generate pipelines, for instance (like what we do in Vaex).
Thanks @maartenbreddels, that's a very good point. I clarified my comment above to note that.
My main point is that there may be API implications of the decision made regarding laziness. If we totally ignore it and decide once the API is finished, there will be limitations. Maybe they are not important enough, and it's fine to postpone the discussion and keep things simpler for now. But I don't think the user API and eagerness are fully independent topics.
My personal opinion here is that adding explicit compute or execute leaks execution details to the user. Separating the execution details from the API is going to be important because systems may want to control when computation is actually triggered, like in the case @maartenbreddels showed with Vaex.
I think we can get away without needing an explicit computation trigger word: lazy systems can just trigger computation when an API calls for the dataframe to be converted to an external format (e.g. array, csv file, `__repr__`, etc.). This is how laziness is handled in Modin (preliminarily).
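As a toy illustration of that idea (a sketch only, not Modin's actual implementation), a lazy container can record operations and materialize only at "export" points such as conversion to a list or `__repr__`:

```python
# Minimal lazy container that computes only when data leaves the lazy world.
class LazyFrame:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map_col(self, fn):
        # Defer: just record the operation, touch no data.
        return LazyFrame(self._data, self._ops + [fn])

    def _compute(self):
        data = self._data
        for fn in self._ops:
            data = [fn(v) for v in data]
        return data

    # "Export" points force materialization:
    def to_list(self):
        return self._compute()

    def __repr__(self):
        return f"LazyFrame({self._compute()!r})"

lf = LazyFrame([1, 2, 3]).map_col(lambda v: v * 10)
assert lf._ops                    # plan recorded, nothing computed yet
assert lf.to_list() == [10, 20, 30]
```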
In Ibis automatic lazy execution is configurable:
https://github.com/ibis-project/ibis/blob/master/ibis/expr/types.py#L25
If you `__repr__` an expression object in a Jupyter notebook, for example, then it executes automatically (rather than having to type `expr.execute()`). I think it's important to support both modes, since you don't want your library firing off expensive queries to BigQuery by accident.
Interesting to hear Devin & Wes. I wasn't totally sure we could get away with it, but if 3 libraries are doing this already (Vaex, Modin and Ibis), I guess it is demonstrated to be possible.
Then I guess the next question is: Should lazy execution be something we should consider explicitly in the API design? Or can this be an implementation detail of the libraries?
This is one of the central questions of this whole project: does the API expose execution semantics or not? The pandas API in general exposes eager execution semantics. This is a significant decision with many implications. My design work with Ibis was an exploration of a data frame library without exposing any eager execution semantics.
In R people use both kinds of APIs, ones that expose execution semantics (base R, data.table) and ones that don't (dplyr).
I don't necessarily agree that the pandas API itself is what exposes eager execution semantics; instead, it is the implementation. It is not too difficult to intercept API calls under `__getattribute__` and build a DAG (this is what Modin does). With `print(df.expensive().head())`, users don't care whether the entire `expensive()` call is fully executed before they see the top 5 lines, especially in EDA/data cleaning.
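To illustrate the interception idea (a toy sketch only; Modin's real mechanism is considerably more involved), a proxy object can turn attribute access into nodes of a plan instead of executing anything:

```python
# Record method calls into a linked plan instead of executing them.
class Node:
    def __init__(self, name, args, parent=None):
        self.name, self.args, self.parent = name, args, parent

class Proxy:
    def __init__(self, node=None):
        self._node = node

    def __getattr__(self, name):
        # Any unknown attribute becomes a call recorder.
        def record(*args):
            return Proxy(Node(name, args, parent=self._node))
        return record

    def plan(self):
        # Walk the plan from leaf back to root, then reverse for display.
        ops, node = [], self._node
        while node is not None:
            ops.append(node.name)
            node = node.parent
        return list(reversed(ops))

p = Proxy().read("data.csv").groupby("col").sum()
assert p.plan() == ["read", "groupby", "sum"]   # nothing was executed
```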
I don't think we should support an explicit execute or compute or persist in the API. It leaks too much to the user. I would prefer to have full control over when things are executed, not rely on the user to give their best guess about when there is enough information. Ideally users should not have to think about execution, it should just work.
It is not too difficult to intercept API calls under `__getattribute__` and build a DAG (this is what Modin does).
IMHO, just because you can do this does not mean that you should. If the barrier to entry of implementing the API is a significant amount of metaprogramming (regardless of whether there is an example implementation of it or not), that does not strike me as necessarily a good thing.
I don't think we should support an explicit `execute` or `compute` or `persist` in the API. It leaks too much to the user. I would prefer to have full control over when things are executed, not rely on the user to give their best guess about when there is enough information. Ideally users should not have to think about execution, it should just work.
@devin-petersohn What about the pipeline execution? Don't you think that having the API that allows more granular control, or even execution scheduling of various portions of DAG is beneficial - should the DAG be exposed as a work item, like, e.g. in Airflow?
IMHO, just because you can do this does not mean that you should. If the barrier to entry of implementing the API is a significant amount of metaprogramming (regardless of whether there is an example implementation of it or not), that does not strike me as necessarily a good thing.
@wesm I think you misunderstand the point I was trying to make. The __getattribute__ approach I mention is just one way to handle building the DAG, you can also handle this on an API to API basis (as @maartenbreddels mentioned he does with Vaex) or implement each API to add itself to the DAG or be completely eager. My point was the pandas API itself does not imply eager.
@devin-petersohn What about the pipeline execution? Don't you think that having the API that allows more granular control, or even execution scheduling of various portions of DAG is beneficial - should the DAG be exposed as a work item, like, e.g. in Airflow?
@aregm I am imagining each dataframe system as a black box, similar to SQL systems. Exposing the actual DAG also comes with a new set of requirements and would require some additional API standards. I expect there is another can of worms involved in exposing this level of execution detail as well because some systems may not want to do it at all. I'm interested to hear your thoughts.
My point was the pandas API itself does not imply eager.
I don't know about this. By the same logic, no library has eager evaluation so long as API calls can be intercepted and translated into an AST.
Consider this pandas code:
```python
df[col][df[col].isnull()] = 5
result = df.groupby(col)[col].sum()
```
Yes, you can transform this into:
```
df <- Mutate(
    df,
    AssignWhere(
        ColumnRef(df, col),
        IsNull(ColumnRef(df, col)),
        Literal(5)
    )
)
result <- GroupBy(
    df, ColumnRef(df, col),
    Aggregate(ColumnRef(df, col)))
```
But was that the intention of the library creators? The execution semantics (especially when dealing with very large data sets) influence a great deal how people write and think about the code.
just because you can do this does not mean that you should
I agree those two are separate questions. The can question is answered with yes.
I don't think we should support an explicit `execute` or `compute` or `persist` in the API.
For a data science user, I agree. As a programmer/engineer user, I am not sure which is the right way.
Yes, the "can" vs. "should" distinction is a good one :)
So I think we're back to Ralph's original point that eager vs. lazy is primarily about performance, rather than API? Does anyone object to that?
@datapythonista you noted in https://github.com/pydata-apis/dataframe-api/issues/5#issuecomment-630136503
My main point is that there may be API implications on the decision made regarding laziness. If we totally forget about it, and decide once the API is finished, there will be limitations.
Can we delve into that a bit? In my experience, this is primarily an issue when we have value-dependent behavior, i.e. the output dtype / shape depends on the values in the dataframe, rather than just the input dtype / shape. We've been removing this type of behavior in pandas wherever possible, but in some cases it's unavoidable (e.g. `pivot(df, index="a", columns="b")` will return a DataFrame whose columns are the values in `b`). In practice, this just means that lazy systems will need to evaluate the values in `b` before proceeding with the rest of the computation.
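A plain-Python toy illustration of why pivot is value-dependent (a hypothetical helper, not the pandas implementation): the output schema can only be known after looking at the data, so a lazy engine must evaluate the `columns` field before it knows the result's columns:

```python
# Toy pivot: output columns come from *values*, not from input schema.
def pivot(rows, index, columns, values):
    # The result's columns are the distinct values observed in `columns`:
    # unknowable without evaluating the data.
    out_cols = sorted({r[columns] for r in rows})
    table = {}
    for r in rows:
        table.setdefault(r[index], {})[r[columns]] = r[values]
    return out_cols, table

rows = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "y", "c": 20},
    {"a": 2, "b": "x", "c": 30},
]
cols, table = pivot(rows, index="a", columns="b", values="c")
assert cols == ["x", "y"]   # schema depends on the data values
assert table == {1: {"x": 10, "y": 20}, 2: {"x": 30}}
```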
I think that if we agree that eager vs. lazy is (primarily) about performance then we can:
- Rule out a `.compute` / `.persist` API as part of the standard
- Try to avoid APIs that aren't sensible in either a lazy or eager system
The "can" vs "should" discussion is a bit of a derailment; not sure why we went there. This is not the place to discuss the merits of how implementations handle things, and it was an illustrative example. There's a better place for this type of discussion.
I don't know about this. By the same logic, no library has eager evaluation so long as API calls can be intercepted and translated into an AST.
Yes, this is my point. API and the underlying evaluation should be distinct and separate. It's not that no library has eager evaluation, it's that the API should not imply anything about execution.
But was that the intention of the library creators? The execution semantics (especially when dealing with very large data sets) influence a great deal how people write and think about the code.
You are the library creator in this case, so I will not try to guess your intention 😄. I think intention is not really the point. I think you make a great point here about how the scale of data influences how people write and think about code. I think this is a signal that there has been a failure in the design of the APIs of recent data systems. When I write SQL it will run on 10 rows or 1 trillion rows, I don't think about the size of the data in constructing the query. This is something I think modern data systems have mostly gotten wrong, that deep details of execution and data layout are pushed onto the user as points that they individually have to optimize for their particular workflow.
I don't think we should support an explicit `execute` or `compute` or `persist` in the API.
For a data science user, I agree. As a programmer/engineer user, I am not sure which is the right way.
As a general point: we don't have to figure out every last issue at once. I'd definitely punt on this in the first draft of the spec. One can use execute/compute methods, or decorators, or context managers, or whatever to express you want to do something lazy, or JIT-compiled, or on a GPU, or whatever. There'll be a number of such "out of scope for now" things.
the API should not imply anything about execution.
I think there's general agreement on adopting this principle.
- Rule out a .compute / .persist API as part of the standard
- Try to avoid APIs that aren't sensible in either a lazy or eager system
@TomAugspurger I like that approach!
If we assume point 1, this suggests a number of things to avoid that would be awkward in lazy mode, addressing your point 2:
- Avoid returning concrete Python types (int, list, bool, etc); instead return protocols/monkey types. Otherwise, we have to materialize to concrete Python types instead of keeping things lazy. So instead of saying "ndarray.shape returns a tuple of integers" we could say "ndarray.shape returns an object with a `__getitem__` that takes and returns integers and a `__len__`". Obviously this would need to be a bit more complete.
- Being clear about view/copy semantics. This would help for @wesm's example. So we should be clear for every argument to a function whether it is mutated or copied.
- Building in control flow, so we can avoid Python control flow. This is already the case in lots of Pandas and NumPy (e.g. `np.where`), so it shouldn't really be a problem. The point being that any Python control flow breaks lazy mode, because you cannot override it.
Anything else I am missing?
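One concrete reading of the first point above (a hypothetical sketch, not proposed spec text): a shape object that satisfies the `__getitem__`/`__len__` protocol without being a tuple, so each dimension can stay lazy until it is actually asked for:

```python
# A shape that quacks like a tuple but computes dimensions on demand.
class LazyShape:
    def __init__(self, dim_thunks):
        # Each dimension is a zero-argument callable, evaluated only on demand.
        self._dims = dim_thunks

    def __len__(self):
        return len(self._dims)       # number of dimensions is known statically

    def __getitem__(self, i):
        return self._dims[i]()       # compute only the requested dimension

shape = LazyShape([lambda: 1000, lambda: 5])
assert len(shape) == 2
assert shape[1] == 5                 # only the second dimension is computed
```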
I just want to make sure I follow the conversation, if we're talking about 'user' here, who is that?
As an author of a library consuming the dataframe, I feel that I need to know about the execution semantics to write efficient code.
When I write SQL it will run on 10 rows or 1 trillion rows, I don't think about the size of the data in constructing the query. This is something I think modern data systems have mostly gotten wrong, that deep details of execution and data layout are pushed onto the user as points that they individually have to optimize for their particular workflow.
I am very inexperienced with SQL systems, but when I write numerical code, the exact opposite is true.
I just want to make sure I follow the conversation, if we're talking about 'user' here, who is that?
Any code or library that consumes a dataframe I believe.
As an author of a library consuming the dataframe, I feel that I need to know about the execution semantics to write efficient code.
When you say "execution semantics", could you give an example? To me, the user (in this case you, a library author) should depend on the semantics of the operation, but should not depend on it having particular performance characteristics, because those could be very implementation dependent.
Well, I would want to know what is lazy and when something gets evaluated, for one, or whether something is a view or a copy, or potentially peak memory usage. I guess it depends a bit on what level of API we are talking about. sklearn partially works on C-pointer level, which will never be implementation independent.
However, there are parts (like scaling and one-hot encoding) that could reasonably be implemented on an abstract API. But even then, the implementation will probably depend quite a lot on the performance characteristics.
Well, I would want to know what is lazy and when something gets evaluated, for one, or whether something is a view or a copy, or potentially peak memory usage.
I would like to understand why you want to know these things. As the consumer of the result of a dataframe query, why is it important to know these details, outside of performance?
View vs copy is about semantics (it e.g. changes what a subsequent in-place operation does), so that's clearly something that must be defined.
I guess it depends a bit on what level of API we are talking about. sklearn partially works on C-pointer level, which will never be implementation independent.
Right, I think your questions are much more critical for low-level code. For a high level API it can still matter of course, however in many cases it doesn't. For example, Dask is lazy and does a pretty good job of providing APIs that match NumPy and Pandas; a lot of code consuming that API doesn't change when you use Dask.
For C/Cython, let's continue on the other thread where you brought it up; there was more discussion there.
As the consumer of the result of a dataframe query, why is it important to know these details, outside of performance?
I think I get @amueller's concern: from the point of view of a library like scikit-learn, one wants to provide algorithms that are optimal in performance (memory, computation speed). And those are more often than not dependent on the data structure implementation.
I think the answer to that is: a standard API will still help, because it makes code run across different array/dataframe libraries. For really optimal performance, one will then still have to add custom code paths. Whether that is worth doing is a performance vs. maintenance trade-off.
Honestly, I have a hard time even conceptualizing an iterative algorithm that uses lazy evaluation. If you look at tensorflow or pytorch, the optimizers work on the computational graph, i.e. they deeply inspect the internals and computational structure.
Tensorflow and pytorch basically implement one optimizer: stochastic gradient descent. Scikit-learn implements maybe 50 or so. If you have examples of what it looks like to implement iterative optimization with lazy data structures, I'd love to see them.
@rgommers sorry didn't see your reply, needed to reload the page. The low-level question is better discussed in the other thread maybe, but the question about how to write / think about iterative algorithms is somewhat separate to me. If people think it's possible to write this without knowing what triggers a compute in a lazy graph, I'm happy to be convinced, I just have a hard time wrapping my head around it.
I think this is a great point and am glad @datapythonista brought it up. I have seen this distinction with users of NumPy too. In that context, you have "interactive usage" vs. "library usage". Initially, NumPy was primarily used for interactive use. Over the years, it has become more used as a library. The personas of the people that rely on the tool want different things from the API depending on their primary use-case.
For interactive use, people want time-saving help and they want more "intelligence" from the tool like run-time hiding, robustness to inputs, and very little typing information.
For library use, developers typically want more clarity and less "magic" (where "magic" is a bit in the eye of the beholder in terms of what they are used to). In my experience, developers want type information, quick errors when inputs are wrong rather than robustness, and more exposure of run-time details they can take advantage of for speed.
For Python, we have both use-cases interweaving all the time. In practice, in my view, that means there are always at least 2 API categories: One that is more dev-centric and another that is more user-centric. For clarity, and now that packaging is in a better situation, two libraries are often more effective than one.
So, as we think about APIs, I agree in eventually separating them into at least 2 categories. But, I recommend we do that only as it becomes clear we need to because of different spec requirements for the different APIs that the data tell us are in regular use.
The low-level question is better discussed in the other thread maybe, but the question about how to write / think about iterative algorithms is somewhat separate to me. If people think it's possible to write this without knowing what triggers a compute in a lazy graph, I'm happy to be convinced, I just have a hard time wrapping my head around it.
Yes I think you're right, iterative algorithms are a bit out of the scope of this discussion - you can't just make those lazy. That mainly applies to a subset of array-based code typically written by library authors though; most end user code for both arrays and dataframes, and a good amount of library code as well, can be lazy just fine (otherwise Dask wouldn't get away with just copying the NumPy and Pandas APIs for example).
Hm, saying iterative algorithms are out of scope basically means most downstream libraries for numpy are out of scope, right? So you're basically saying any machine learning stuff is out of scope. If that's the framing that people want, I guess that's fine. Everybody just needs to be aware that this means sklearn is completely out, any probabilistic modelling and forecasting is out, any optimization, portfolios and planning, etc.
So it seems like we want to hit a sweet spot with this proposal to enable lazy usage, but also support existing users who write their algorithms imperatively. And along @teoliphant's point about multiple APIs, a high and a low level one, it might make sense for the high level one to try to mirror almost exactly some subset of NumPy's semantics and the lower level one to expose some more functional primitives, to enable higher performance lazy usage.
If people think it's possible to write this without knowing what triggers a compute in a lazy graph, I'm happy to be convinced, I just have a hard time wrapping my head around it.
@amueller IMHO this is implicitly defined by Python, if we assume that users of this API will be evaluating it under a default Python interpreter, not through a tool like Numba or Pythran, which seems like a reasonable assumption. Any builtin dunder methods that must return Python types have to be "eager", i.e. `__len__`, `__bool__`, `__iter__`... Which means that if you have a lazy implementation, any of those methods needs to actually do enough computation to return an accurate result. If they do that, then you are free to continue using the non-functional algorithms you listed in sklearn on lazy backends; they just would end up evaluating quite a lot. There are some other methods, like `__str__` or `__repr__`, that might not be specified in the standard, which means that lazy backends wouldn't have to evaluate on those if they did not want to.
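A minimal sketch of that point (names hypothetical): a lazy container whose Python dunders force just enough evaluation to return concrete values, so imperative code keeps working against a lazy backend:

```python
# Lazy series where dunders that must return Python types force evaluation.
class LazySeries:
    def __init__(self, thunk):
        self._thunk = thunk   # deferred computation producing a list
        self._cache = None

    def _compute(self):
        if self._cache is None:
            self._cache = self._thunk()
        return self._cache

    # These must return real Python values, so they trigger evaluation:
    def __len__(self):
        return len(self._compute())

    def __iter__(self):
        return iter(self._compute())

calls = []
s = LazySeries(lambda: calls.append("ran") or [1, 2, 3])
assert calls == []           # still lazy, nothing ran yet
assert len(s) == 3           # __len__ forced the computation
assert calls == ["ran"]
assert list(s) == [1, 2, 3]  # cached, no second run
```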
If there are functional ways of representing the iterative algorithms present in the array APIs we settle on, it might then be feasible for sklearn to move to them if they wanted a backend to be able to more holistically compile the algorithm, but they wouldn't be forced to.
Honestly, I have a hard time even conceptualizing an iterative algorithm that uses lazy evaluation.
Forcing all users to switch to fully lazy versions of iterative algorithms is definitely out of scope for this group, but I did want to highlight some prior work here: https://futhark-lang.org/ and https://www.acceleratehs.org/index.html. Also see the work Google is doing in AutoGraph, to convert imperative Python code into a functional XLA graph ("AutoGraph: Imperative-style Coding with Graph-based Performance").
Stella Laurenzo, who is at Google, is also working along this line, trying to see how to build a Python frontend for a NumPy MLIR dialect ("npcomp - An aspirational MLIR based numpy compiler").
To take a step back, what we need is a way for users to express what they want (in this case sklearn) and for the backend/implementation to execute that in a way that makes sense for them. The "issue" is that the way users want to express things is through iterative algorithms using Python's built-in control flow. So from a high level, either that means a) stop relying on Python's control flow and express everything in functional form in the DSL of the library (how TF worked before AutoGraph; more similar to Futhark or Accelerate, but as a DSL embedded in Python) or b) find ways to translate Python's control flow into a form that the backend can control (Numba or AutoGraph), so that users can continue using Python control flow but have it be performant. Cython is a shortcut around option b) that works for a particular backend.
But at the moment, as I said above, I think we can avoid making that choice here, if we support an outlet valve/fallback to support the existing APIs that allow imperative execution.
So what you're saying is that the user/library should have some control over whether something is evaluated lazily, which I think would be the easiest solution for supporting iterative algorithms, but which seemed to go against what @devin-petersohn was imagining, I think.
I don't have a strong opinion (or expertise) to come up with solutions here, I just wanted to bring up the importance of iterative algorithms in my part of the data science world and that I'm not sure how that would relate to lazy evaluation.
It sounds to me like we do need to make some choice here on how much control the user has.
.... so that users can continue using Python control flow but have it be performant.
Like you said, that's definitely out of scope. This is a really hard problem where all solutions involve compilers and are immature for the foreseeable future. For a standard or even simply adoption in something like scikit-learn, we're talking many years (if ever).
It sounds to me like we do need to make some choice here on how much control the user has.
I think very little or nothing is needed in the API here. It just doesn't apply to iterative algorithms - those are all in Cython/C/C++/Fortran.