Request: Simplify cumulative_returns definition.
Description
On Quantopian, Alphalens has become more of a central figure as we have been running challenges where the submissions are made as Alphalens tearsheets. In most of these notebooks, the slowest step of running the notebook from top-to-bottom is generating the full Alphalens tearsheet. Recently, I did some profiling of create_full_tear_sheet to see if there were any relatively simple opportunities to speed it up. One component that immediately popped up was cumulative_returns, which was taking ~65% of the total time of running a full tearsheet, and > 75% of create_returns_tear_sheet.
I took a look at the cumulative_returns definition, and I was a little surprised by the complexity. Digging into it a bit, it seems that most of the complexity is a product of supporting periods that are shorter (i.e., faster) than the frequency of the provided returns data. I'm wondering if it would make sense to simplify the implementation of cumulative_returns and drop support for the case where the period is shorter than, or otherwise different from, the frequency of the returns data.
High-Level Suggestion
Without knowing more about the community that uses Alphalens outside of Quantopian, my first suggestion would be to drop support for the case where the period is less than the period of the returns data. I was surprised that this case was supported, mostly because I didn't realize that cumulative_returns was using interpolation to fill in data points that were required to compute the cumulative return for the specified period. Additionally, I think it might be a good idea to leverage the cumulative returns function in empyrical so that results are more likely to line up with other quant finance projects/tools.
By dropping support for the case where the period is less than the period of the returns data, and by implementing cumulative_returns in terms of cum_returns in empyrical, my expectation is that it will become easier to optimize the function for performance.
If there's still a desire to support computing cumulative returns with interpolated returns data, maybe it could be split into a separate function. I read through the code and I don't think I fully understand the current implementation, but I understand that this might be an important use case to some folks.
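To make the suggestion concrete, here's a minimal sketch of what the simplified path could look like when leaning on empyrical (the function name and signature are hypothetical, not the current Alphalens API):

import empyrical as ep
import pandas as pd

def simple_cumulative_returns(returns: pd.Series) -> pd.Series:
    """Compound one-period returns into a cumulative return curve.

    Assumes each entry of `returns` is the simple return for a single period
    at the series' own frequency -- no interpolation or sub-period handling.
    """
    # empyrical.cum_returns compounds (1 + r) across the series;
    # starting_value=1 makes the curve start at 1.0 instead of 0.0.
    return ep.cum_returns(returns, starting_value=1)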
Reproducible example: Download link: al_sample_data.csv
import pandas as pd
import cProfile
import pstats
from alphalens.tears import create_returns_tear_sheet
# Include if running in a Jupyter notebook.
%matplotlib inline
al_inputs = pd.read_csv('al_sample_data.csv', index_col=['date', 'asset'], parse_dates=True)
def run_returns_tear_sheet():
    create_returns_tear_sheet(al_inputs)
p = cProfile.Profile()
p.runcall(run_returns_tear_sheet)
p.dump_stats('returns_tearsheet_profile.stats')
stats = pstats.Stats('returns_tearsheet_profile.stats')
stats.sort_stats('cumtime').print_stats(20)
Output (note that actual runtime varies quite a bit between runs/machines, but the % breakdown of cumtime by function remains roughly the same):
Fri Jan 31 10:09:55 2020 returns_tearsheet_profile.stats
62029996 function calls (61402118 primitive calls) in 130.569 seconds
Ordered by: cumulative time
List reduced from 3735 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 130.587 130.587 <ipython-input-1-eb0a21d53746>:10(run_returns_tear_sheet)
1 0.001 0.001 130.587 130.587 /Users/jmccorriston/quant-repos/alphalens/alphalens/plotting.py:38(call_w_context)
1 0.038 0.038 130.566 130.566 /Users/jmccorriston/quant-repos/alphalens/alphalens/tears.py:165(create_returns_tear_sheet)
6 0.863 0.144 98.381 16.397 /Users/jmccorriston/quant-repos/alphalens/alphalens/performance.py:332(cumulative_returns)
1 0.000 0.000 80.395 80.395 /Users/jmccorriston/quant-repos/alphalens/alphalens/plotting.py:757(plot_cumulative_returns_by_quantile)
7 0.000 0.000 80.307 11.472 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/frame.py:6737(apply)
7 0.000 0.000 80.298 11.471 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:144(get_result)
7 0.001 0.000 80.297 11.471 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:261(apply_standard)
11 0.000 0.000 80.247 7.295 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:111(f)
7 0.000 0.000 56.973 8.139 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:297(apply_series_generator)
13608 0.093 0.000 47.133 0.003 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1188(__setitem__)
13608 0.069 0.000 46.881 0.003 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1191(setitem)
4536 0.136 0.000 46.233 0.010 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1261(_set_with)
4536 0.469 0.000 45.371 0.010 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1303(_set_labels)
22715/18179 0.485 0.000 43.101 0.002 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/indexes/base.py:2957(get_indexer)
4541 0.082 0.000 31.554 0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py:686(astype)
4536 0.035 0.000 30.160 0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py:706(astype)
4541 0.054 0.000 29.979 0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py:516(astype)
4541 0.023 0.000 29.849 0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py:346(_box_values)
4548 3.760 0.001 29.825 0.007 {pandas._libs.lib.map_infer}
Versions
- Alphalens version: 0.3.6
- Python version: 3.7.5
- Pandas version: 0.25.3
- Matplotlib version: 3.1.2
@luca-s - I'd love to get your thoughts on this!
@jmccorriston I believe you can safely go ahead with this change and simplify cumulative_returns. Originally, the cumulative returns code was something like daily_returns.add(1).cumprod().plot(...), which is pretty fast. The result is an approximation of cumulative returns that works well for 90% of use cases (I believe).
Just be aware that if you go back to that implementation you will lose the ability to (correctly) compute cumulative returns for:
- Periods longer than 1 day. It won't make sense to plot factor cumulative returns for any period that is not 1 day (see the paragraph "Note on cumulative return plots" here)
- Factor data with gaps
- Factor data with variable frequency
- Event study
I believe all of the above is fine, as you are interested in daily factors.
Thanks for the quick response, @luca-s!
To be clear, when you say that such a change would lose the ability to compute cumulative returns for periods longer than a day, do you mean weekly/monthly/etc factor data? I could definitely be wrong about this, but I was under the impression that factors with slower periods aren't yet supported given the requirement that the freq of the input's DateTimeIndex has to be Day, BDay, or CDay.
Do you have an example that runs with a slower period? My guess is I'm just misinterpreting the meaning of 'period' in your explanation.
I took a read through the tutorial and did another pass over the code and I think I understand the limitation. I think it's important to support the use case where the period > 1 day. I'll have to dig a bit more into the rate limiting steps in the cumulative_returns function to see how we can speed things up while still supporting this use case.
In the meantime, would it be sufficient to reframe the solution as taking the mean of the next N N-day cumulative returns to achieve a similar (same?) result as the subportfolio technique? For instance, if my factor is daily but I want the 5D returns, could I take the next 5 5-day means and average them? Apologies if this is the same as the current implementation. I'm trying to think about how we might be able to express this as a rolling computation instead of iterating over subportfolios (in case this makes things faster).
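Just to make that idea concrete, here's a rough sketch of the rolling computation I'm imagining (hypothetical helper; this may only approximate the subportfolio result):

import pandas as pd

def rolling_mean_period_returns(daily_returns: pd.Series, period: int = 5) -> pd.Series:
    # Compound the N-day return ending at each date from the daily returns.
    n_day_returns = (1 + daily_returns).rolling(period).apply(
        lambda window: window.prod(), raw=True
    ) - 1
    # Average the last N overlapping N-day returns, as a stand-in for holding
    # N overlapping subportfolios.
    return n_day_returns.rolling(period).mean()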
@jmccorriston my previous reply was not totally correct, but the matter is quite subtle and I didn't want to enter too much into the details...but I will do it now ;)
Initially, Alphalens supported only daily data: it assumed that factor_data was a daily-frequency dataframe (actually trading-day frequency: no weekends or public holidays) and that the prices dataframe followed the same assumption. periods was also assumed to mean days (e.g. periods=(1,3,5) meant 1-day, 3-day, and 5-day returns). Finally, cumulative returns were plotted only for the 1-day period.
Given those assumptions, the code `daily_returns.add(1).cumprod().plot(...)` computes the cumulative return correctly (almost: the returns are reported one day earlier than they should be, so Monday's returns are plotted on the previous Friday, Tuesday's returns on Monday, and so on. This is "ok" if you assume contiguous daily data; it's just a one-day shift error).
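A toy illustration of that one-day shift, assuming contiguous trading days (made-up numbers, not Alphalens code):

import pandas as pd

idx = pd.bdate_range('2020-01-06', periods=5)  # Mon .. Fri
# Each entry is the forward return earned over the *next* trading day.
daily_returns = pd.Series([0.010, -0.020, 0.005, 0.000, 0.010], index=idx)

# Naive compounding labels the return realized on Tuesday with Monday's date, etc.
naive_curve = daily_returns.add(1).cumprod()

# One hypothetical fix: shift the curve forward by one period so each point sits on
# the day the return was actually realized; the first point becomes the starting value.
aligned_curve = naive_curve.shift(1).fillna(1.0)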
The current code doesn't make any assumptions about the factor_data frequency. factor_data doesn't even have to have a frequency at all (like an event-study-based factor). Also, the prices dataframe doesn't have to have the same index as factor_data; it can have N prices for each entry in factor_data (e.g. look at this intraday factor).
Because of the above generalization the code became very complex.
If you'd like to simplify the cumulative_return function to daily_returns.add(1).cumprod(), then it will no longer depend on the period variable, that's it. It will still work with any factor frequency (daily, weekly, monthly, intraday), but it will not compute cumulative returns for periods longer than the factor frequency (in that case you would need to compute parallel portfolios and merge them, sketched below).
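A rough sketch of that parallel-portfolio idea for a daily factor traded on an N-day horizon (a hypothetical helper, not the current Alphalens implementation):

import pandas as pd

def parallel_portfolio_curve(period_returns: pd.Series, period: int) -> pd.Series:
    # Split the daily series of N-day returns into N subportfolios, each
    # rebalancing every N days but starting on a different day.
    curves = []
    for offset in range(period):
        sub = period_returns.iloc[offset::period]
        curves.append((1 + sub).cumprod())
    # Align the subportfolio curves on the full daily index and average them.
    combined = (
        pd.concat(curves, axis=1)
        .reindex(period_returns.index)
        .ffill()
        .fillna(1.0)
    )
    return combined.mean(axis=1)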
I know it is tricky, and maybe you are right to remove these bits of code even if it loses generality. Let me know if you need help with the code internals. I have a rough idea of what needs to be changed to simplify the cumulative_return function.
> In the meantime, would it be sufficient to reframe the solution as taking the mean of the next N N-day cumulative returns to achieve a similar (same?) result as the subportfolio technique? For instance, if my factor is daily but I want the 5D returns, could I take the next 5 5-day means and average them?
Unfortunately it is not mathematically identical. I don't know whether it could work as an approximation, though.
Thanks for the extra detail, Luca! I plan to take a crack at this on Tuesday next week. My plan is to try to implement it in terms of the cumulative returns function in empyrical, and possibly address the off-by-one error that you described above. I'm an average coder at best, so I'll ping you when I make progress in case I'm heading in a different direction from what you're envisioning.
@luca-s I spent some more time thinking about this and poking around the code base today. My tentative plan is to move some of the sub-portfolio logic into the utils module (is that the right technical term?). The way I think about it is that the performance module is responsible for computing metrics, the plotting module is responsible for plotting those metrics, and the tearsheet module groups sets of metrics and plots into 'analyses'.
My experience using Alphalens so far gives me the expectation (as a user) that everything in the performance module should take appropriately formatted factor and forward returns data as input. Any functionality or tooling that aims to get user data into the appropriate format for functions in the performance module should exist in utils (this was inspired by the fact that get_clean_factor_and_forward_returns and friends exist here). This way, the functions in performance can make stronger assumptions about the structure and content of the input data. Does that make sense to you?
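As a sketch of the split I have in mind (all names and the exact reshaping logic here are hypothetical):

import pandas as pd

# utils-style helper: owns reshaping/validation, so downstream code can assume
# a clean, single-frequency returns series.
def to_period_returns(factor_data: pd.DataFrame, column: str = '1D') -> pd.Series:
    # e.g. equal-weight the forward returns across assets for each date.
    return factor_data[column].groupby(level='date').mean().sort_index()

# performance-style function: assumes its input is already well-formed and
# only computes the metric.
def cumulative_returns(period_returns: pd.Series) -> pd.Series:
    return period_returns.add(1).cumprod()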