
FAQs for working with CLVTools

mmeierer opened this issue 4 years ago · 10 comments

Draft FAQs for CLVTools, based on questions that were submitted to the team in recent months.

This issue collects common user questions together with possible answers. Later, we will set up a separate page for these on our website.

mmeierer · Jun 16 '21 07:06

How does one “predict” for times in the future? The code in the walkthrough seems to only predict against a holdout that is already known, and is therefore not really forward-looking.

To use all data for your model, specify NULL for the argument estimation.split when setting up the clvdata() object: https://rdrr.io/github/bachmannpatrick/CLVTools/man/clvdata.html:

"estimation.split May be specified as either the number of periods since the first transaction or the timepoint (either as character, Date, or POSIXct) at which the estimation period ends. The indicated timepoint itself will be part of the estimation sample. If no value is provided or set to NULL, the whole dataset will used for fitting the model (no holdout sample)."

This model uses all available information for a cohort. Then, to predict into the future, use the argument prediction.end in the predict() command. Please see this link for more information: https://rdrr.io/github/bachmannpatrick/CLVTools/man/predict.clv.fitted.transactions.html:

"prediction.end indicates until when to predict or plot and can be given as either a point in time (of class Date, POSIXct, or character) or the number of periods. If prediction.end is of class character, the date/time format set when creating the data object is used for parsing. If prediction.end is the number of periods, the end of the fitting period serves as the reference point from which periods are counted. Only full periods may be specified. If prediction.end is omitted or NULL, it defaults to the end of the holdout period if present and to the end of the estimation period otherwise.

The first prediction period is defined to start right after the end of the estimation period. If for example weekly time units are used and the estimation period ends on Sunday 2019-01-01, then the first day of the first prediction period is Monday 2019-01-02. Each prediction period includes a total of 7 days and the first prediction period therefore will end on, and include, Sunday 2019-01-08. Subsequent prediction periods again start on Mondays and end on Sundays. If prediction.end indicates a timepoint on which to end, this timepoint is included in the prediction period."
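Putting the two together, a minimal sketch using the apparelTrans sample data that ships with CLVTools (the 26-week horizon is illustrative):

```r
library(CLVTools)
data("apparelTrans")

# estimation.split = NULL: the full dataset is used for fitting (no holdout)
clv.apparel <- clvdata(apparelTrans,
                       date.format      = "ymd",
                       time.unit        = "week",
                       estimation.split = NULL,
                       name.id          = "Id",
                       name.date        = "Date",
                       name.price       = "Price")

est.pnbd <- pnbd(clv.apparel)

# Predict 26 full periods (here: weeks) beyond the end of the fitting period
predict(est.pnbd, prediction.end = 26)
```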

mmeierer · Jun 16 '21 07:06

What is the difference between CET and DERT?

CET is the (undiscounted) conditional expected number of transactions in a finite prediction period, while DERT discounts expected residual transactions from the end of the estimation period until infinity. The discount factor is therefore one of the driving factors for the difference between CET and DERT. For example, assuming a yearly discount rate of 5% (10%), the present value of 1 USD in year 5 is 0.78 USD (0.62 USD). Thus, besides the value of the discount factor, the length of the prediction period matters when looking at the difference between CET and DERT.
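The present values above follow from simple discounting; a quick check in R:

```r
# Present value of 1 USD received in year 5
1 / (1 + 0.05)^5  # ~0.78
1 / (1 + 0.10)^5  # ~0.62
```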

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/166

mmeierer · Jun 16 '21 08:06

How to evaluate the performance of a CLV model?

Let's first look at a brief explanation of how CLV is defined for the latent attrition models in CLVTools:

$$\text{CLV} = E(\text{Spending}) \cdot \int_{T}^{\infty} t(t) \, S(t) \, d(t-T) \, \mathrm{d}t$$

As you can see, CLV is calculated from the estimation end (T) until infinity and accounts for the probability of being alive (S) and the expected number of future transactions (t(t)), and discounts all of this (d), so that less and less emphasis is put on timepoints further in the future.

The expression in the integral is called DERT, while E(Spending) is the mean spending per transaction, which is assumed to be constant in the future. CLV is therefore calculated as DERT * predicted.mean.spending. DERT is an output from the pnbd() model, and predicted.mean.spending comes from the gg() model, which by default is automatically fit when predicting.

See ?predict.clv.fitted.transactions for an explanation of what the different columns in the predictions are.

As you can see from the above definition, CLV as a concept is hard to evaluate because you cannot actually observe a customer until infinity (you cannot even observe customer "death").

But you can evaluate it on the holdout data to get an impression of how good your estimate is. You can compare:

(1) The predicted vs the actual number of transactions: the predicted number of transactions in the prediction period is CET, which can be compared against actual.x, the number of observed transactions during the prediction period.

(2) The predicted vs the actual value of a customer during the holdout period: the predicted number of transactions * predicted mean spending per transaction (CET * predicted.mean.spending = the customer's predicted value in the prediction period) can be compared against actual.total.spending.

Based on these predicted vs actual values, you can calculate popular performance measures such as RMSE, MAE, or MAPE, each of which of course has its own strengths and weaknesses.
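A sketch of such an evaluation, using the apparelTrans sample data (the 40-week split is illustrative):

```r
library(CLVTools)
data("apparelTrans")

# Hold out everything after week 40 for validation
clv.apparel <- clvdata(apparelTrans, date.format = "ymd",
                       time.unit = "week", estimation.split = 40)
est.pnbd <- pnbd(clv.apparel)

# With a holdout period, predict() defaults to the end of the holdout
# and includes the actuals (actual.x, actual.total.spending)
pred <- predict(est.pnbd)

# (1) Predicted vs actual number of transactions
rmse.x <- sqrt(mean((pred$CET - pred$actual.x)^2))
mae.x  <- mean(abs(pred$CET - pred$actual.x))

# (2) Predicted vs actual customer value in the holdout period
pred.value <- pred$CET * pred$predicted.mean.spending
rmse.value <- sqrt(mean((pred.value - pred$actual.total.spending)^2))
```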

Comparing PAlive against actually being alive (actual.x > 0) is a little trickier. Something like a ROC curve / AUC can work for this.
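For example, with the pROC package (a third-party package, not part of CLVTools), continuing from the sketch above:

```r
# `pred` as computed in the previous sketch via predict(est.pnbd)
library(pROC)

roc.palive <- roc(response  = as.integer(pred$actual.x > 0),
                  predictor = pred$PAlive)
auc(roc.palive)
```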

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/150

mmeierer · Jun 16 '21 08:06

Should we use all the customer data at once or split the data by cohorts?

Usually the models implemented in CLVTools are fitted on cohort-level data.

Cohorts are groups of customers defined by the time they joined your company (= made their first purchase) and are sometimes further split by acquisition channel or type of customer (B2C vs B2B). It is assumed that customers in different cohorts differ substantially from each other (e.g., early adopters vs late adopters, life-cycle stage, etc.). For more information on cohort analysis, please see here: https://en.wikipedia.org/wiki/Cohort_analysis

Hence, you should first divide your data into cohorts (by time of acquisition) and fit a separate Pareto/NBD model on each cohort, as in the sketch below.
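A sketch of cohort-wise fitting, assuming transaction data with columns Id, Date, and Price (as in the apparelTrans sample data) and illustrative quarterly cohorts:

```r
library(CLVTools)
library(data.table)
data("apparelTrans")

trans <- as.data.table(apparelTrans)

# Assign each customer to a cohort by the quarter of their first purchase
trans[, cohort := paste0(format(min(Date), "%Y"), "-", quarters(min(Date))),
      by = "Id"]

# Fit a separate Pareto/NBD model per cohort (very small cohorts may fail to fit)
fitted.by.cohort <- lapply(split(trans, by = "cohort"), function(dt.cohort) {
  clv.cohort <- clvdata(dt.cohort, date.format = "ymd", time.unit = "week",
                        estimation.split = NULL)
  tryCatch(pnbd(clv.cohort), error = function(e) NULL)
})
```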

The best time span to define a cohort depends on your specific data set and industry. It depends largely on how many customers you acquire per day/week/month/quarter/year. Common choices are 1 month, 3 months, or 1 year.

To assess the goodness of fit and prediction quality, you may want to split the cohort-wise data into an "estimation" and a "holdout" period using the argument estimation.split in clvdata(). The model is then only fit on the data in the estimation period and can be compared to the actual data in the holdout period.

Once you are happy with your model fit and want to measure/predict the current CLV, standing at the last possible timepoint (= "now"), you fit the model on all the data that is available for this cohort and then predict any period ahead. For the standard pnbd model, CLV is calculated as DERT * mean.spending.per.transaction and is therefore independent of the prediction horizon. This is because, by definition, DERT is an analytical expression from the end of the fitting period until infinity. (Because DERT starts at the end of the fitting period, this is also why you fit the model on all of the cohort's data for the final CLV assessment, without any holdout period.)
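A small sketch of this horizon independence, assuming a fitted model est.pnbd as in the earlier sketches: the predicted.CLV column does not change with prediction.end.

```r
# predicted.CLV = DERT * predicted.mean.spending, independent of the horizon
pred.10 <- predict(est.pnbd, prediction.end = 10)
pred.52 <- predict(est.pnbd, prediction.end = 52)

all.equal(pred.10$predicted.CLV, pred.52$predicted.CLV)    # TRUE
all.equal(pred.10$predicted.CLV,
          pred.10$DERT * pred.10$predicted.mean.spending)  # TRUE
```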

Because there is less data available for some cohorts (earlier cohorts have more data, later cohorts have less), there may be different confidence in the estimated parameter values across cohorts.

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/146

mmeierer · Jun 16 '21 08:06

What does the following error message mean: "Hessian could not be derived. Setting all entries to NA."?

This issue is likely related to the optimization method that is used to maximize the likelihood function. The error is therefore not really about the Hessian that could not be derived, but more generally about the estimation failing.
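If you run into this, it can help to try a different optimization method or different start parameters. A sketch, assuming a clvdata object clv.apparel (the method and start values are illustrative):

```r
# Pass a different optimizer through to optimx::optimx()
est.pnbd <- pnbd(clv.apparel,
                 optimx.args = list(method = "Nelder-Mead"))

# Or provide custom start parameters for the model
est.pnbd <- pnbd(clv.apparel,
                 start.params.model = c(r = 1, alpha = 1, s = 1, beta = 1))
```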

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/132

mmeierer · Jun 16 '21 08:06

What does the following error message mean: "Error: The estimation split is too short! Not all customers of this cohort had their first actual transaction until the specified estimation.split!"?

Customers which are not alive during the estimation period (i.e., they are not part of the cohort) should also not be present when calculating summary statistics, plotting transactions, etc. for the holdout period, because they do not belong to this cohort. The way to ensure this is to remove them from the data (see the sketch below).

Choosing the estimation end is closely related to how you define your cohorts and to which cohort your customers belong. The holdout transaction data serves as the "validation" set on which the model's performance is evaluated for the cohort it was fit on.

The error is therefore related to how you define your cohorts.

We could have internally and automatically removed all customers which are not alive in the estimation period. However, we believe there needs to be transparency about this (i.e., customers should not be "swallowed" without the user noticing), so we leave it to the user to remove them.
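A sketch of such a filter, assuming transaction data with columns Id and Date (as in the apparelTrans sample data) and an illustrative split date:

```r
library(CLVTools)
library(data.table)
data("apparelTrans")

trans <- as.data.table(apparelTrans)
split.date <- as.Date("2005-10-01")  # illustrative estimation split date

# Identify customers whose first transaction falls within the estimation period
first.purchase <- trans[, .(first.date = min(Date)), by = "Id"]
keep.ids <- first.purchase[first.date <= split.date, Id]

# Keep only the transactions of customers belonging to this cohort
trans.cohort <- trans[Id %in% keep.ids]
```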

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/101

mmeierer · Jun 16 '21 08:06

I keep getting "Warning: The KKT optimality conditions are not both met for the fitted gg spending model", but it does predict the CLV. Does this mean the predicted CLV is NOT reliable?

The KKT criteria are measures to determine whether the optimizer has reached a global optimum. These metrics are not perfect, but they provide some guidance. If in doubt, you can compare results across different start parameters or optimization methods (see the answer on the Hessian error above).

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/165

mmeierer · Jun 16 '21 08:06

I have three years of transactional data for customers and I am using the CLVTools package (which BTW is awesome) to calculate CLV. I have included a screenshot of sample output. For a large number of customers, DERT is NA (and consequently CLV is NA). I use a discount factor of 7%. Any initial ideas what I might be doing wrong?

This happens mostly because the estimated parameters, likely beta, are very large. This then leads to numerical stability issues when calculating the hypergeometric 1F1 function required for DERT.
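To check whether this applies to your model, inspect the estimated parameters (a sketch, assuming a fitted model est.pnbd):

```r
# A very large beta is a common cause of numerical problems in the
# 1F1 function used for DERT
coef(est.pnbd)
summary(est.pnbd)
```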

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/163

mmeierer · Jun 16 '21 08:06

When installing CLVTools, I got this error:

configure: error: gsl-config not found, is GSL installed?
ERROR: configuration failed for package ‘RcppGSL’

What to do?

To compile the package from source, please be advised that CLVTools relies on an external C++ library called GSL (the GNU Scientific Library). This library has to be installed on your computer before CLVTools can be compiled from source. Follow these 3 steps:

  1. Update to the latest version of R.

  2. Install the external dependency (GSL), for example via brew install gsl on macOS or sudo apt-get install libgsl-dev on Debian/Ubuntu.

  3. Install the development version from source:
devtools::install_github("bachmannpatrick/CLVTools", ref = "development")

mmeierer · Jun 18 '21 10:06

How to define dynamic (and static) covariates?

Please have a look at the following document (page 10): https://cran.r-project.org/web/packages/CLVTools/vignettes/CLVTools.pdf

At the bottom of that page you will find an example of how to define a dynamic covariate (marketing) and, additionally, some static covariates. Be sure to check out how the object apparelDynCov is structured; it provides a pretty straightforward template for how to prepare your data.
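A sketch following the vignette's example, using the apparelDynCov sample data (the covariate names and the 40-week split are those of the vignette example):

```r
library(CLVTools)
data("apparelTrans")
data("apparelDynCov")

clv.apparel <- clvdata(apparelTrans, date.format = "ymd",
                       time.unit = "week", estimation.split = 40)

# Attach weekly covariates for both the lifetime and the transaction process
clv.dyn <- SetDynamicCovariates(clv.data        = clv.apparel,
                                data.cov.life   = apparelDynCov,
                                data.cov.trans  = apparelDynCov,
                                names.cov.life  = c("Marketing", "Gender", "Channel"),
                                names.cov.trans = c("Marketing", "Gender", "Channel"),
                                name.id         = "Id",
                                name.date       = "Cov.Date")

# Fit a Pareto/NBD model with dynamic covariates (this can take a while)
est.pnbd.dyn <- pnbd(clv.dyn)
```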

See discussion in https://github.com/bachmannpatrick/CLVTools/issues/178

mmeierer · Jul 14 '21 18:07