Added T-Statistic calculation for linear model for feature request/issue #412

Open ardave opened this issue 9 years ago • 0 comments

I used this article for inspiration (scroll down to "Hypothesis Tests in Simple Linear Regression - t Tests").

Main Ideas:

A) I calculated the T-Statistic on a per-parameter basis (slope vs. y-intercept), which makes me realize that I probably should have done the same for the Standard Error calculation that I recently committed. Thoughts?
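For reference, here is a sketch of the per-parameter statistics as I understand them, assuming the usual textbook formulas for simple linear regression with the null hypothesis that each parameter is zero. The symbols are my own notation for illustration (n points, fitted values \hat{y}_i, estimated intercept \hat{\beta}_0 and slope \hat{\beta}_1) and may not match the article's exactly.

```latex
\begin{aligned}
s^2 &= \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2,
\qquad
S_{xx} = \sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 \\
t_{\text{slope}} &= \frac{\hat{\beta}_1}{\operatorname{se}(\hat{\beta}_1)}
 = \frac{\hat{\beta}_1}{\sqrt{s^2 / S_{xx}}} \\
t_{\text{intercept}} &= \frac{\hat{\beta}_0}{\operatorname{se}(\hat{\beta}_0)}
 = \frac{\hat{\beta}_0}{\sqrt{s^2\left(\dfrac{1}{n}+\dfrac{\bar{x}^2}{S_{xx}}\right)}}
\end{aligned}
```

Each statistic is just the estimate divided by its own standard error, which is why a per-parameter standard error seems like the natural companion to a per-parameter T-Statistic.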

B) I've again stuck with IEnumerables for the method arguments for consistency, but on further thought I might ask to refactor all methods in this class to take double[] instead, or at least to add array overloads and have the IEnumerable versions immediately convert their arguments to arrays and call them. The computations in this class iterate the entire sequences every time, with no potential for deferred execution, so arrays would make for more "honest" method signatures, and would allow much simpler iteration as well as up-front length checking, etc. Also, as I understand it, a for loop over an array might perform slightly better over very large data sets, though I have not tested this myself.

C) Biggest thought: The method signature for calculating this T-Statistic seems a bit gnarly, and seems to require quite a bit more knowledge on the part of the user than perhaps should be necessary. Would you perhaps like me to forgo the code in this pull request entirely in favor of a "linear model object"-type solution as can be found in other numerical analysis packages?

One example of a "linear model object" strategy can be seen in the lm function in R. You provide your inputs to the constructor/creation function of the lm object, and the result is an object that holds the fitted results, such as the coefficients, residuals, and fitted values. Additionally, this model object knows how to predict further output values using the parameters it fitted during its construction, as long as the new inputs are passed to its prediction function in the same format as the initial training data.

This would be a bit of a departure from the current style of functionality provided by this project. As it stands, though, to get a T-Statistic the user has to understand three separate, intricate, and connected steps: fitting the line, using the fitted line parameters to create new predictions, and then passing all of the information from steps 1 and 2 into the T-Statistic function. I fear the situation will be similar when I go on to add the F-Statistic calculation, P-value, etc.

Additionally, named object properties can improve the user experience and code readability, as compared to retrieving .Item1, .Item2, etc. from a Tuple.
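To make the proposal concrete, here is a minimal sketch of what such a model object might look like. Everything in it is hypothetical: the type name SimpleLinearModel, its members, and the Fit/Predict methods are placeholders for discussion, not existing API in this project. It assumes ordinary least squares on a single predictor and the standard textbook standard errors, and it includes the IEnumerable-to-array overload idea from point B.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only; not part of the library's current API.
public sealed class SimpleLinearModel
{
    public double Intercept { get; }
    public double Slope { get; }
    public double InterceptStandardError { get; }
    public double SlopeStandardError { get; }

    // Per-parameter t-statistics: estimate divided by its standard error.
    public double InterceptTStatistic => Intercept / InterceptStandardError;
    public double SlopeTStatistic => Slope / SlopeStandardError;

    private SimpleLinearModel(double intercept, double slope,
                              double interceptSe, double slopeSe)
    {
        Intercept = intercept;
        Slope = slope;
        InterceptStandardError = interceptSe;
        SlopeStandardError = slopeSe;
    }

    // Sequence overload: materialize once, then delegate to the array overload.
    public static SimpleLinearModel Fit(IEnumerable<double> x, IEnumerable<double> y)
        => Fit(x.ToArray(), y.ToArray());

    public static SimpleLinearModel Fit(double[] x, double[] y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("x and y must have the same length.");
        if (x.Length < 3)
            throw new ArgumentException("At least three points are needed (n - 2 degrees of freedom).");

        int n = x.Length;
        double meanX = x.Average();
        double meanY = y.Average();

        double sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++)
        {
            double dx = x[i] - meanX;
            sxx += dx * dx;
            sxy += dx * (y[i] - meanY);
        }
        if (sxx == 0.0)
            throw new ArgumentException("All x values are identical; the slope is undefined.");

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;

        // Residual sum of squares and the residual variance estimate.
        double sse = 0.0;
        for (int i = 0; i < n; i++)
        {
            double residual = y[i] - (intercept + slope * x[i]);
            sse += residual * residual;
        }
        double s2 = sse / (n - 2);

        double slopeSe = Math.Sqrt(s2 / sxx);
        double interceptSe = Math.Sqrt(s2 * (1.0 / n + meanX * meanX / sxx));
        return new SimpleLinearModel(intercept, slope, interceptSe, slopeSe);
    }

    // Predict a new output from the fitted parameters.
    public double Predict(double x) => Intercept + Slope * x;
}
```

With something like this, the three steps collapse into one call: `var model = SimpleLinearModel.Fit(xs, ys);` followed by `model.SlopeTStatistic` or `model.Predict(6.0)`, and the caller never touches .Item1 or .Item2. An F-Statistic or P-value could later be added as further properties without changing any method signature.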

I'd be interested to hear what your thoughts are.

Jan 05 '17 06:01 ardave