RJ Nowling

Results: 23 issues by RJ Nowling

Implement Kendall's Tau, a measure of ordinal association. Ping @erikerlandson -- do you have an implementation sitting around you could easily make into a PR? :)
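
For reference, here is a minimal sketch of what is being requested, written in plain Python, not Silex code. It computes the tau-a variant, which has no tie correction: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The function name `kendall_tau_a` is hypothetical, and the quadratic pairwise loop is for illustration only. A distributed implementation would likely need a different strategy.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a for two equal-length sequences (hypothetical helper).

    tau_a = (concordant - discordant) / (n * (n - 1) / 2)
    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    concordant = 0
    discordant = 0
    # Compare every unordered pair of observations once: O(n^2).
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))  # identical ordering
    print(kendall_tau_a([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed ordering
```

Identical orderings give 1.0 and fully reversed orderings give -1.0. Data with many ties would usually call for the tau-b variant instead.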

When I tried to use the `sbt` script to build `silex`, the script reported an error about retrieving `sbt-launch.jar`. I noticed that it tried to use the old `artifactoryonline.com` repo....

```
[info] SplitSampleSpec:
[info] - should provide splitSample with integer argument
[info] - should provide weightedSplitSample with weights argument *** FAILED ***
[info] false was not true (split.scala:62)
```

https://travis-ci.org/willb/silex/jobs/119218111...

`IIDFeatureSamplingMethodsRDDSpec` produces warnings about containing large tasks. These warnings should be suppressed by reducing the logging level, which would make the test output easier to read.

[Cramer's V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) is a measure of association between nominal (categorical) variables. Useful for feature selection, comparing clusterings, potentially evaluating splits in Decision Trees trained on purely categorical data, etc.
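
To make the definition concrete, here is a minimal sketch in plain Python, not Silex code. It computes the uncorrected statistic V = sqrt(chi^2 / (n * min(r - 1, c - 1))), where chi^2 is Pearson's chi-squared statistic of the r-by-c contingency table and n is the number of observations. The function name `cramers_v` is hypothetical, and no bias correction is applied.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramer's V for two equal-length label sequences
    (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    rows = Counter(xs)            # row marginals
    cols = Counter(ys)            # column marginals

    k = min(len(rows), len(cols)) - 1
    if n == 0 or k == 0:
        raise ValueError("each variable needs at least two distinct values")

    # Pearson's chi-squared: sum over all cells of (O - E)^2 / E.
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * k))


if __name__ == "__main__":
    print(cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"]))  # perfect association
    print(cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"]))  # no association
```

The result ranges from 0 (no association) to 1 (each variable fully determines the other).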

Silex has several utilities such as `PerTestSparkContext` which make writing unit tests for Spark applications easier. Could Silex provide similar utilities to the apps using it? If so, what should...

For many of the use cases we've seen, parallelism per file rather than per line is sufficient performance-wise and makes parsing files easier. (e.g. JSON, CSV, etc. file formats which...

Custom partitioning can make certain operations easier (e.g., grouping data to control mapping between data and files). We should evaluate the space of how custom partitioning can be used...

Current implementation of JSON schema transformation only supports RDDs. We should support DataFrames, too.

The links to the Wikipedia page on N-grams and the L-p vector normalization are not being interpreted correctly on the TextFeaturizingEstimator page: ```markdown This estimator gives the user one-stop solution...