spark-deep-learning

NLP support

Open allwefantasy opened this issue 8 years ago • 5 comments

What changes are proposed in this pull request?

This PR adds a Spark MLlib Estimator API that can be integrated with TensorFlow code, along with a reference implementation. It also adds a Spark MLlib Transformer that converts a text column to a 2-D vector, which can be fed to a CNN/LSTM directly.
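As a rough illustration of what "text column to 2-D vector" can mean, here is a minimal pure-Python sketch. It is not the code in this PR: the function name `text_to_matrix`, the parameters `seq_len` and `dim`, and the choice to map unknown words to a zero vector are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def text_to_matrix(tokens: Sequence[str],
                   word_vectors: Dict[str, List[float]],
                   seq_len: int,
                   dim: int) -> List[List[float]]:
    """Map tokens to vectors and pad/truncate to a fixed (seq_len, dim) shape."""
    zero = [0.0] * dim
    # Truncate long texts; unknown words fall back to the zero vector
    # (an assumption of this sketch, not necessarily what the PR does).
    rows = [list(word_vectors.get(tok, zero)) for tok in tokens[:seq_len]]
    # Pad short texts with zero rows so every text has the same shape.
    rows.extend(list(zero) for _ in range(seq_len - len(rows)))
    return rows


# Hypothetical toy embedding table with 2-dimensional vectors.
toy_vectors = {"spark": [1.0, 0.0], "rocks": [0.0, 1.0]}
matrix = text_to_matrix("spark rocks hard".split(), toy_vectors, seq_len=4, dim=2)
```

Every text ends up with the same `seq_len` x `dim` shape, which is what a CNN or LSTM input layer typically expects.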

It gives a taste of how to process text in a DataFrame and use it to train an NLP model developed with TensorFlow. It also fixes this issue: https://github.com/databricks/spark-deep-learning/issues/53

The changes consist of these components.

  • TFTextFileEstimator/TFTextTransformer
  • New shared params

How is this patch tested?

  • Unit tests
  • Manual tests

allwefantasy, Oct 13 '17 10:10

Thank you for contributing to the project @allwefantasy!

phi-dbq, Oct 14 '17 01:10

The CI fails because there is no Kafka library in the environment. Is there something I can do to fix this?

allwefantasy, Oct 14 '17 02:10

@allwefantasy thank you very much for the contribution. I will have more comments for the estimator, so would you mind splitting your PR into the transformer part and into the estimator?

Also, I see that the transformer is embedding Word2Vec. Have you considered chaining them in a pipeline instead? https://spark.apache.org/docs/2.1.1/ml-pipeline.html

Regarding kafka, you should be able to add it in this file: https://github.com/databricks/spark-deep-learning/blob/master/python/requirements.txt

thunterdb, Oct 16 '17 22:10

@thunterdb TFTextTransformer is a tool like StringIndexer in MLlib: we use it to transform the DataFrame and then feed the new DataFrame to TFTextFileEstimator. There seems to be no need to split this into two PRs.

Word2Vec is used only to compute a map from each word to its vector. We do not need the Word2Vec model's transform function.
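To illustrate the distinction, here is a minimal pure-Python sketch. The helper names (`build_word_map`, `average_vector`, `stacked_vectors`) are hypothetical and not taken from this PR. In PySpark the `(word, vector)` rows could presumably come from `Word2VecModel.getVectors()` after converting each vector to a plain list; that step is an assumption and is not executed here. The averaging helper is a simplification of what a Word2Vec-style transform does and does not reproduce Spark's exact handling of out-of-vocabulary words.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def build_word_map(rows: Iterable[Tuple[str, Sequence[float]]]) -> Dict[str, List[float]]:
    """Collect (word, vector) pairs into a plain lookup table."""
    return {word: list(vec) for word, vec in rows}


def average_vector(tokens: Sequence[str],
                   word_map: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """One vector per text, similar in spirit to a Word2Vec-style transform."""
    known = [word_map[t] for t in tokens if t in word_map]
    if not known:
        return [0.0] * dim
    # Average each dimension over the known words.
    return [sum(col) / len(known) for col in zip(*known)]


def stacked_vectors(tokens: Sequence[str],
                    word_map: Dict[str, List[float]]) -> List[List[float]]:
    """One vector per word, keeping word order: the 2-D form a CNN/LSTM needs."""
    return [word_map[t] for t in tokens if t in word_map]


# Hypothetical toy rows standing in for a trained model's vocabulary.
word_map = build_word_map([("deep", [1.0, 3.0]), ("learning", [3.0, 5.0])])
tokens = ["deep", "learning"]
averaged = average_vector(tokens, word_map, dim=2)
stacked = stacked_vectors(tokens, word_map)
```

The averaged form collapses the text into a single vector and loses word order, while the stacked form keeps one row per word. That is why only the word-to-vector map is needed here.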

allwefantasy, Oct 18 '17 07:10

Codecov Report

Merging #56 into master will decrease coverage by 3.53%. The diff coverage is 62.79%.


@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
- Coverage   82.82%   79.29%   -3.54%     
==========================================
  Files          23       25       +2     
  Lines        1217     1473     +256     
  Branches        5        5              
==========================================
+ Hits         1008     1168     +160     
- Misses        209      305      +96

| Impacted Files | Coverage Δ |
|---|---|
| python/sparkdl/transformers/keras_applications.py | 93.93% <100%> (+2.1%) :arrow_up: |
| python/sparkdl/transformers/utils.py | 100% <100%> (ø) :arrow_up: |
| python/sparkdl/transformers/named_image.py | 93.51% <100%> (ø) :arrow_up: |
| ...ython/sparkdl/estimators/tf_text_file_estimator.py | 48.02% <48.02%> (ø) |
| python/sparkdl/transformers/tf_text.py | 78.26% <78.26%> (ø) |
| python/sparkdl/param/shared_params.py | 80.88% <82.5%> (+0.67%) :arrow_up: |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 3f668d9...99d2b30.

codecov-io, Oct 18 '17 08:10