NLP support
What changes are proposed in this pull request?
This PR adds a Spark MLlib Estimator API that can be integrated with TensorFlow code, along with a reference implementation. It also adds a Spark MLlib Transformer that converts a text column to a 2-D vector that can be fed to a CNN/LSTM directly.
It gives a taste of how to process text in a DataFrame and use it to train an NLP model developed in TensorFlow. It also fixes issue https://github.com/databricks/spark-deep-learning/issues/53
The changes consist of these components:
- TFTextFileEstimator/TFTextTransformer
- New shared params
How is this patch tested?
- Unit tests
- Manual tests
Thank you for contributing to the project @allwefantasy!
The CI fails because there is no Kafka lib in the env. Is there something I can do to fix this?
@allwefantasy thank you very much for the contribution. I will have more comments for the estimator, so would you mind splitting your PR into the transformer part and into the estimator?
Also, I see that the transformer is embedding Word2Vec. Have you considered chaining them in a pipeline instead? https://spark.apache.org/docs/2.1.1/ml-pipeline.html
Regarding kafka, you should be able to add it in this file: https://github.com/databricks/spark-deep-learning/blob/master/python/requirements.txt
@thunterdb TFTextTransformer is a tool like StringIndexer in MLlib: we use it to transform the DataFrame and then feed the new DataFrame to TFTextFileEstimator. There seems to be no need to split this into two PRs.
Word2Vec is used only to compute a map from each word to its vector. We do not need the Word2Vec model's transform function.
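To make the point above concrete, here is a minimal sketch of the idea: a word-to-vector map is built once, and each tokenized text is then turned into a fixed-length 2-D array of embeddings. This is not the code from the PR. The helper names `build_word_map` and `text_to_matrix`, the `seq_len` and `dim` parameters, and the choice of a zero vector for unknown words and padding are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Vector = List[float]


def build_word_map(pairs: Iterable[Tuple[str, Sequence[float]]]) -> Dict[str, Vector]:
    """Collect (word, vector) pairs into a lookup table.

    Hypothetical helper: in practice the pairs could come from a trained
    Word2Vec model; here they are passed in directly.
    """
    return {word: list(vec) for word, vec in pairs}


def text_to_matrix(
    tokens: Sequence[str],
    word_map: Dict[str, Vector],
    seq_len: int,
    dim: int,
) -> List[Vector]:
    """Turn a token list into a seq_len x dim matrix.

    Assumptions for this sketch: unknown words map to a zero vector,
    long texts are truncated, and short texts are padded with zero rows.
    """
    zero = [0.0] * dim
    rows = [list(word_map.get(tok, zero)) for tok in tokens[:seq_len]]
    rows.extend([0.0] * dim for _ in range(seq_len - len(rows)))
    return rows


# Toy vectors, not taken from any real model.
word_map = build_word_map([("spark", [0.1, 0.2]), ("deep", [0.3, 0.4])])
matrix = text_to_matrix("spark deep learning".split(), word_map, seq_len=4, dim=2)
```

Only the lookup table is needed at transform time, which is why the Word2Vec model's own transform step is not used in this sketch.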
Codecov Report
Merging #56 into master will decrease coverage by 3.53%. The diff coverage is 62.79%.
@@ Coverage Diff @@
## master #56 +/- ##
==========================================
- Coverage 82.82% 79.29% -3.54%
==========================================
Files 23 25 +2
Lines 1217 1473 +256
Branches 5 5
==========================================
+ Hits 1008 1168 +160
- Misses 209 305 +96
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/sparkdl/transformers/keras_applications.py | 93.93% <100%> (+2.1%) | :arrow_up: |
| python/sparkdl/transformers/utils.py | 100% <100%> (ø) | :arrow_up: |
| python/sparkdl/transformers/named_image.py | 93.51% <100%> (ø) | :arrow_up: |
| ...ython/sparkdl/estimators/tf_text_file_estimator.py | 48.02% <48.02%> (ø) | |
| python/sparkdl/transformers/tf_text.py | 78.26% <78.26%> (ø) | |
| python/sparkdl/param/shared_params.py | 80.88% <82.5%> (+0.67%) | :arrow_up: |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 3f668d9...99d2b30.