CoP: Data Science: Create Text Analysis Tutorial
Overview
Update the Text Analysis page with resources and an article header.
Action Items
- [ ] Create a Google Doc in the folder provided under resources
- [ ] Draft an introductory paragraph explaining what the tutorial resources cover and why a new data scientist would use them when working with data at Hack for LA
- [ ] Identify resources with vetted tutorials covering important skills within the tutorial area, and add them to the draft
- [ ] Review the draft with the Data Science CoP
- [ ] Add to the wiki page
Resources/Instructions
Wiki page
Location for any files you might need to upload (drafts, images, etc.)
Core tools that should be mentioned:
- nltk
- SpaCy
Examples of resources that would be useful to include:
- Web how-to guides, tutorials, and walkthroughs
- YouTube playlists or videos demonstrating tools
- Links to blogs or platforms with subject matter experts
Hi - I might be able to work on this. I brainstormed an outline of topics, but would appreciate some feedback on scope and where to draw the line. Should this be more focused on text processing/analysis fundamentals, or do we want to go all the way to building and evaluating predictive/classification models?
Potential tutorials:
- Google Cloud NLP API: How-to Guides | Cloud Natural Language API | Google Cloud
- Google Developers ML text classification: Introduction | Machine Learning | Google for Developers
Use Cases
- Most frequent tokens, entities (beginner; sketch below)
- Prediction/Classification (intermediate)
- Sentiment, Next Word, other user-defined outcomes
- Language Inference, Translation (advanced)
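For the beginner use case, a minimal sketch of counting frequent tokens with nltk and pulling named entities with spaCy; the sample sentence and the en_core_web_sm model are placeholders:

```python
import nltk
import spacy
from nltk.corpus import stopwords

# One-time downloads; newer nltk releases may also need nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

text = "Hack for LA volunteers analyze civic data for Los Angeles communities."

# Most frequent tokens: tokenize, drop stop words and punctuation, then count
tokens = nltk.word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
words = [t for t in tokens if t.isalpha() and t not in stop_words]
print(nltk.FreqDist(words).most_common(5))

# Entities with spaCy (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
```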
Text Analysis Basics
Pre-processing (sketch below)
- Stop Words
- Lemmatization vs Stemming
- Tokenization
- N-grams
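A rough pre-processing sketch with nltk covering tokenization, stop words, stemming vs lemmatization, and n-grams; the sample sentence is made up:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# One-time downloads; newer nltk releases may also need "punkt_tab"
for pkg in ["punkt", "stopwords", "wordnet", "omw-1.4"]:
    nltk.download(pkg)

text = "The grantees are running community art programs across Los Angeles."

# Tokenization
tokens = nltk.word_tokenize(text.lower())

# Stop word removal
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming chops suffixes; lemmatization maps tokens to dictionary forms
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in content])        # e.g. "running" -> "run"
print([lemmatizer.lemmatize(t) for t in content])

# N-grams: contiguous token sequences (bigrams here)
print(list(ngrams(content, 2)))
```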
Analysis basics (with libraries; sketch below)
- Named Entity Recognition
- Part of Speech Tagging
- Topic Modeling
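A minimal spaCy sketch for named entity recognition and part of speech tagging; the sentence is made up, and topic modeling isn't shown here since it would need something like gensim or scikit-learn's LDA:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The Department of Arts and Culture funded murals in Long Beach last year.")

# Named Entity Recognition: labeled spans such as ORG, GPE, DATE
print([(ent.text, ent.label_) for ent in doc.ents])

# Part of Speech Tagging: coarse (pos_) and fine-grained (tag_) tags per token
print([(token.text, token.pos_, token.tag_) for token in doc])
```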
More advanced...
Vectorization/Encoding (sketch below)
- One-hot encoding (beginner)
- Bag of words (BOW) (beginner)
- TF-IDF (intermediate)
- Word embeddings (advanced)
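A sketch of one-hot, bag-of-words, and TF-IDF encodings. It assumes a recent scikit-learn, which isn't on the core tools list above but is the usual choice; word embeddings would instead come from spaCy's medium/large models or gensim, not shown here:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "community art grants in los angeles",
    "organizational grants for arts programs",
]

# Bag of words: raw token counts per document
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# One-hot style presence/absence matrix: same vocabulary, binary values
one_hot = CountVectorizer(binary=True)
print(one_hot.fit_transform(docs).toarray())

# TF-IDF: counts reweighted so terms shared by every document count for less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```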
Building Models (sketch below)
- Logistic Regression
- MLP (multilayer perceptron)
- SVM (support vector machine)
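A sketch of the modeling step, assuming scikit-learn and made-up labels: TF-IDF features feeding a logistic regression, with the SVM and MLP variants noted as drop-in swaps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up labeled examples just to show the shape of the workflow
texts = [
    "mural project for neighborhood youth",
    "staff salaries and office rent support",
    "public sculpture installation downtown",
    "general operating support for the organization",
]
labels = ["art", "operations", "art", "operations"]

# TF-IDF features feeding a linear classifier; sklearn.svm.LinearSVC (SVM) or
# sklearn.neural_network.MLPClassifier (MLP) drop in the same way for comparison
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["funding for a community art workshop"]))
```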
Hyperparameter Tuning (sketch below)
- Bias/variance tradeoff
- Learning rate, epochs, etc.
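A possible tuning sketch, again assuming scikit-learn: a small grid search over learning rate, iteration budget, and regularization strength (the alpha term is one concrete handle on the bias/variance tradeoff). The data and parameter values are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy labeled data; cv=2 below only because this set is tiny
texts = [
    "mural project for neighborhood youth",
    "staff salaries and office rent support",
    "public sculpture installation downtown",
    "general operating support for the organization",
]
labels = ["art", "operations", "art", "operations"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MLPClassifier())])

# Search over learning rate, training iterations (epochs), and L2 penalty (alpha);
# stronger regularization trades variance for bias
grid = GridSearchCV(
    pipeline,
    param_grid={
        "clf__learning_rate_init": [0.001, 0.01],
        "clf__max_iter": [200, 500],
        "clf__alpha": [1e-4, 1e-2],
    },
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_)
print(round(grid.best_score_, 3))
```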
Metrics (sketch below)
- Accuracy, Precision, Recall, and F1
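A metrics sketch, assuming scikit-learn, with made-up true/predicted labels:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true vs predicted labels for a two-class text classifier
y_true = ["art", "operations", "art", "operations", "art"]
y_pred = ["art", "art", "art", "operations", "operations"]

# Overall accuracy, then per-class precision, recall, and F1 with averages
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```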
Feedback from DS meeting:
- Connect topics and tutorials to HFLA datasets and projects
- Build tutorial/notebook using HFLA dataset
- [x] Review 311 dataset to see if there's textual data that's a good fit
- [ ] Explore alternatives: affordable housing or scraping HFLA agenda issues
- [x] Identified 2 datasets with textual data: Los Angeles County Department of Arts and Culture's data on Community Impact Art Grants and Organizational Grants Program
- [ ] Create tutorial (Jupyter notebook)
  - Pre-processing
    - [ ] Using the nltk library (stop words, tokenization, stemming); see the sketch below
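A possible skeleton for the notebook's pre-processing cell; the CSV file name and column name are hypothetical stand-ins for the actual grants export:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads; newer nltk releases may also need nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

# Hypothetical file and column names; replace with the actual export of the
# Community Impact Art Grants / Organizational Grants Program data
df = pd.read_csv("community_impact_art_grants.csv")
descriptions = df["project_description"].fillna("")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words and punctuation, then stem."""
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

df["tokens"] = descriptions.apply(preprocess)
print(df["tokens"].head())
```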