CoP: Data Science: Create Text Analysis Tutorial
Overview
Update the Text Analysis page with resources and an article header.
Action Items
- [ ] Create a Google Doc in the folder provided under resources
- [ ] Draft an introductory paragraph explaining what the tutorial resources cover and why a new data scientist would use them when working with data at Hack for LA
- [ ] Identify resources with vetted tutorials covering important skills within the tutorial area, and add them to the draft
- [ ] Review the draft with the Data Science CoP
- [ ] Add to the wiki page
Resources/Instructions
Wiki page
Location for any files you might need to upload (drafts, images, etc.)
Core tools that should be mentioned:
- nltk
- SpaCy
Examples of resources that would be useful to include:
- Web how-to guides, tutorials, and walkthroughs
- YouTube playlists or videos demonstrating tools
- Links to blogs or platforms with subject matter experts
Hi - I might be able to work on this. I brainstormed an outline of topics, but would appreciate some feedback on scope and where to draw the line. Should this be more focused on text processing/analysis fundamentals, or do we want to go all the way to building and evaluating predictive/classification models?
Potential tutorials:
- Google Cloud NLP API: How-to Guides | Cloud Natural Language API | Google Cloud
- Google Developers ML text classification: Introduction | Machine Learning | Google for Developers
Use Cases
- Most frequent tokens, entities (beginner; sketch below)
- Prediction/Classification (intermediate)
- Sentiment, Next Word, other user-defined outcomes
- Language Inference, Translation (advanced)
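For the beginner use case, a minimal sketch of counting frequent tokens with nltk and pulling named entities with spaCy; the sample sentence and the en_core_web_sm model are placeholders:

```python
import nltk
import spacy
from nltk.corpus import stopwords

# One-time downloads; newer nltk releases may also need nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

text = "Hack for LA volunteers analyze civic data for Los Angeles communities."

# Most frequent tokens: tokenize, drop stop words and punctuation, then count
tokens = nltk.word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
words = [t for t in tokens if t.isalpha() and t not in stop_words]
print(nltk.FreqDist(words).most_common(5))

# Entities with spaCy (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
```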
Text Analysis Basics
Pre-processing (sketch below)
- Stop Words
- Lemmatization vs Stemming
- Tokenization
- N-grams
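A rough pre-processing sketch with nltk covering tokenization, stop words, stemming vs lemmatization, and n-grams; the sample sentence is made up:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# One-time downloads; newer nltk releases may also need "punkt_tab"
for pkg in ["punkt", "stopwords", "wordnet", "omw-1.4"]:
    nltk.download(pkg)

text = "The grantees are running community art programs across Los Angeles."

# Tokenization
tokens = nltk.word_tokenize(text.lower())

# Stop word removal
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming chops suffixes; lemmatization maps tokens to dictionary forms
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in content])        # e.g. "running" -> "run"
print([lemmatizer.lemmatize(t) for t in content])

# N-grams: contiguous token sequences (bigrams here)
print(list(ngrams(content, 2)))
```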
Analysis basics (with libraries; sketch below)
- Named Entity Recognition
- Part of Speech Tagging
- Topic Modeling
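A minimal spaCy sketch for named entity recognition and part of speech tagging; the sentence is made up, and topic modeling isn't shown here since it would need something like gensim or scikit-learn's LDA:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The Department of Arts and Culture funded murals in Long Beach last year.")

# Named Entity Recognition: labeled spans such as ORG, GPE, DATE
print([(ent.text, ent.label_) for ent in doc.ents])

# Part of Speech Tagging: coarse (pos_) and fine-grained (tag_) tags per token
print([(token.text, token.pos_, token.tag_) for token in doc])
```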
More advanced...
Vectorization/Encoding (sketch below)
- One-hot encoding (beginner)
- Bag of words (BOW) (beginner)
- TF-IDF (intermediate)
- Word embeddings (advanced)
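A sketch of one-hot, bag-of-words, and TF-IDF encodings. It assumes a recent scikit-learn, which isn't on the core tools list above but is the usual choice; word embeddings would instead come from spaCy's medium/large models or gensim, not shown here:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "community art grants in los angeles",
    "organizational grants for arts programs",
]

# Bag of words: raw token counts per document
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# One-hot style presence/absence matrix: same vocabulary, binary values
one_hot = CountVectorizer(binary=True)
print(one_hot.fit_transform(docs).toarray())

# TF-IDF: counts reweighted so terms shared by every document count for less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```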
Building Models (sketch below)
- Logistic Regression
- MLP (multilayer perceptron)
- SVM (support vector machine)
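A sketch of the modeling step, assuming scikit-learn and made-up labels: TF-IDF features feeding a logistic regression, with the SVM and MLP variants noted as drop-in swaps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up labeled examples just to show the shape of the workflow
texts = [
    "mural project for neighborhood youth",
    "staff salaries and office rent support",
    "public sculpture installation downtown",
    "general operating support for the organization",
]
labels = ["art", "operations", "art", "operations"]

# TF-IDF features feeding a linear classifier; sklearn.svm.LinearSVC (SVM) or
# sklearn.neural_network.MLPClassifier (MLP) drop in the same way for comparison
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["funding for a community art workshop"]))
```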
Hyperparameter Tuning (sketch below)
- Bias/variance tradeoff
- Learning rate, epochs, etc.
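A possible tuning sketch, again assuming scikit-learn: a small grid search over learning rate, iteration budget, and regularization strength (the alpha term is one concrete handle on the bias/variance tradeoff). The data and parameter values are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy labeled data; cv=2 below only because this set is tiny
texts = [
    "mural project for neighborhood youth",
    "staff salaries and office rent support",
    "public sculpture installation downtown",
    "general operating support for the organization",
]
labels = ["art", "operations", "art", "operations"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MLPClassifier())])

# Search over learning rate, training iterations (epochs), and L2 penalty (alpha);
# stronger regularization trades variance for bias
grid = GridSearchCV(
    pipeline,
    param_grid={
        "clf__learning_rate_init": [0.001, 0.01],
        "clf__max_iter": [200, 500],
        "clf__alpha": [1e-4, 1e-2],
    },
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_)
print(round(grid.best_score_, 3))
```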
Metrics (sketch below)
- Accuracy, Precision, Recall, and F1
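A metrics sketch, assuming scikit-learn, with made-up true/predicted labels:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true vs predicted labels for a two-class text classifier
y_true = ["art", "operations", "art", "operations", "art"]
y_pred = ["art", "art", "art", "operations", "operations"]

# Overall accuracy, then per-class precision, recall, and F1 with averages
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```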
Feedback from DS meeting:
- Connect topics and tutorials to HFLA datasets and projects
- Build tutorial/notebook using HFLA dataset
- [x] Review 311 dataset to see if there's textual data that's a good fit
- [ ] Explore alternatives: affordable housing or scraping HFLA agenda issues
- [x] Identified 2 datasets with textual data: Los Angeles County Department of Arts and Culture's data on Community Impact Art Grants and Organizational Grants Program
- [ ] Create tutorial (Jupyter notebook)
  - Pre-processing
    - [ ] Using the nltk library (stop words, tokenization, stemming); see the sketch below
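A possible skeleton for the notebook's pre-processing cell; the CSV file name and column name are hypothetical stand-ins for the actual grants export:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads; newer nltk releases may also need nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

# Hypothetical file and column names; replace with the actual export of the
# Community Impact Art Grants / Organizational Grants Program data
df = pd.read_csv("community_impact_art_grants.csv")
descriptions = df["project_description"].fillna("")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words and punctuation, then stem."""
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

df["tokens"] = descriptions.apply(preprocess)
print(df["tokens"].head())
```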