CoP: Data Science: Create Text Analysis Tutorial

Open akhaleghi opened this issue 3 years ago • 3 comments

Overview

Update the Text Analysis page with resources and an article header.

Action Items

  • [ ] Create a Google Doc in the folder provided under resources
    • [ ] Draft an introductory paragraph explaining what the tutorial resources cover and why a new data scientist would use them when working with data at Hack for LA
    • [ ] Identify resources with vetted tutorials covering important skills within the tutorial area, adding to the draft
  • [ ] Review the draft with the Data Science CoP
  • [ ] Add to the wiki page

Resources/Instructions

Wiki page

Text Analysis Tutorial

Location for any files you might need to upload (drafts, images, etc.)

Core tools that should be mentioned:

  • nltk
  • spaCy

Examples of resources that would be useful to include:

  • Web how-tos, tutorials, and walk-throughs
  • YouTube playlists or videos demonstrating tools
  • Links to blogs or platforms with subject matter experts

akhaleghi, Apr 01 '22 19:04

Hi - I might be able to work on this. I brainstormed an outline of topics, but would appreciate some feedback on scope and where to draw the line. Should this focus on text processing/analysis fundamentals, or do we want to go all the way to building and evaluating predictive/classification models?

Potential tutorials:

Use Cases

  • Most frequent tokens, entities (beginner)
  • Prediction/Classification (intermediate)
    • Sentiment, Next Word, other user-defined outcomes
  • Language Inference, Translation (advanced)

Text Analysis Basics

Pre-processing

  • Stop Words
  • Lemmatization vs Stemming
  • Tokenization
  • N-grams
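
As a rough illustration of the pre-processing steps above, here is a minimal sketch in plain Python. The stop-word list and the sample sentence are made up for illustration, and the function names are hypothetical; a real tutorial would more likely lean on nltk or spaCy, which also cover stemming and lemmatization (not shown here).

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# nltk and spaCy ship much fuller lists.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample sentence, not taken from any HFLA dataset.
text = "Grants for the arts fund arts education in the community."

tokens = remove_stop_words(tokenize(text))
bigrams = ngrams(tokens, 2)
top_tokens = Counter(tokens).most_common(1)

print(tokens)
print(bigrams)
print(top_tokens)
```

The same `Counter` call also covers the beginner "most frequent tokens" use case listed earlier.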

Analysis basics (with libraries)

  • Named Entity Recognition
  • Part of Speech Tagging
  • Topic Modeling

More advanced...

Vectorization/Encoding

  • One-hot encoding (beginner)
  • Bag of words (BOW) (beginner)
  • TF-IDF (intermediate)
  • Word embeddings (advanced)
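
To make the difference between bag of words and TF-IDF concrete, here is a small sketch in plain Python. The three token lists are invented, the helper names are hypothetical, and the formula is the unsmoothed textbook variant (term frequency times log of inverse document frequency); library implementations often use smoothed or normalized versions, so their numbers may differ.

```python
import math
from collections import Counter

# Invented, already-tokenized documents for illustration.
docs = [
    ["arts", "grant", "community"],
    ["community", "garden", "grant"],
    ["arts", "education"],
]

# Fixed, sorted vocabulary so every vector has the same column order.
vocab = sorted({token for doc in docs for token in doc})


def bag_of_words(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def idf(term):
    """Unsmoothed inverse document frequency: log(N / document frequency)."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)


def tf_idf(doc):
    """Weight each term's relative frequency by its IDF."""
    counts = Counter(doc)
    return [counts[term] / len(doc) * idf(term) for term in vocab]


print(vocab)
print(bag_of_words(docs[0]))
print(tf_idf(docs[1]))
```

In the second document, "garden" appears in only one document while "community" appears in two, so TF-IDF gives "garden" the larger weight even though both have the same raw count.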

Building Models

  • Logistic Regression
  • MLP
  • SVM

Hyperparameter Tuning

  • Bias/variance tradeoff
  • Learning rate, epochs, etc.

Metrics

  • Accuracy, Precision, Recall, and F1
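
For reference, the four metrics can be computed by hand for a binary classifier as in the sketch below. The labels are made up and the function name is hypothetical; returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.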

bfang22, Apr 23 '24 01:04

Feedback from DS meeting:

  • Connect topics and tutorials to HFLA datasets and projects

  • Build tutorial/notebook using HFLA dataset

  • [x] Review 311 dataset to see if there's textual data that's a good fit

  • [ ] Explore alternatives: affordable housing or scraping HFLA agenda issues

bfang22, Apr 23 '24 02:04

  • [x] Identified two datasets with textual data, both from the Los Angeles County Department of Arts and Culture: Community Impact Art Grants and Organizational Grants Program
  • [ ] Create tutorial (Jupyter notebook)
    • pre-processing
      • [ ] using the nltk library (stopwords, tokenizers, stem)

bfang22, Jun 11 '24 01:06