Epic: Create Text Analysis Tutorials
Overview
Create an Epic that delivers a cohesive, hands-on NLP tutorial series—each major task as its own sub-issue built from a shared template—covering data → modeling → evaluation → lightweight deployment. The goal is résumé-ready artifacts (code, reports, and dashboards) that rely on public datasets/APIs and do not require external stakeholders. Due to their complexity, these are considered intermediate tutorials.
Action Items
In the beginning (research & design)
- Use the Tutorial Template below (the single source of truth for all sub-issues) to create sub-issues.
- Pick open datasets per tutorial (two options each; small + medium), e.g., IMDB/AG News (classification), CoNLL-2003 (NER), SQuAD (QA), CNN/DailyMail (summarization), STS-benchmark (similarity), Tatoeba/WMT (MT).
- Scaffold the repo: `/tutorials/<slug>/` with `notebooks/`, `data/README.md`, `src/`, `tests/`, `reports/`.
- Draft the acceptance rubric (reproducibility, metrics table, ablation, README quality).
If researched and ready (implementation steps)
- Create sub-issues (tutorials) from this Epic (one per line):
- T0: Setup & Template Walkthrough (Conda/Poetry + Colab; data hygiene; eval reproducibility)
- T1: Text Cleaning & Tokenization (regex/normalization; Byte-Pair vs WordPiece)
- T2: Featurization (BoW/TF-IDF vs subword embeddings; when to use which)
- T3: Text Classification (baseline logistic/SVM vs Transformer fine-tune)
- T4: Named Entity Recognition (NER) (spaCy pipeline vs HF token-classification)
- T5: Sequence Labeling & POS/Chunking (CRF/Perceptron vs Transformer)
- T6: Question Answering (extractive) (baseline BM25 + reader vs DistilBERT QA)
- T7: Summarization (extractive vs abstractive; length control)
- T8: Semantic Search & Similarity (Sentence-Transformers vs spaCy vectors; ANN index)
- T9: Topic Modeling (LDA/gensim vs BERTopic; coherence vs purity)
- T10: Machine Translation (intro) (seq2seq attention vs pre-trained MT)
- T11: RAG Basics (indexing, retrieval, prompt construction; small corpus)
- T12: Evaluation & Error Analysis (confusions, calibration, slices, robustness)
- T13: Data Augmentation & Fairness (back-translation, paraphrase; bias checks)
- T14: Lightweight Deployment (FastAPI/Flask inference, input validation, simple guardrails)
- Instantiate the Tutorial Template into each sub-issue with: scope, dataset links, baseline(s), target metrics, stretch goals, and checklists.
- Seed code: minimal, runnable notebooks for T0–T3; CI sanity check (lint, plus unit tests for tokenization/vectorization; see the test sketch after this list).
- Reporting: a standard `reports/<slug>.md` generated from each notebook (e.g., via nbconvert); include metrics tables and top-error examples.
- Quality gate: PR checklist (data determinism, fixed seeds, metrics ≥ baseline, README complete).
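A minimal sketch of the kind of CI sanity test intended for the seed code above, assuming pytest and scikit-learn are installed; the file path and test names are illustrative only:

```python
# tests/test_vectorization.py -- illustrative only; adapt names and cases to each tutorial.
import pytest
from sklearn.feature_extraction.text import TfidfVectorizer


def test_tfidf_vocabulary_is_deterministic():
    """Fitting twice on the same corpus should yield the same vocabulary."""
    corpus = ["the cat sat", "the dog ran", "the cat ran"]
    assert TfidfVectorizer().fit(corpus).vocabulary_ == TfidfVectorizer().fit(corpus).vocabulary_


def test_tfidf_empty_document_is_all_zeros():
    """An empty document should vectorize to an all-zero row, not raise."""
    vec = TfidfVectorizer().fit(["some training text", "more training text"])
    assert vec.transform([""]).nnz == 0


@pytest.mark.parametrize("text,min_tokens", [("hello world", 2), ("tokenizers split text", 3)])
def test_default_analyzer_token_count(text, min_tokens):
    """The default analyzer should produce at least the expected number of tokens."""
    analyzer = TfidfVectorizer().build_analyzer()
    assert len(analyzer(text)) >= min_tokens
```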
Resources/Instructions
- General docs: scikit-learn (https://scikit-learn.org/), spaCy (https://spacy.io/), NLTK (https://www.nltk.org/), gensim (https://radimrehurek.com/gensim/)
- Transformers & embeddings: Hugging Face Transformers (https://huggingface.co/docs/transformers), Sentence-Transformers (https://www.sbert.net/)
- Pipelines & RAG: Haystack (https://docs.haystack.deepset.ai/)
- Serving & tracking: FastAPI (https://fastapi.tiangolo.com/), Flask (https://flask.palletsprojects.com/), MLflow (https://mlflow.org/), Weights & Biases (https://wandb.ai/)
- Datasets: Hugging Face Datasets catalog (https://huggingface.co/datasets) and Papers with Code tasks (https://paperswithcode.com/area/nlp)
- Template (copy into each sub-issue):
# Title & Overview
**Template:** *{NLP Task}: An Intermediate, End-to-End Analysis Tutorial*
**Overview (≤2 sentences):** Summarize what learners will build and why this is **intermediate** (e.g., beyond “fit a model,” toward rigorous analysis, evaluation, and reproducibility).
# Purpose
State the value-add beyond beginner level: defensible baselines vs small transformers, robust featurization, slice-based error analysis, reproducible experiments, and light deployment/reporting.
# Prerequisites
* Skills: Python, Git, virtual envs; basic NumPy/pandas; ML basics (train/val/test, overfitting, regularization).
* NLP: tokenization (wordpiece/BPE), embeddings vs TF-IDF, evaluation metrics (accuracy, macro-F1).
* Tooling (pick 1 per pair when applicable): spaCy **or** NLTK; scikit-learn **or** gensim; Hugging Face Transformers **or** Haystack.
# Setup Instructions
* Environment: Conda/Poetry (Python 3.11), deterministic seeds, `.env` for secrets/paths (see the seeding sketch after this list).
* Install (choose 1 per pair, keep consistent across the tutorial):
* Data: pandas **or** polars
* Vectorization: scikit-learn TF-IDF **or** gensim
* Embeddings: Sentence-Transformers **or** spaCy pipelines
* Topic modeling (if relevant): gensim LDA **or** BERTopic
* Evaluation: scikit-learn **or** seqeval/torchmetrics
* Serving: FastAPI **or** Flask
* Tracking: MLflow **or** Weights & Biases
* Data: pick a public dataset (HF Datasets), note license, splits, and any filters.
* Repo layout: `src/`, `notebooks/`, `configs/`, `data/README.md`, `reports/`, `tests/`.
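A minimal sketch of the deterministic-seed and `.env` setup from the first bullet above, assuming the `python-dotenv` package; the module path, environment-variable names, and the optional torch block are placeholders to adapt:

```python
# src/setup_env.py -- sketch only; variable names and defaults are placeholders.
import os
import random

import numpy as np
from dotenv import load_dotenv  # provided by the python-dotenv package


def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness so runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    # Note: PYTHONHASHSEED only affects subprocesses; export it in the shell for the current run.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch is optional for the classical baselines


load_dotenv()  # pulls secrets/paths (e.g., DATA_DIR, tracking keys) from a local .env file
set_seed(int(os.getenv("SEED", "42")))
DATA_DIR = os.getenv("DATA_DIR", "data/")
```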
# Core Concepts
Explain the *why* behind the workflow:
* Tokenization tradeoffs (word/char/subword) and OOV handling (see the tokenizer sketch after this list).
* Featurization: n-grams vs dense embeddings; when each is preferable.
* Baseline-first philosophy; calibration & thresholding; imbalanced data strategy.
* Slice-based error analysis; reproducibility and run governance (config + seed + commit hash).
* Clear separation of *model core* vs *guardrails/safety layers* (input validation, schema checks).
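To make the tokenization/OOV tradeoff concrete, a small demo assuming the Hugging Face `transformers` package is installed; the checkpoint is just one common WordPiece example, not a requirement:

```python
# Subword tokenizers split rare words into known pieces instead of emitting a single <unk>,
# which is why they largely avoid the OOV problem of word-level vocabularies.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any WordPiece/BPE checkpoint works

for word in ["running", "unbelievability", "tokenization"]:
    print(f"{word:>16} -> {tokenizer.tokenize(word)}")
```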
# Step-by-Step Walkthrough
1. **Data intake & splits** (stratified, reproducible) → **EDA** (class balance, length stats).
2. **Baseline A (classical):** TF-IDF + {LogReg | LinearSVM}; tune `C`, `n-gram` range (see the baseline sketch after this walkthrough).
3. **Baseline B (neural):** small Transformer fine-tune; freeze/unfreeze plan; LR warmup; early stopping.
4. **Evaluation:** macro-F1, accuracy; calibration (ECE); per-slice metrics (by length/domain/label).
5. **Error analysis:** confusion pairs, hardest examples, failure taxonomies, OOD checks.
6. **Reporting:** export metrics tables, error examples, and decisions to `reports/<slug>.md`.
7. *(Optional)* **Serve:** minimal FastAPI/Flask endpoint with input schema + simple guardrails.
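A minimal sketch of steps 1–4 for a classification task, assuming pandas and scikit-learn; the CSV path and the `text`/`label` column names are placeholders for whichever dataset the sub-issue selects:

```python
# Baseline A: TF-IDF + logistic regression with a stratified, reproducible split.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("data/train.csv")  # placeholder path; see data/README.md for provenance
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)
print("macro-F1:", f1_score(y_test, preds, average="macro"))
print(classification_report(y_test, preds))  # per-class view feeds the error-analysis step
```

Baseline B (the small Transformer) should reuse the same split and metric so the comparison stays apples-to-apples.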
# Hands-On Exercises
* Ablations: tokenizer choice, `n-gram` window, max seq-len, LR/epochs, freeze ratio.
* Robustness: noise/typos, domain-shift split; evaluate the delta in macro-F1 (see the noise-injection sketch after this list).
* Fairness/slices: compare metrics across salient attributes (if present).
* Stretch: knowledge-distill to a smaller model; ANN semantic search demo with embeddings.
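One possible shape for the robustness exercise, reusing the fitted `pipeline` and held-out split from the walkthrough sketch above; the character-substitution noise model and 5% rate are deliberately simplistic:

```python
# Inject character-level typos into the test set and measure the macro-F1 delta.
import random
import string

from sklearn.metrics import f1_score


def add_typos(text: str, rate: float = 0.05, seed: int = 42) -> str:
    """Randomly replace a fraction of alphabetic characters to simulate noisy input."""
    rng = random.Random(seed)
    chars = [
        rng.choice(string.ascii_lowercase) if c.isalpha() and rng.random() < rate else c
        for c in text
    ]
    return "".join(chars)


clean_f1 = f1_score(y_test, pipeline.predict(X_test), average="macro")
noisy_f1 = f1_score(y_test, pipeline.predict([add_typos(t) for t in X_test]), average="macro")
print(f"macro-F1 clean={clean_f1:.3f} noisy={noisy_f1:.3f} delta={clean_f1 - noisy_f1:.3f}")
```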
# Common Pitfalls & Troubleshooting
* **Data leakage:** duplicates across splits, preprocessing mismatch train vs infer.
* **Tokenizer/model mismatch:** bad vocab; truncation causing label drift.
* **Metrics misuse:** micro vs macro; thresholding without calibration.
* **Resource issues:** OOM from long sequences; mitigate with truncation + grad accumulation.
* **Error handling pattern:** wrap I/O in `try/except` with clear `print()` messages for file-not-found and retries; **fail fast** with a trace on data type mismatches; log warnings for recoverable issues (sketched below).
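One way the error-handling pattern above might look in practice; the `text` column check, retry count, and wait time are illustrative rather than prescriptive:

```python
# Retry recoverable I/O with clear messages; fail fast (with a traceback) on type mismatches.
import time

import pandas as pd


def load_dataset_csv(path: str, retries: int = 3, wait_seconds: float = 2.0) -> pd.DataFrame:
    for attempt in range(1, retries + 1):
        try:
            df = pd.read_csv(path)
            break
        except FileNotFoundError:
            print(f"[warn] attempt {attempt}/{retries}: no file at {path}; retrying...")
            time.sleep(wait_seconds)
    else:
        raise FileNotFoundError(f"Giving up after {retries} attempts: {path}")

    # Schema/type problems are not recoverable: raise immediately so the traceback surfaces.
    if not pd.api.types.is_string_dtype(df["text"]):
        raise TypeError(f"Expected a string 'text' column, got {df['text'].dtype}")
    return df
```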
# Best Practices
* Track config, seeds, and git commit per run; log artifacts (confusion matrices, PR curves); a minimal run-manifest sketch follows this list.
* Keep a **Baseline → Improvement** narrative; avoid overfitting the leaderboard.
* Small, testable functions; unit tests for tokenization/vectorization/metrics.
* Separate core model from guardrails (validation, schema, limits) for clarity and maintainability.
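A stdlib-only sketch of the run-governance idea (config + seed + commit hash per run); MLflow or Weights & Biases would replace this once tracking is wired up, and the output directory and keys are placeholders:

```python
# Write a small JSON manifest per run so every metric is traceable to code, config, and seed.
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def git_commit_hash() -> str:
    """Return the current commit hash (requires running inside the git repo)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def log_run(config: dict, metrics: dict, out_dir: str = "reports/runs") -> Path:
    run = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": git_commit_hash(),
        "config": config,    # include the seed and every hyperparameter that affects results
        "metrics": metrics,  # e.g., macro-F1, accuracy, ECE
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"run_{run['timestamp'].replace(':', '-')}.json"
    path.write_text(json.dumps(run, indent=2))
    return path
```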
# Reflection & Discussion Prompts
* Where does the model fail and why? What additional signals or data would fix it?
* How would you adapt this pipeline to civic-tech/public datasets (privacy, governance, bias)?
* What’s the real-world impact of a +Δ macro-F1 improvement in this task?
# Next Steps / Advanced Extensions
* Domain adaptation; PEFT/LoRA; quantization for CPU.
* Data augmentation (BT/paraphrase), prompt-augmented retrieval (RAG) for retrieval-heavy tasks.
* Batch inference pipeline; lightweight monitoring (drift checks, alerting).
# Glossary / Key Terms
Define task-specific terms (e.g., subword, OOV, calibration, ECE, macro-F1, slice, leakage, PEFT, LoRA).
# Additional Resources
List 4–6 canonical links (docs for scikit-learn, spaCy, HF Transformers & Datasets, Sentence-Transformers, BERTopic/gensim, FastAPI/Flask, MLflow/W&B).
# Contributors
Author(s), reviewer(s), maintainer(s), date updated; dataset license notes.
# Issues Referenced
Link back to the Epic and the sub-issue this tutorial belongs to; include any prior discussions or decisions.
- 311 data:
- Not applicable. This Epic uses public datasets/APIs and does not require 311 access.
- One-time vs ongoing dump: N/A
- Subset definition: N/A
- Access method: N/A
Created 15 sub-issues.