Text Analysis Tutorial: Setup & Template Walkthrough
Got it — we’ll start with T0: Setup & Template Walkthrough and instantiate it into the tutorial sub-issue format. I’ll fill in the details and add a Notes field at the end for context and possible improvements. Here’s the draft:
Title & Overview
Template: Setup & Template Walkthrough: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): This tutorial introduces learners to a standardized workflow for NLP experiments, covering environment setup, data hygiene, reproducibility, and report generation. It is intermediate because it emphasizes rigorous experiment scaffolding and governance rather than just “getting a model to run.”
Purpose
The value-add is building defensible, reproducible baselines while setting up a robust project structure. Learners move beyond single notebooks toward versioned experiments, config management, error analysis scaffolding, and light reporting/serving foundations.
Prerequisites
- Skills: Python, Git, virtual envs; basic NumPy/pandas; ML basics (train/val/test, overfitting, regularization).
- NLP: tokenization (wordpiece/BPE), embeddings vs TF-IDF, evaluation metrics (accuracy, macro-F1).
- Tooling: scikit-learn or gensim; spaCy or NLTK; Hugging Face Transformers or Haystack.
Setup Instructions
- Environment: Conda/Poetry (Python 3.11), deterministic seeds, `.env` for secrets/paths (e.g., `MLFLOW_TRACKING_URI`, `DATA_CACHE_DIR`).
- Install: pandas, scikit-learn, spaCy, Hugging Face Transformers, Datasets, MLflow, FastAPI, Uvicorn.
- Dataset: use the IMDB (small, binary) and AG News (medium, 4-class) classification datasets from the HF Datasets catalog. Both ship with train/test splits (the baseline script carves a validation split out of train when none is provided); check each dataset card for current license terms.
- Repo layout:
```
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
```
Core Concepts
- Determinism in ML experiments: seeds, config files, pinned deps.
- Reproducibility: track dataset versions, metrics, and commits.
- Data hygiene: leakage prevention, split integrity, license notes.
- Governance: documenting metrics tables, configs, and error analysis.
- Guardrails: schema validation, simple checks before training or serving.
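To make the last bullet concrete, here is a minimal, hypothetical guardrail sketch: validating `configs/baseline.yaml` with pydantic (already pinned in the requirements) before any training or serving run. Field names mirror the config shown later in the walkthrough; treat it as a starting point, not a required part of the template.

```python
# Hypothetical guardrail: validate the experiment config before training or serving.
# Assumes pydantic (pinned in requirements.txt) and the baseline.yaml fields used later.
import yaml
from pydantic import BaseModel, Field

class TfidfCfg(BaseModel):
    max_features: int = Field(gt=0)
    ngram_range: list[int] = Field(min_length=2, max_length=2)

class ModelCfg(BaseModel):
    type: str
    C: float = Field(gt=0)
    max_iter: int = Field(gt=0)

class BaselineCfg(BaseModel):
    experiment_name: str
    dataset: str
    test_size: float = Field(gt=0, lt=1)
    random_state: int
    tfidf: TfidfCfg
    model: ModelCfg

if __name__ == "__main__":
    cfg = BaselineCfg(**yaml.safe_load(open("configs/baseline.yaml")))
    print("Config OK:", cfg.experiment_name)
```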
Step-by-Step Walkthrough
What you’ll build: a tiny app that reads text (movie reviews/news), turns it into numbers (TF-IDF), trains a simple classifier (Logistic Regression), and tracks results with MLflow. Optional: a tiny FastAPI endpoint to get predictions.
- Make the project folder

Windows (PowerShell):

```powershell
# Create folders
New-Item -ItemType Directory -Force -Path tutorials\setup_template\notebooks | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\src | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\configs | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\reports | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\data | Out-Null

# Create empty files we’ll fill next
New-Item tutorials\setup_template\.env.example -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\requirements.txt -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\configs\baseline.yaml -ItemType File -Force | Out-Null
```

macOS (Apple Silicon, zsh):

```bash
mkdir -p tutorials/setup_template/{notebooks,src,configs,reports,data}
touch tutorials/setup_template/.env.example \
      tutorials/setup_template/requirements.txt \
      tutorials/setup_template/configs/baseline.yaml
```
Project layout (for reference):
```
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
```
- Create and activate Python environment (Python 3.11)
Windows (PowerShell):

```powershell
conda create -n nlp311 python=3.11 -y
conda activate nlp311
```

macOS (zsh):

```bash
conda create -n nlp311 python=3.11 -y
conda activate nlp311

# Optional: if builds fail on Apple Silicon
python -m pip install --upgrade pip wheel setuptools
```
- Install the packages
Open `tutorials/setup_template/requirements.txt` and paste:

```
pandas==2.2.2
scikit-learn==1.5.2
spacy==3.7.6
matplotlib==3.9.2
datasets==3.0.1
transformers==4.44.2
mlflow==2.16.2
python-dotenv==1.0.1
pydantic==2.9.2
fastapi==0.115.0
uvicorn==0.30.6
pyyaml==6.0.2
```

(`pyyaml` is listed explicitly because the scripts below import `yaml` directly.)
Then install:

Windows (PowerShell):

```powershell
cd tutorials\setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

macOS (zsh):

```bash
cd tutorials/setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
- Add config and environment variables
`.env.example` (then copy to `.env`):

```
MLFLOW_TRACKING_URI=./mlruns
DATA_CACHE_DIR=./.hf_cache
```

Copy the example to the real file:

Windows (PowerShell):

```powershell
Copy-Item .env.example .env -Force
```

macOS (zsh):

```bash
cp .env.example .env
```

`configs/baseline.yaml`, paste:

```yaml
experiment_name: "t0_setup_template"
dataset: "imdb"          # options: imdb, ag_news
test_size: 0.2
random_state: 42
tfidf:
  max_features: 30000
  ngram_range: [1, 2]
model:
  type: "logreg"
  C: 2.0
  max_iter: 200
metrics:
  average: "macro"
```
- Add the code files
Create and paste the code below into the files inside `src/`.

5.1 `src/utils.py`
```python
import os, random
import numpy as np

def set_all_seeds(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except Exception:
        # torch is optional; ignore if not installed
        pass

def get_env(name: str, default: str = "") -> str:
    from dotenv import load_dotenv
    load_dotenv()
    return os.getenv(name, default)
```
5.2 `src/data.py`
```python
from datasets import load_dataset
import pandas as pd
from collections import Counter

def load_text_classification(name, cache_dir=None):
    """
    Loads a Hugging Face dataset and returns 3 DataFrames:
    train_df, valid_df (or None), test_df with columns: text, label
    """
    ds = load_dataset(name, cache_dir=cache_dir)
    train_df = pd.DataFrame(ds["train"])
    test_df = pd.DataFrame(ds["test"])
    valid_df = pd.DataFrame(ds["validation"]) if "validation" in ds else None
    return train_df, valid_df, test_df

def describe_dataset(df, text_col="text", label_col="label"):
    lengths = df[text_col].astype(str).str.split().map(len)
    counts = Counter(df[label_col])
    return {
        "rows": len(df),
        "avg_tokens": float(lengths.mean()),
        "median_tokens": float(lengths.median()),
        "label_counts": dict(counts),
    }
```
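Tying back to the data-hygiene concept above, here is a small sketch (an optional helper, not one of the required files) that flags exact-duplicate texts shared between splits, assuming DataFrames shaped like the ones returned by `load_text_classification`:

```python
# Sketch of a leakage check: count exact-duplicate texts shared across splits.
# Assumes DataFrames with a "text" column, as returned by load_text_classification.
def check_split_overlap(train_df, test_df, text_col="text"):
    overlap = set(train_df[text_col].astype(str)) & set(test_df[text_col].astype(str))
    print(f"Exact-duplicate texts shared by train/test: {len(overlap)}")
    return overlap
```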
5.3 `src/eda.py`
```python
# Quick, beginner-friendly EDA that saves pictures into reports/
import os, yaml
import matplotlib.pyplot as plt
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification, describe_dataset

def main(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))
    cache = get_env("DATA_CACHE_DIR", "./.hf_cache")
    train_df, valid_df, test_df = load_text_classification(cfg["dataset"], cache_dir=cache)

    # 1) Print simple stats
    print("TRAIN:", describe_dataset(train_df))
    if valid_df is not None:
        print("VALID:", describe_dataset(valid_df))
    print("TEST :", describe_dataset(test_df))

    # 2) Plot token length histogram (train)
    lengths = train_df["text"].astype(str).str.split().map(len)
    plt.figure()
    lengths.hist(bins=50)
    plt.xlabel("Tokens per example"); plt.ylabel("Count"); plt.title("Token Lengths (train)")
    os.makedirs("reports", exist_ok=True)
    plt.savefig("reports/eda_token_lengths.png", dpi=160, bbox_inches="tight")
    print("Saved: reports/eda_token_lengths.png")

if __name__ == "__main__":
    main()
```
5.4 `src/baseline.py`
```python
import os, yaml, mlflow, mlflow.sklearn
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def run_baseline(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))
    mlflow.set_tracking_uri(get_env("MLFLOW_TRACKING_URI", "./mlruns"))
    mlflow.set_experiment(cfg["experiment_name"])

    train_df, valid_df, test_df = load_text_classification(
        cfg["dataset"], cache_dir=get_env("DATA_CACHE_DIR", "./.hf_cache")
    )
    if valid_df is None:
        train_df, valid_df = train_test_split(
            train_df, test_size=cfg["test_size"], random_state=cfg["random_state"], stratify=train_df["label"]
        )

    X_train, y_train = train_df["text"].astype(str), train_df["label"]
    X_valid, y_valid = valid_df["text"].astype(str), valid_df["label"]
    X_test, y_test = test_df["text"].astype(str), test_df["label"]

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=cfg["tfidf"]["max_features"],
            ngram_range=tuple(cfg["tfidf"]["ngram_range"])
        )),
        ("clf", LogisticRegression(
            C=cfg["model"]["C"],
            max_iter=cfg["model"]["max_iter"]
        ))
    ])

    with mlflow.start_run():
        # log params
        mlflow.log_params({
            "dataset": cfg["dataset"],
            "tfidf_max_features": cfg["tfidf"]["max_features"],
            "tfidf_ngram_range": str(cfg["tfidf"]["ngram_range"]),
            "model": cfg["model"]["type"],
            "C": cfg["model"]["C"],
            "max_iter": cfg["model"]["max_iter"],
            "random_state": cfg["random_state"]
        })

        pipe.fit(X_train, y_train)
        y_pred_valid = pipe.predict(X_valid)
        y_pred_test = pipe.predict(X_test)

        # metrics
        acc_valid = accuracy_score(y_valid, y_pred_valid)
        f1_valid = f1_score(y_valid, y_pred_valid, average=cfg["metrics"]["average"])
        acc_test = accuracy_score(y_test, y_pred_test)
        f1_test = f1_score(y_test, y_pred_test, average=cfg["metrics"]["average"])
        mlflow.log_metrics({
            "valid_accuracy": acc_valid,
            "valid_f1_macro": f1_valid,
            "test_accuracy": acc_test,
            "test_f1_macro": f1_test
        })

        # save confusion matrix
        os.makedirs("reports", exist_ok=True)
        fig = ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test).figure_
        fig.savefig("reports/confusion_matrix.png", dpi=180, bbox_inches="tight")
        mlflow.log_artifact("reports/confusion_matrix.png")

        # save text report
        report = classification_report(y_test, y_pred_test)
        with open("reports/classification_report.txt", "w") as f:
            f.write(report)
        mlflow.log_artifact("reports/classification_report.txt")

        # save model
        mlflow.sklearn.log_model(pipe, artifact_path="model")

        print("Validation -> acc:", acc_valid, "f1_macro:", f1_valid)
        print("Test       -> acc:", acc_test, "f1_macro:", f1_test)
        print("\nClassification report saved at reports/classification_report.txt")

if __name__ == "__main__":
    run_baseline()
```
5.5 (Optional) `src/serve.py`
```python
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc
import glob, os

app = FastAPI(title="T0 Baseline Inference")

class InferRequest(BaseModel):
    text: str

def _latest_model_path():
    # look for the newest model saved by MLflow locally (newest by modification time)
    candidates = sorted(glob.glob("mlruns/*/*/artifacts/model"), key=os.path.getmtime)
    if not candidates:
        raise RuntimeError("No model artifacts found. Run baseline first.")
    return candidates[-1]

@app.post("/infer")
def infer(payload: InferRequest):
    model = mlflow.pyfunc.load_model(_latest_model_path())
    pred = model.predict([payload.text])
    return {"label": int(pred[0])}
```
- Run it: EDA → Baseline → MLflow

6.1 EDA (quick checks)

Windows (PowerShell):

```powershell
$env:MLFLOW_TRACKING_URI = ".\mlruns"
$env:DATA_CACHE_DIR = ".\.hf_cache"
python -m src.eda
```

macOS (zsh):

```bash
export MLFLOW_TRACKING_URI=./mlruns
export DATA_CACHE_DIR=./.hf_cache
python -m src.eda
```

Run these from `tutorials/setup_template` so that the `src` package and `configs/baseline.yaml` resolve; `python -m src.eda` (rather than `python src/eda.py`) keeps the `from src...` imports working.
You’ll see dataset stats printed and an image saved to reports/eda_token_lengths.png
6.2 Train the baseline + log to MLflow
Windows (PowerShell):

```powershell
python -m src.baseline
mlflow ui --backend-store-uri ".\mlruns" --host 127.0.0.1 --port 5000
```

macOS (zsh):

```bash
python -m src.baseline
mlflow ui --backend-store-uri ./mlruns --host 127.0.0.1 --port 5000
```
Open http://127.0.0.1:5000/ and you’ll see your run, parameters, metrics, and artifacts (confusion_matrix.png, classification_report.txt, model). Expectations: TF-IDF + Logistic Regression is a strong baseline on IMDB (accuracy often around 0.85–0.90). AG News is a 4-class task, so its numbers aren’t directly comparable to a binary task; compare runs on macro-F1 and check the confusion matrix.
- Optional: Run a tiny API for inference

Start the server (from `tutorials/setup_template`):

```bash
# macOS (zsh); on Windows, run the same command in PowerShell
uvicorn src.serve:app --host 127.0.0.1 --port 8000 --reload
```
Test it (one example):

macOS (curl):

```bash
curl -X POST http://127.0.0.1:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"text":"A surprisingly heartfelt and funny movie."}'
```

Windows (PowerShell):

```powershell
$body = @{ text = "A surprisingly heartfelt and funny movie." } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/infer -ContentType "application/json" -Body $body
```
- Switch dataset and re-run (practice)
Change `configs/baseline.yaml`:

```yaml
dataset: "ag_news"
```
Then re-run steps 6.1 and 6.2. Compare metrics in MLflow. This teaches that different tasks/datasets change difficulty and results.
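If you prefer comparing runs in code rather than in the UI, here is a minimal sketch using `mlflow.search_runs`; it assumes both runs were logged under the experiment name from `configs/baseline.yaml` and that the tracking URI points at the local `./mlruns` store.

```python
# Sketch: pull logged runs into a pandas DataFrame and compare key metrics per dataset.
# Assumes the local ./mlruns store and the experiment name from baseline.yaml.
import mlflow

mlflow.set_tracking_uri("./mlruns")
runs = mlflow.search_runs(experiment_names=["t0_setup_template"])
cols = ["params.dataset", "metrics.valid_f1_macro", "metrics.test_f1_macro", "metrics.test_accuracy"]
print(runs[cols].sort_values("params.dataset"))
```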
Tiny glossary (for absolute beginners)
- Token: a piece of text, usually a word.
- TF-IDF: a way to turn text into numbers by counting words and down-weighting common ones (see the tiny sketch after this glossary).
- Logistic Regression: a simple, reliable classifier.
- Train / Validation / Test: train the model, tune it on validation, and report final scores on test.
- Accuracy: how often predictions are correct.
- Macro-F1: balances precision/recall across classes; good when classes are uneven.
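To make the TF-IDF and Macro-F1 entries concrete, here is a tiny sketch on made-up sentences; the numbers themselves don’t matter, only the shapes and the calls involved.

```python
# Toy illustration of two glossary terms; the sentences and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

docs = ["great fun movie", "boring slow movie", "great cast great story"]
X = TfidfVectorizer().fit_transform(docs)          # one row per doc, one column per distinct word
print(X.shape)

y_true, y_pred = [1, 0, 1], [1, 0, 0]
print(f1_score(y_true, y_pred, average="macro"))   # F1 per class, averaged with equal weight
```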
Common Pitfalls & Troubleshooting
- Install fails on Mac M-series: run `python -m pip install --upgrade pip wheel setuptools` and try again.
- spaCy model error: run `python -m spacy download en_core_web_sm`.
- MLflow UI empty: make sure you ran the baseline (step 6.2) before opening the UI.
- No model found for the API: run the baseline once to create a model artifact.
Additional Resources
- TF-IDF (scikit-learn)
- Logistic Regression (scikit-learn)
- Datasets (Hugging Face)
- MLflow Tracking
- FastAPI Tutorial
Walkthrough Summary
- Environment setup: Conda/Poetry, `.env`, fixed seeds.
- Dataset load: download IMDB/AG News, verify splits, save the schema in `data/README.md`.
- EDA: class balance, token length distributions.
- Baseline sanity: TF-IDF + Logistic Regression, log a metrics table.
- Experiment governance: config YAML for hyperparameters, metrics logging to MLflow.
- Reporting: generate `reports/setup_template.md` from the notebook with nbconvert (see the sketch after this list).
- (Optional) Serve: demo FastAPI endpoint for inference with schema validation.
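For the reporting item, one possible sketch uses nbconvert’s Python API. Both the nbconvert dependency and the notebook name `notebooks/setup_template.ipynb` are assumptions here, not part of the requirements or layout above.

```python
# Sketch: render the walkthrough notebook to reports/setup_template.md.
# Assumes nbconvert is installed (e.g., `pip install nbconvert`) and the notebook exists.
import os
from nbconvert import MarkdownExporter

body, _resources = MarkdownExporter().from_filename("notebooks/setup_template.ipynb")
os.makedirs("reports", exist_ok=True)
with open("reports/setup_template.md", "w") as f:
    f.write(body)
```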
Hands-On Exercises
- Try both datasets (IMDB vs AG News) and compare reproducibility logs.
- Add noise (duplicates, shuffle seeds) to test determinism.
- Run ablations: turn off seed fixing, compare reproducibility (see the sketch after this list).
- Stretch: connect MLflow run metadata to Weights & Biases.
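A minimal sketch of the seed ablation, run from `tutorials/setup_template`; it reuses the helpers above, and only the train/validation split is randomized here, so expect small drifts rather than large ones.

```python
# Sketch: train twice with and without a fixed seed and compare validation macro-F1.
# Assumes src/utils.py and src/data.py from this tutorial; run from tutorials/setup_template.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from src.utils import set_all_seeds
from src.data import load_text_classification

def one_run(seed=None):
    if seed is not None:
        set_all_seeds(seed)
    train_df, _, _ = load_text_classification("imdb")
    # random_state=None makes the train/validation split itself non-deterministic
    tr, va = train_test_split(train_df, test_size=0.2, random_state=seed, stratify=train_df["label"])
    pipe = Pipeline([("tfidf", TfidfVectorizer(max_features=30000)),
                     ("clf", LogisticRegression(max_iter=200))])
    pipe.fit(tr["text"].astype(str), tr["label"])
    return f1_score(va["label"], pipe.predict(va["text"].astype(str)), average="macro")

print("fixed seed:", one_run(42), one_run(42))      # identical by construction
print("no seed   :", one_run(None), one_run(None))  # may differ between calls
```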
Common Pitfalls & Troubleshooting
- Forgetting to set seeds → non-reproducible results.
- Data leakage from overlapping splits.
- Unpinned dependencies breaking reproducibility.
- Missing `.env` leads to secret/path issues.
- CI not running → unchecked notebook failures.
Best Practices
- Always log commit hash, dataset version, config (see the sketch after this list).
- PR checklist: metrics ≥ baseline, README updated, tests green.
- Write unit tests for tokenization, vectorization, and schema validation.
- Keep seed-fixing utilities in `src/utils.py`.
- Separate experiments (configs/notebooks) from reporting.
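A small sketch for the first bullet (logging commit hash, dataset version, and config). It assumes a git checkout and is meant to be called inside the `mlflow.start_run()` block in `src/baseline.py`; the “dataset version” logged here is the `datasets` library version, which is a simplification.

```python
# Sketch: attach provenance tags and the exact config to the active MLflow run.
import subprocess
import datasets
import mlflow

def log_provenance(cfg_path="configs/baseline.yaml"):
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    mlflow.set_tags({
        "git_commit": commit,
        "datasets_version": datasets.__version__,  # library version; log the Hub revision if you need more
    })
    mlflow.log_artifact(cfg_path)  # snapshot the config used for this run
```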
Reflection & Discussion Prompts
- Why does reproducibility matter in civic-tech / applied NLP projects?
- What’s the tradeoff between fast iteration and strict reproducibility?
- How might governance differ in regulated vs open-data contexts?
Next Steps / Advanced Extensions
- Automate report generation in CI.
- Introduce containerized reproducibility (Docker).
- Connect experiment tracking with deployment logs.
- Move from IMDB/AG News to a civic dataset (e.g., 311 complaints).
Glossary / Key Terms
- Reproducibility: the ability to re-run an experiment and get identical results.
- Data leakage: unintended information in train/test overlap.
- Seed fixing: controlling randomness across frameworks.
- Governance: tracking configs, metrics, and artifacts.
Additional Resources
- [scikit-learn docs](https://scikit-learn.org/stable/)
- [spaCy](https://spacy.io/)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [Hugging Face Datasets](https://huggingface.co/datasets)
- [MLflow](https://mlflow.org/)
- [FastAPI](https://fastapi.tiangolo.com/)
Contributors
Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset licenses: IMDB (Maas et al., ACL 2011; released for research use) and AG News (AG corpus; research/non-commercial use); confirm current terms on each dataset card.
Issues Referenced
Epic: HfLA Text Analysis Tutorials (T0–T14). This sub-issue: T0 Setup & Template Walkthrough.
Notes: I chose IMDB (small binary classification) and AG News (medium 4-class classification) because they are light enough for setup/debug, yet distinct in size and task complexity. Both test the scaffolding under different load conditions. For governance, I leaned on MLflow for run-tracking (simpler than W&B but extensible). The FastAPI step is optional but sets the stage for later deployment tutorials.
- Progress: I have updated this sub-issue.
- Blockers: No blockers, but I need time to execute this project entirely. Also, I would appreciate feedback on the content.
- Availability: Available next week.
- ETA: Assuming it will take 2 weeks.
- Pictures (if necessary):