
Text Analysis Tutorial: Setup & Template Walkthrough

chinaexpert1 opened this issue 4 months ago · 1 comment

Got it — we’ll start with T0: Setup & Template Walkthrough and instantiate it into the tutorial sub-issue format. I’ll fill in the details and add a Notes field at the end for context and possible improvements. Here’s the draft:


Title & Overview

Template: Setup & Template Walkthrough: An Intermediate, End-to-End Analysis Tutorial

Overview (≤2 sentences): This tutorial introduces learners to a standardized workflow for NLP experiments, covering environment setup, data hygiene, reproducibility, and report generation. It is intermediate because it emphasizes rigorous experiment scaffolding and governance rather than just “getting a model to run.”

Purpose

The value-add is building defensible, reproducible baselines while setting up a robust project structure. Learners move beyond single notebooks toward versioned experiments, config management, error analysis scaffolding, and light reporting/serving foundations.

Prerequisites

  • Skills: Python, Git, virtual envs; basic NumPy/pandas; ML basics (train/val/test, overfitting, regularization).
  • NLP: tokenization (wordpiece/BPE), embeddings vs TF-IDF, evaluation metrics (accuracy, macro-F1).
  • Tooling: scikit-learn or gensim; spaCy or NLTK; Hugging Face Transformers or Haystack.

Setup Instructions

  • Environment: Conda/Poetry (Python 3.11), deterministic seeds, .env (for secrets/paths, e.g., MLFLOW_TRACKING_URI, DATA_CACHE_DIR)
  • Install: pandas, scikit-learn, spaCy, Hugging Face Transformers, Datasets, MLflow, FastAPI, Uvicorn.
  • Dataset: use IMDB (small, binary) and AG News (medium, 4-class) classification datasets from the HF Datasets catalog. Both have permissive licenses and ship with train/test splits; when a validation split is missing, one is carved out of train (handled in src/baseline.py).
  • Repo layout:
tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt

Core Concepts

  • Determinism in ML experiments: seeds, config files, pinned deps.
  • Reproducibility: track dataset versions, metrics, and commits.
  • Data hygiene: leakage prevention, split integrity, license notes.
  • Governance: documenting metrics tables, configs, and error analysis.
  • Guardrails: schema validation, simple checks before training or serving.
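
To make the guardrail bullet concrete, here is a minimal sketch (not one of the tutorial's src/ files) that validates configs/baseline.yaml against a pydantic schema before training starts; the BaselineConfig classes are illustrative assumptions that mirror the YAML shown later in the walkthrough.

# Guardrail sketch: fail fast on a malformed config instead of mid-run.
# Illustrative assumption, not part of src/; mirrors configs/baseline.yaml.
import yaml
from pydantic import BaseModel, Field

class TfidfConfig(BaseModel):
    max_features: int = Field(gt=0)
    ngram_range: tuple[int, int]

class ModelConfig(BaseModel):
    type: str
    C: float = Field(gt=0)
    max_iter: int = Field(gt=0)

class BaselineConfig(BaseModel):
    experiment_name: str
    dataset: str
    test_size: float = Field(gt=0, lt=1)
    random_state: int
    tfidf: TfidfConfig
    model: ModelConfig

# Raises pydantic.ValidationError if, e.g., max_features is "30k" or test_size is 1.5
cfg = BaselineConfig(**yaml.safe_load(open("configs/baseline.yaml")))
print(cfg.dataset, cfg.tfidf.max_features)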

Step-by-Step Walkthrough

What you’ll build: a tiny app that reads text (movie reviews/news), turns it into numbers (TF-IDF), trains a simple classifier (Logistic Regression), and tracks results with MLflow. Optional: a tiny FastAPI endpoint to get predictions.

  1. Make the project folder

Windows (PowerShell):

# Create folders
New-Item -ItemType Directory -Force -Path tutorials\setup_template\notebooks | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\src | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\configs | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\reports | Out-Null
New-Item -ItemType Directory -Force -Path tutorials\setup_template\data | Out-Null

# Create empty files we’ll fill next
New-Item tutorials\setup_template\.env.example -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\requirements.txt -ItemType File -Force | Out-Null
New-Item tutorials\setup_template\configs\baseline.yaml -ItemType File -Force | Out-Null

macOS (Apple Silicon, zsh):

mkdir -p tutorials/setup_template/{notebooks,src,configs,reports,data}
touch tutorials/setup_template/.env.example \
      tutorials/setup_template/requirements.txt \
      tutorials/setup_template/configs/baseline.yaml

Project layout (for reference):

tutorials/setup_template/
├─ notebooks/
├─ src/
│  ├─ utils.py
│  ├─ data.py
│  ├─ eda.py
│  ├─ baseline.py
│  └─ serve.py
├─ configs/
│  └─ baseline.yaml
├─ reports/
├─ data/README.md
├─ .env.example
└─ requirements.txt
  2. Create and activate the Python environment (Python 3.11)

Windows (PowerShell):

conda create -n nlp311 python=3.11 -y
conda activate nlp311

macOS (zsh):

conda create -n nlp311 python=3.11 -y
conda activate nlp311
# Optional: if builds fail on Apple Silicon
python -m pip install --upgrade pip wheel setuptools
  3. Install the packages

Open tutorials/setup_template/requirements.txt and paste:

pandas==2.2.2
scikit-learn==1.5.2
spacy==3.7.6
matplotlib==3.9.2
datasets==3.0.1
transformers==4.44.2
mlflow==2.16.2
python-dotenv==1.0.1
pyyaml==6.0.2
pydantic==2.9.2
fastapi==0.115.0
uvicorn==0.30.6

Then install:

Windows:

cd tutorials\setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm

macOS:

cd tutorials/setup_template
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
python -m spacy download en_core_web_sm
  4. Add config and environment variables

.env.example (then copy to .env):

MLFLOW_TRACKING_URI=./mlruns
DATA_CACHE_DIR=./.hf_cache

Copy example to real file:

Windows:

Copy-Item .env.example .env -Force

macOS:

cp .env.example .env

configs/baseline.yaml (paste the following):

experiment_name: "t0_setup_template"
dataset: "imdb"         # options: imdb, ag_news
test_size: 0.2
random_state: 42
tfidf:
  max_features: 30000
  ngram_range: [1, 2]
model:
  type: "logreg"
  C: 2.0
  max_iter: 200
metrics:
  average: "macro"
  5. Add the code files

Create and paste the code below into the files inside src/.

5.1 src/utils.py

import os, random, numpy as np

def set_all_seeds(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except Exception:
        # torch is optional; ignore if not installed
        pass

def get_env(name: str, default: str = "") -> str:
    from dotenv import load_dotenv
    load_dotenv()
    return os.getenv(name, default)

5.2 src/data.py


from datasets import load_dataset
import pandas as pd
from collections import Counter

def load_text_classification(name, cache_dir=None):
    """
    Loads a Hugging Face dataset and returns 3 DataFrames:
    train_df, valid_df (or None), test_df with columns: text, label
    """
    ds = load_dataset(name, cache_dir=cache_dir)
    train_df = pd.DataFrame(ds["train"])
    test_df  = pd.DataFrame(ds["test"])
    valid_df = pd.DataFrame(ds["validation"]) if "validation" in ds else None
    return train_df, valid_df, test_df

def describe_dataset(df, text_col="text", label_col="label"):
    lengths = df[text_col].astype(str).str.split().map(len)
    counts  = Counter(df[label_col])
    return {
        "rows": len(df),
        "avg_tokens": float(lengths.mean()),
        "median_tokens": float(lengths.median()),
        "label_counts": dict(counts),
    }

5.3 src/eda.py


# Quick, beginner-friendly EDA that saves pictures into reports/
import os, yaml
import matplotlib.pyplot as plt
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification, describe_dataset

def main(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))
    cache = get_env("DATA_CACHE_DIR", "./.hf_cache")

    train_df, valid_df, test_df = load_text_classification(cfg["dataset"], cache_dir=cache)

    # 1) Print simple stats
    print("TRAIN:", describe_dataset(train_df))
    if valid_df is not None:
        print("VALID:", describe_dataset(valid_df))
    print("TEST :", describe_dataset(test_df))

    # 2) Plot token length histogram (train)
    lengths = train_df["text"].astype(str).str.split().map(len)
    plt.figure()
    lengths.hist(bins=50)
    plt.xlabel("Tokens per example"); plt.ylabel("Count"); plt.title("Token Lengths (train)")
    os.makedirs("reports", exist_ok=True)
    plt.savefig("reports/eda_token_lengths.png", dpi=160, bbox_inches="tight")
    print("Saved: reports/eda_token_lengths.png")

if __name__ == "__main__":
    main()

5.4 src/baseline.py


import os, yaml, mlflow, mlflow.sklearn
from src.utils import set_all_seeds, get_env
from src.data import load_text_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def run_baseline(cfg_path="configs/baseline.yaml"):
    set_all_seeds(42)
    cfg = yaml.safe_load(open(cfg_path))

    mlflow.set_tracking_uri(get_env("MLFLOW_TRACKING_URI", "./mlruns"))
    mlflow.set_experiment(cfg["experiment_name"])

    train_df, valid_df, test_df = load_text_classification(
        cfg["dataset"], cache_dir=get_env("DATA_CACHE_DIR", "./.hf_cache")
    )

    if valid_df is None:
        train_df, valid_df = train_test_split(
            train_df, test_size=cfg["test_size"], random_state=cfg["random_state"], stratify=train_df["label"]
        )

    X_train, y_train = train_df["text"].astype(str), train_df["label"]
    X_valid, y_valid = valid_df["text"].astype(str), valid_df["label"]
    X_test,  y_test  = test_df["text"].astype(str),  test_df["label"]

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=cfg["tfidf"]["max_features"],
            ngram_range=tuple(cfg["tfidf"]["ngram_range"])
        )),
        ("clf", LogisticRegression(
            C=cfg["model"]["C"],
            max_iter=cfg["model"]["max_iter"]
        ))
    ])

    with mlflow.start_run():
        # log params
        mlflow.log_params({
            "dataset": cfg["dataset"],
            "tfidf_max_features": cfg["tfidf"]["max_features"],
            "tfidf_ngram_range": str(cfg["tfidf"]["ngram_range"]),
            "model": cfg["model"]["type"],
            "C": cfg["model"]["C"],
            "max_iter": cfg["model"]["max_iter"],
            "random_state": cfg["random_state"]
        })

        pipe.fit(X_train, y_train)
        y_pred_valid = pipe.predict(X_valid)
        y_pred_test  = pipe.predict(X_test)

        # metrics
        acc_valid = accuracy_score(y_valid, y_pred_valid)
        f1_valid  = f1_score(y_valid, y_pred_valid, average=cfg["metrics"]["average"])
        acc_test  = accuracy_score(y_test, y_pred_test)
        f1_test   = f1_score(y_test, y_pred_test, average=cfg["metrics"]["average"])

        mlflow.log_metrics({
            "valid_accuracy": acc_valid,
            "valid_f1_macro": f1_valid,
            "test_accuracy": acc_test,
            "test_f1_macro": f1_test
        })

        # save confusion matrix
        os.makedirs("reports", exist_ok=True)
        fig = ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test).figure_
        fig.savefig("reports/confusion_matrix.png", dpi=180, bbox_inches="tight")
        mlflow.log_artifact("reports/confusion_matrix.png")

        # save text report
        report = classification_report(y_test, y_pred_test)
        with open("reports/classification_report.txt", "w") as f:
            f.write(report)
        mlflow.log_artifact("reports/classification_report.txt")

        # save model
        mlflow.sklearn.log_model(pipe, artifact_path="model")

        print("Validation -> acc:", acc_valid, "f1_macro:", f1_valid)
        print("Test       -> acc:", acc_test,  "f1_macro:", f1_test)
        print("\nClassification report saved at reports/classification_report.txt")

if __name__ == "__main__":
    run_baseline()

5.5 (Optional) src/serve.py


from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc, glob, os

app = FastAPI(title="T0 Baseline Inference")

class InferRequest(BaseModel):
    text: str

def _latest_model_path():
    # pick the most recently written model artifact saved by MLflow locally
    candidates = sorted(glob.glob("mlruns/*/*/artifacts/model"), key=os.path.getmtime)
    if not candidates:
        raise RuntimeError("No model artifacts found. Run baseline first.")
    return candidates[-1]

@app.post("/infer")
def infer(payload: InferRequest):
    model = mlflow.pyfunc.load_model(_latest_model_path())
    pred = model.predict([payload.text])
    return {"label": int(pred[0])}

  6. Run it: EDA → Baseline → MLflow

6.1 EDA (quick checks)

Windows:

$env:MLFLOW_TRACKING_URI = ".\mlruns"
$env:DATA_CACHE_DIR = ".\.hf_cache"
python .\src\eda.py

macOS:

export MLFLOW_TRACKING_URI=./mlruns
export DATA_CACHE_DIR=./.hf_cache
python ./src/eda.py

You’ll see dataset stats printed and an image saved to reports/eda_token_lengths.png.

6.2 Train the baseline + log to MLflow

Windows:

python .\src\baseline.py
mlflow ui --backend-store-uri ".\mlruns" --host 127.0.0.1 --port 5000

macOS:

python ./src/baseline.py
mlflow ui --backend-store-uri ./mlruns --host 127.0.0.1 --port 5000

Open http://127.0.0.1:5000/ to see your run, parameters, metrics, and artifacts (confusion_matrix.png, classification_report.txt, model). Expectations: IMDB usually reaches solid accuracy with TF-IDF + Logistic Regression (often ~0.85–0.9); AG News will likely score lower because it is a 4-class task.

  7. Optional: Run a tiny API for inference

Start the server:
# macOS (zsh); on Windows, replace slashes with backslashes and use PowerShell
uvicorn src.serve:app --host 127.0.0.1 --port 8000 --reload

Test it (one example): macOS (curl):

curl -X POST http://127.0.0.1:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"text":"A surprisingly heartfelt and funny movie."}'

Windows (PowerShell):

$body = @{ text = "A surprisingly heartfelt and funny movie." } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/infer -ContentType "application/json" -Body $body
  8. Switch dataset and re-run (practice)

Change configs/baseline.yaml:
dataset: "ag_news"

Then re-run steps 6.1 and 6.2 and compare metrics in MLflow. This shows how task and dataset choice affect difficulty and results.
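
If you prefer comparing runs in code rather than in the UI, a small sketch like the following works against the local ./mlruns store; it assumes src/baseline.py has been run at least once per dataset, and the column names follow MLflow's params./metrics. prefixes for the values logged in src/baseline.py.

# Compare the IMDB and AG News runs without opening the MLflow UI.
import mlflow

mlflow.set_tracking_uri("./mlruns")
runs = mlflow.search_runs(experiment_names=["t0_setup_template"])

cols = ["params.dataset", "metrics.valid_f1_macro", "metrics.test_f1_macro"]
print(runs[cols].sort_values("metrics.test_f1_macro", ascending=False))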

Tiny glossary (for absolute beginners)

  • Token: a piece of text, usually a word.
  • TF-IDF: a way to turn text into numbers by counting words and down-weighting common ones.
  • Logistic Regression: a simple, reliable classifier.
  • Train / Validation / Test: train the model, tune it on validation, and report final scores on test.
  • Accuracy: how often predictions are correct.
  • Macro-F1: balances precision/recall across classes; good when classes are uneven.
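
As a quick illustration of the last two definitions, the toy labels below (chosen only for illustration) show how accuracy can look fine while macro-F1 exposes a class the model never predicts:

# Toy illustration: accuracy looks fine, macro-F1 exposes the ignored class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # class 1 is rare
y_pred = [0] * 10                        # a model that only ever predicts class 0

print(accuracy_score(y_true, y_pred))             # 0.8
print(f1_score(y_true, y_pred, average="macro"))  # ~0.44 (class 1 F1 is 0)
# sklearn may warn that precision is ill-defined for class 1; that is the signal.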

Common Pitfalls & Troubleshooting

  • Install fails on Mac M-series: run python -m pip install --upgrade pip wheel setuptools and try again.
  • spaCy model error: run python -m spacy download en_core_web_sm.
  • MLflow UI empty: make sure you ran src/baseline.py before opening the UI.
  • No model found for API: run the baseline once to create a model artifact.

Walkthrough Outline

  1. Environment setup: Conda/Poetry, .env, fixed seeds.
  2. Dataset load: download IMDB/AG News, verify splits, save schema in data/README.md.
  3. EDA: class balance, token length distributions.
  4. Baseline sanity: TF-IDF + Logistic Regression, log metrics table.
  5. Experiment governance: config YAML for hyperparams, metrics logging to MLflow.
  6. Reporting: generate reports/setup_template.md from the notebook with nbconvert (see the command sketch after this list).
  7. (Optional) Serve: demo FastAPI endpoint for inference with schema validation.
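
For the reporting item (6), one possible command is the nbconvert CLI; nbconvert ships with Jupyter, which is not pinned in requirements.txt above, and the notebook filename is a placeholder for whatever you keep in notebooks/:

jupyter nbconvert --to markdown notebooks/setup_template.ipynb --output-dir reports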

Hands-On Exercises

  • Try both datasets (IMDB vs AG News) and compare reproducibility logs.
  • Add noise (duplicates, shuffle seeds) to test determinism.
  • Run ablations: turn off seed fixing, compare reproducibility (see the sketch after this list).
  • Stretch: connect MLflow run metadata to Weights & Biases.
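
For the determinism and ablation exercises, one way to check reproducibility is to run the same split-and-fit twice and compare macro-F1. This sketch is illustrative rather than part of the tutorial's baseline: the IMDB slice size and hyperparameters are assumptions chosen to keep it fast.

# Determinism check: run the same split + fit twice and compare the metric.
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Small, shuffled slice of IMDB so the check runs quickly (illustrative size).
df = load_dataset("imdb", split="train").shuffle(seed=0).select(range(2000)).to_pandas()

def one_run(seed: int) -> float:
    tr, va = train_test_split(df, test_size=0.2, random_state=seed, stratify=df["label"])
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000)),
        ("clf", LogisticRegression(max_iter=200)),
    ])
    pipe.fit(tr["text"], tr["label"])
    return f1_score(va["label"], pipe.predict(va["text"]), average="macro")

print(one_run(42) == one_run(42))  # same seed -> expect True (identical metric)
print(one_run(42), one_run(7))     # different seeds -> see how much results move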

Common Pitfalls & Troubleshooting

  • Forgetting to set seeds → non-reproducible results.
  • Data leakage from overlapping splits (a quick overlap check is sketched below).
  • Unpinned dependencies breaking reproducibility.
  • Missing .env → secrets/paths (e.g., MLFLOW_TRACKING_URI) not found at runtime.
  • CI not running → unchecked notebook failures.
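
A quick way to catch the leakage pitfall is an exact-duplicate overlap check between splits. This is only a first-pass sanity check (near-duplicates need fuzzier matching), and the helper name is illustrative.

# Cheap leakage check: exact-duplicate texts shared between train and test.
def split_overlap(train_df, test_df, text_col="text"):
    train_texts = set(train_df[text_col].astype(str))
    test_texts = set(test_df[text_col].astype(str))
    return len(train_texts & test_texts)

# Usage (hypothetical):
# n = split_overlap(train_df, test_df)
# assert n == 0, f"{n} texts appear in both train and test"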

Best Practices

  • Always log commit hash, dataset version, config (see the sketch after this list).
  • PR checklist: metrics ≥ baseline, README updated, tests green.
  • Write unit tests for tokenization, vectorization, and schema validation.
  • Keep seed-fixing utilities in src/utils.py.
  • Separate experiments (configs/notebooks) from reporting.
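
For the first best-practice item, a small addition inside the mlflow.start_run() block of src/baseline.py could look like the sketch below; it assumes the project is a Git checkout (falling back to "unknown" otherwise), and the tag name is just a suggestion.

# Sketch: tag the MLflow run with the current commit and keep the exact config.
# Intended to sit inside the mlflow.start_run() block of src/baseline.py.
import subprocess
import mlflow

try:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
except Exception:
    commit = "unknown"  # not a Git checkout, or git not installed

mlflow.set_tag("git_commit", commit)
mlflow.log_artifact("configs/baseline.yaml")  # the config used for this run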

Reflection & Discussion Prompts

  • Why does reproducibility matter in civic-tech / applied NLP projects?
  • What’s the tradeoff between fast iteration and strict reproducibility?
  • How might governance differ in regulated vs open-data contexts?

Next Steps / Advanced Extensions

  • Automate report generation in CI.
  • Introduce containerized reproducibility (Docker).
  • Connect experiment tracking with deployment logs.
  • Move from IMDB/AG News to a civic dataset (e.g., 311 complaints).

Glossary / Key Terms

  • Reproducibility: ability to re-run experiment with identical results.
  • Data leakage: unintended information in train/test overlap.
  • Seed fixing: controlling randomness across frameworks.
  • Governance: tracking configs, metrics, and artifacts.

Additional Resources

Contributors

  • Author(s): TBD
  • Reviewer(s): TBD
  • Maintainer(s): TBD
  • Date updated: 2025-09-20
  • Dataset license: IMDB (ACL), AG News (Creative Commons).

Issues Referenced

Epic: HfLA Text Analysis Tutorials (T0–T14). This sub-issue: T0 Setup & Template Walkthrough.


Notes: I chose IMDB (small binary classification) and AG News (medium 4-class classification) because they are light enough for setup/debug, yet distinct in size and task complexity. Both test the scaffolding under different load conditions. For governance, I leaned on MLflow for run-tracking (simpler than W&B but extensible). The FastAPI step is optional but sets the stage for later deployment tutorials.


chinaexpert1 · Sep 20 '25

  1. Progress: I have updated this sub-issue.
  2. Blockers: No blockers, but I need time to execute this project entirely. Also, I would appreciate feedback on the content.
  3. Availability: Available next week.
  4. ETA: Assuming it will take 2 weeks.
  5. Pictures (if necessary):

sauravchangde · Sep 26 '25