
Job trends in software development by term frequency

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Analyze six-month shifts in software development job postings by comparing March 2025 (six months ago) to the most recent 30 days, quantifying what skills/technologies are rising, falling, newly in demand, or disappearing. Outputs include frequency deltas, trend lines, and a concise narrative report that’s reproducible and portfolio-ready.

The assignee for this issue will have to wear three hats:

  1. A Data Scientist - Responsible for posing and framing the question(s) and adapting the inquiry according to the evidence discovered. Pose an initial hypothesis in a report based on the initial motivations and investigations, revise it as insights come in, and create the Final Report for peer review.
  2. A Data Engineer - Responsible for devising a strategy to scrape old postings via an API, then implementing the scraping and cleaning, prepping, and shaping the data for visualization by the Data Analyst.
  3. A Data Analyst - Responsible for testing the Data Scientist's hypotheses and reporting back to them with insights from the shaped data, generating interim reports along the way.

Action Items

The beginning (research + design)

  • Define cohort & windows: titles containing software, developer, engineer, frontend, backend, full-stack, mobile, devops, data engineer, ml engineer; past window = 2025-03-01→2025-03-31, current window = last 30 days.
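
A minimal sketch of the cohort filter, assuming a case-insensitive regex over job titles is sufficient; the alternatives below simply mirror the title list above and can be tuned:

```python
import re

# Cohort pattern built from the titles listed in this issue (hyphen/space
# variants of frontend/backend/full-stack included for safety).
COHORT = re.compile(
    r"\b(software|developer|engineer|front[- ]?end|back[- ]?end|"
    r"full[- ]?stack|mobile|devops|data engineer|ml engineer)\b",
    re.IGNORECASE,
)

def in_cohort(title: str) -> bool:
    """Return True if a job title belongs to the software-development cohort."""
    return bool(COHORT.search(title))
```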

  • Select sources:

    • Federal: USAJOBS HistoricJoa + AnnouncementText (full text, closed postings).
    • Private sector (optional v1.1): archived postings from Greenhouse/Lever via Wayback snapshots.
  • Draft skills/tech lexicon with preferred mappings (e.g., whitelist→allowlist, primary/replica vs master/slave; frameworks: React/Vue/Svelte; platforms: Kubernetes, Terraform; languages: Rust/Go/TypeScript; AI terms: LLM, RAG, vector DB).

  • Methods spec: text extraction & normalization, n-gram and phrase detection, log-odds with informative Dirichlet prior for rise/fall, monthly slope estimation, multiple-testing control (BH-FDR).
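
The rise/fall step could follow the weighted log-odds of Monroe, Colaresi & Quinn (2008); a pure-Python sketch is below. The fallback prior of 0.01 for terms missing from the prior dict is an assumption — a real run would derive the informative prior from a background corpus:

```python
import math

def log_odds_dirichlet(counts_a: dict, counts_b: dict, prior: dict) -> dict:
    """z-scored log-odds ratio with an informative Dirichlet prior.
    Positive z => term is relatively more frequent in corpus A."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = sum(prior.values())
    scores = {}
    for w in set(counts_a) | set(counts_b):
        ya, yb = counts_a.get(w, 0), counts_b.get(w, 0)
        aw = prior.get(w, 0.01)  # assumed fallback pseudo-count
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)  # approximate variance
        scores[w] = delta / math.sqrt(var)
    return scores
```

The resulting z-scores are what BH-FDR would then be applied to.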

  • Ethics & QA: de-duplicate pages, ignore code blocks, document gaps (unarchived pages), and record rate-limit handling.

Once researched and ready (implementation steps)

  1. Ingest

    • Pull metadata with GET /api/HistoricJoa for both windows; join long fields with GET /api/HistoricJoa/AnnouncementText (key: USAJOBSControlNumber). ([developer.usajobs.gov]1)
    • (Optional) Enumerate private-sector snapshots via Wayback CDX; select nearest capture to target dates. ([Internet Archive]2)
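
The metadata/long-text join might look like the sketch below; the JSON field names (`USAJOBSControlNumber`, `announcementText`) follow the issue text but their exact casing in the API response is an assumption to verify against the HistoricJoa docs:

```python
def join_announcement_text(meta_rows: list, text_rows: list,
                           key: str = "USAJOBSControlNumber") -> list:
    """Left-join HistoricJoa metadata rows with AnnouncementText rows on the
    control-number key. Rows with no matching text get an empty string."""
    texts = {row[key]: row.get("announcementText", "") for row in text_rows}
    return [dict(row, announcementText=texts.get(row[key], ""))
            for row in meta_rows]
```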
  2. Extract & Normalize

    • HTML→main text via trafilatura (fallback: Readability/jusText), lowercase, strip boilerplate & code blocks, Unicode normalize. ([trafilatura.readthedocs.io]3)
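
The normalization half of this step (after trafilatura has reduced HTML to main text) can be done with the standard library alone; stripping fenced/`<pre>` code blocks before counting is a sketch of the "ignore code blocks" rule:

```python
import re
import unicodedata

# Matches Markdown fenced blocks and <pre> blocks left over after extraction.
CODE_BLOCK = re.compile(r"```.*?```|<pre>.*?</pre>", re.DOTALL)

def normalize(text: str) -> str:
    """Drop code blocks, apply Unicode NFKC normalization, lowercase,
    and collapse runs of whitespace."""
    text = CODE_BLOCK.sub(" ", text)
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()
```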
  3. Feature Build

    • Tokenize; build unigrams/bigrams; apply skill lexicon + regex for variants (e.g., Type-Script|TypeScript); map deprecated→preferred terms.
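
A stdlib sketch of the feature build, operating on already-normalized (lowercased) text; the variant map entries are illustrative examples, and the full lexicon would live in config:

```python
import re
from collections import Counter

# Illustrative variant map: regex -> canonical lexicon term.
VARIANTS = [
    (re.compile(r"\btype[- ]?script\b"), "typescript"),
    (re.compile(r"\bk8s\b"), "kubernetes"),
]

def features(text: str) -> Counter:
    """Map known variants to canonical terms, then count unigrams + bigrams."""
    for pat, canon in VARIANTS:
        text = pat.sub(canon, text)
    toks = re.findall(r"[a-z0-9+#.\-]+", text)  # keeps c++, c#, .net tokens
    grams = toks + [" ".join(p) for p in zip(toks, toks[1:])]
    return Counter(grams)
```

scikit-learn's `CountVectorizer(ngram_range=(1, 2))` would be the production-grade equivalent.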
  4. Analysis

    • Compute term frequencies, ΔTF-IDF, and log-odds rise/fall between windows; estimate monthly slope on multi-month series if expanded.
    • Tag new (present now, absent before) and gone (present before, absent now) terms; flag ambiguous terms for manual review.
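
The new/gone tagging is simple set arithmetic; the `min_count` threshold is an assumed noise filter so one-off strings do not register as trends:

```python
def tag_terms(past_counts: dict, current_counts: dict, min_count: int = 3) -> dict:
    """Tag terms as 'new' (present now, absent before) or 'gone'
    (present before, absent now), ignoring terms below min_count."""
    past = {w for w, c in past_counts.items() if c >= min_count}
    curr = {w for w, c in current_counts.items() if c >= min_count}
    return {"new": sorted(curr - past), "gone": sorted(past - curr)}
```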
  5. Deliverables

    • CSVs: term_stats_past.csv, term_stats_current.csv, trend_report.csv.
    • Dashboard (Streamlit or Altair/Vega-Lite): top rising/falling skills, new vs gone, per-title breakdown, sample snippets.
    • Methods README with caveats, rate-limit notes, and reproducibility (fixed seeds).
  6. Testing & Ops

    • Add unit tests for parsing & counting; retries/backoff; de-duplication via URL+digest; logs with timings and error contexts.
    • Schedule optional monthly refresh via GitHub Actions (artifact publish).
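
The retries/backoff requirement could be a small decorator like this sketch (exponential backoff; a production version would also honor `Retry-After` headers and log each attempt):

```python
import time
from functools import wraps

def with_retries(max_attempts: int = 4, base_delay: float = 1.0):
    """Retry a flaky callable with exponential backoff; re-raise on the
    final attempt so failures still surface in logs."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return deco
```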

Resources/Instructions

  • USAJOBS APIs (free, historical)

  • Internet Archive / Wayback (for expired private-sector postings)

    • CDX Server API reference & help pages. ([Internet Archive]2)
    • Python clients: waybackpy or edgi/wayback. ([PyPI]5)
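
Enumerating snapshots can also be done against the CDX Server API directly; a sketch of the query builder is below (the example Greenhouse URL pattern is hypothetical):

```python
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url: str, start: str, end: str, limit: int = 50) -> str:
    """Build a Wayback CDX query for captures of `url` between two
    YYYYMMDD timestamps, JSON output, de-duplicated by content digest."""
    params = {"url": url, "from": start, "to": end,
              "output": "json", "collapse": "digest", "limit": limit}
    return CDX + "?" + urlencode(params)

# e.g. cdx_query("boards.greenhouse.io/example/*", "20250301", "20250331")
```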
  • Extraction

  • Schema hints (when scraping non-federal sites)

    • schema.org/JobPosting and Google’s Job Posting structured data guide. ([Schema.org]6)
  • Suggested stack (choose 1 of each pair)

    • HTTP: httpx or requests
    • Parsing: trafilatura or readability-lxml
    • NLP/Features: scikit-learn or textacy
    • Stats: statsmodels or scipy
    • Viz/Dashboard: Altair/Vega-Lite or Streamlit
    • Archival: waybackpy or edgi/wayback
  • Deliverable spec

    • Repo with fetch/, extract/, analyze/, report/ modules; config.yaml for windows & filters; CI workflow to run end-to-end and publish CSVs + HTML report.
  • If this issue requires access to 311 data, please answer:

    • Not applicable. This project uses USAJOBS public APIs and public web archives only.

chinaexpert1 · Sep 11 '25 23:09