Job trends in software development by term frequency
Overview
Analyze six-month shifts in software development job postings by comparing March 2025 (six months ago) to the most recent 30 days, quantifying what skills/technologies are rising, falling, newly in demand, or disappearing. Outputs include frequency deltas, trend lines, and a concise narrative report that’s reproducible and portfolio-ready.
The assignee for this issue will have to wear three hats:
- A Data Scientist - Responsible for posing and framing the question(s) and adapting the inquiry according to the evidence discovered. States one or more initial hypotheses in a report based on initial motivations and investigations, adapts them as insights come in, and creates the Final Report for peer review.
- A Data Engineer - Responsible for devising a strategy to retrieve old postings via an API, implementing the scraping, and then cleaning, prepping, and shaping the data for visualization by the Data Analyst.
- A Data Analyst - Responsible for answering the Data Scientist's hypotheses and reporting back to them with insights from the shaped data. Generates interim reports along the way.
Action Items
The beginning (research + design)
- Define cohort & windows: titles containing software, developer, engineer, frontend, backend, full-stack, mobile, devops, data engineer, ml engineer; past window = 2025-03-01 → 2025-03-31, current window = last 30 days.
- Select sources:
  - Federal: USAJOBS HistoricJoa + AnnouncementText (full text, closed postings).
  - Private sector (optional v1.1): archived postings from Greenhouse/Lever via Wayback snapshots.
- Draft skills/tech lexicon with preferred mappings (e.g., whitelist → allowlist, primary/replica vs. master/slave; frameworks: React/Vue/Svelte; platforms: Kubernetes, Terraform; languages: Rust/Go/TypeScript; AI terms: LLM, RAG, vector DB).
- Methods spec: text extraction & normalization, n-gram and phrase detection, log-odds with informative Dirichlet prior for rise/fall, monthly slope estimation, multiple-testing control (BH-FDR).
- Ethics & QA: de-duplicate pages, ignore code blocks, document gaps (unarchived pages), and record rate-limit handling.
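For the multiple-testing control named in the methods spec, a minimal sketch of the Benjamini-Hochberg step-up procedure could look like the following. It assumes each term already has a p-value from an earlier test; the function name, the example terms, and the p-values are illustrative only.

```python
from typing import Dict


def benjamini_hochberg(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return {term: significant?} under the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Rank terms from smallest to largest p-value.
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * alpha:
            cutoff_rank = rank

    # Every term ranked at or below the cutoff is declared significant.
    return {term: rank <= cutoff_rank for rank, (term, _) in enumerate(ranked, start=1)}


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    example = {"rust": 0.001, "react": 0.02, "jquery": 0.03, "perl": 0.5}
    print(benjamini_hochberg(example))
```

If statsmodels is the chosen stats library, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` covers the same procedure; the hand-rolled version is only meant to make the rule explicit.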
If researched and ready (implementation steps)
- Ingest
  - Pull metadata with GET /api/HistoricJoa for both windows; join long fields with GET /api/HistoricJoa/AnnouncementText (key: USAJOBSControlNumber). ([developer.usajobs.gov]1)
  - (Optional) Enumerate private-sector snapshots via Wayback CDX; select nearest capture to target dates. ([Internet Archive]2)
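The two ingest operations above, joining announcement text onto metadata and picking the nearest Wayback capture, can be sketched offline as below. This is a rough illustration under assumptions: the record field name `usajobsControlNumber` is a guess at how the key appears in responses and should be checked against the API documentation, the sample rows are invented, and the CDX rows are assumed to be in the JSON output shape (a header row followed by data rows).

```python
from datetime import datetime
from typing import Dict, List, Optional


def join_announcement_text(metadata: List[dict], texts: List[dict],
                           key: str = "usajobsControlNumber") -> List[dict]:
    """Left-join long text fields onto metadata records by control number.

    The default key name is an assumption; adjust it to the real payload.
    """
    text_by_key = {t[key]: t for t in texts}
    joined = []
    for record in metadata:
        merged = dict(record)
        merged.update(text_by_key.get(record[key], {}))
        joined.append(merged)
    return joined


def parse_cdx_rows(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Turn CDX JSON output (header row + data rows) into a list of dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def nearest_capture(captures: List[Dict[str, str]], target: datetime) -> Optional[Dict[str, str]]:
    """Pick the HTTP 200 capture whose timestamp is closest to the target date."""
    def distance(capture: Dict[str, str]) -> float:
        ts = datetime.strptime(capture["timestamp"], "%Y%m%d%H%M%S")
        return abs((ts - target).total_seconds())

    usable = [c for c in captures if c.get("statuscode") == "200"]
    return min(usable, key=distance, default=None)


if __name__ == "__main__":
    # Invented CDX rows, for illustration only.
    rows = [
        ["timestamp", "original", "statuscode", "digest"],
        ["20250301120000", "https://example.com/jobs/1", "200", "AAA"],
        ["20250314000000", "https://example.com/jobs/1", "404", "BBB"],
        ["20250320000000", "https://example.com/jobs/1", "200", "CCC"],
    ]
    print(nearest_capture(parse_cdx_rows(rows), datetime(2025, 3, 15)))
```

Filtering to status 200 before measuring distance keeps a nearby error page from being chosen over a slightly older valid capture.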
- Extract & Normalize
  - HTML→main text via trafilatura (fallback: Readability/jusText), lowercase, strip boilerplate & code blocks, Unicode normalize. ([trafilatura.readthedocs.io]3)
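The normalization half of this step can be sketched with the standard library alone. The sketch assumes main-text extraction has already happened upstream (for example via trafilatura) and that any surviving code blocks appear as triple-backtick fences; both are assumptions, and the function name is illustrative.

```python
import re
import unicodedata

# Build the fence marker programmatically so it is easy to swap for another
# delimiter if the extractor emits code blocks differently (an assumption).
FENCE = "`" * 3
FENCED_CODE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Strip fenced code blocks, Unicode-normalize, lowercase, and collapse whitespace."""
    text = FENCED_CODE.sub(" ", text)
    # NFKC folds compatibility characters (ligatures, full-width letters,
    # non-breaking spaces) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "Senior\u00a0Engineer  \ufb01nds  \uff32\uff55\uff53\uff54"
    print(normalize_text(sample))
```

NFKC is chosen here over NFC because job postings pasted from PDFs or word processors often carry ligatures and non-breaking spaces that would otherwise split term counts.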
- Feature Build
  - Tokenize; build unigrams/bigrams; apply skill lexicon + regex for variants (e.g., Type-Script|TypeScript); map deprecated→preferred terms.
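A minimal sketch of the feature build, assuming already-normalized text, is shown below. The variant patterns, the token regex, and the function names are illustrative placeholders rather than the project's actual lexicon.

```python
import re
from collections import Counter
from typing import List

# Hypothetical variant patterns mapped to a canonical (preferred) term.
VARIANTS = {
    r"type-?script": "typescript",
    r"\bk8s\b": "kubernetes",
    r"\bwhitelist\b": "allowlist",
}

# Tokens start with a letter or digit and may contain + # . - so that
# terms such as c++, c#, node.js, and full-stack stay intact.
TOKEN = re.compile(r"[a-z0-9][a-z0-9+#.\-]*")


def canonicalize(text: str) -> str:
    """Rewrite known spelling variants and deprecated terms to preferred forms."""
    for pattern, canonical in VARIANTS.items():
        text = re.sub(pattern, canonical, text)
    return text


def tokenize(text: str) -> List[str]:
    """Lowercase, canonicalize, and split into tokens, trimming trailing punctuation."""
    return [t.rstrip(".-") for t in TOKEN.findall(canonicalize(text.lower()))]


def ngram_counts(tokens: List[str], n_max: int = 2) -> Counter:
    """Count all n-grams from unigrams up to n_max (bigrams by default)."""
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    tokens = tokenize("Type-Script and TypeScript on k8s.")
    print(ngram_counts(tokens).most_common(3))
```

Canonicalizing before tokenizing means hyphenated and unhyphenated spellings land in the same count, which is what the rise/fall comparison needs.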
- Analysis
  - Compute term frequencies, ΔTF-IDF, and log-odds rise/fall between windows; estimate monthly slope on multi-month series if expanded.
  - Tag new (present now, absent before) and gone (present before, absent now) terms; flag ambiguous terms for manual review.
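The log-odds comparison and the new/gone tagging can be sketched as follows. This follows the commonly used log-odds-ratio formulation with an informative Dirichlet prior, and it makes one explicit assumption: the prior for each term is taken from the combined counts of both windows, scaled by a hypothetical `prior_scale` parameter. The counts in the example are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def log_odds_z(past: Counter, current: Counter, prior_scale: float = 1.0) -> Dict[str, float]:
    """Z-scored log-odds ratio per term; positive means rising in the current window.

    Assumes all stored counts are positive and the vocabulary has at least two terms.
    """
    vocab = set(past) | set(current)
    # Informative prior: combined counts across both windows (an assumption).
    prior = {w: prior_scale * (past[w] + current[w]) for w in vocab}
    prior_total = sum(prior.values())
    n_past, n_cur = sum(past.values()), sum(current.values())

    scores = {}
    for w in vocab:
        a = prior[w]
        lo_cur = math.log((current[w] + a) / (n_cur + prior_total - current[w] - a))
        lo_past = math.log((past[w] + a) / (n_past + prior_total - past[w] - a))
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (current[w] + a) + 1.0 / (past[w] + a)
        scores[w] = (lo_cur - lo_past) / math.sqrt(variance)
    return scores


def tag_new_and_gone(past: Counter, current: Counter) -> Dict[str, List[str]]:
    """New = present now, absent before; gone = present before, absent now."""
    return {
        "new": sorted(set(current) - set(past)),
        "gone": sorted(set(past) - set(current)),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    past = Counter({"jquery": 30, "react": 50, "python": 100})
    current = Counter({"react": 50, "python": 100, "rust": 30})
    for term, z in sorted(log_odds_z(past, current).items(), key=lambda kv: -kv[1]):
        print(f"{term:8s} {z:+.2f}")
    print(tag_new_and_gone(past, current))
```

The prior shrinks scores for rare terms toward zero, so a term seen twice in one window and never in the other does not outrank a common term with a large, well-supported shift.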
- Deliverables
  - CSVs: term_stats_past.csv, term_stats_current.csv, trend_report.csv.
  - Dashboard (Streamlit or Altair/Vega-Lite): top rising/falling skills, new vs gone, per-title breakdown, sample snippets.
  - Methods README with caveats, rate-limit notes, and reproducibility (fixed seeds).
- Testing & Ops
  - Add unit tests for parsing & counting; retries/backoff; de-duplication via URL+digest; logs with timings and error contexts.
  - Schedule optional monthly refresh via GitHub Actions (artifact publish).
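Two of the ops items above, retries with backoff and de-duplication via URL+digest, lend themselves to a short sketch. The helper names, the page record shape (`url` and `text` keys), and the choice of SHA-256 for the digest are assumptions made for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


def with_backoff(fetch: Callable[[], T], max_attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:  # Real code should catch only transient errors.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


def dedupe(pages: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first page seen for each (URL, content digest) pair."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
        key = (page["url"], digest)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


if __name__ == "__main__":
    pages = [
        {"url": "https://example.com/a", "text": "same"},
        {"url": "https://example.com/a", "text": "same"},
        {"url": "https://example.com/a", "text": "edited"},
    ]
    print(len(dedupe(pages)))
```

Passing `sleep` as a parameter lets unit tests record the delays instead of actually waiting, which keeps the retry logic testable alongside the parsing and counting tests.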
Resources/Instructions
- USAJOBS APIs (free, historical)
  - API overview & endpoints (Search, Historic JOAs, Announcement Text). ([developer.usajobs.gov]4)
- Internet Archive / Wayback (for expired private-sector postings)
  - CDX Server API reference & help pages. ([Internet Archive]2)
  - Python clients: waybackpy or edgi/wayback. ([PyPI]5)
- Extraction
  - trafilatura docs & quickstart; evaluation notes. ([trafilatura.readthedocs.io]3)
- Schema hints (when scraping non-federal sites)
  - schema.org/JobPosting and Google’s Job Posting structured data guide. ([Schema.org]6)
- Suggested stack (choose 1 of each pair)
  - HTTP: httpx or requests
  - Parsing: trafilatura or readability-lxml
  - NLP/Features: scikit-learn or textacy
  - Stats: statsmodels or scipy
  - Viz/Dashboard: Altair/Vega-Lite or Streamlit
  - Archival: waybackpy or edgi/wayback
- Deliverable spec
  - Repo with fetch/, extract/, analyze/, report/ modules; config.yaml for windows & filters; CI workflow to run end-to-end and publish CSVs + HTML report.
- If this issue requires access to 311 data, please answer:
  - Not applicable. This project uses USAJOBS public APIs and public web archives only.