
Job trends in software development by term frequency

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Analyze six-month shifts in software development job postings by comparing March 2025 (six months ago) to the most recent 30 days, quantifying what skills/technologies are rising, falling, newly in demand, or disappearing. Outputs include frequency deltas, trend lines, and a concise narrative report that’s reproducible and portfolio-ready.

The assignee for this issue will have to wear three hats:

  1. A Data Scientist - Responsible for posing and framing the question(s) and adapting the inquiry according to the evidence discovered. Pose an initial hypothesis in a report based on the initial motivations and investigations, revise it as insights come in, and create the Final Report for peer review.
  2. A Data Engineer - Responsible for devising a strategy to scrape old postings via an API, then implementing the scraping and cleaning, prepping, and shaping the data for visualization by the Data Analyst.
  3. A Data Analyst - Responsible for testing the Data Scientist's hypotheses and reporting back to them with insights from the shaped data, generating interim reports along the way.

Action Items

The beginning (research + design)

  • Define cohort & windows: titles containing software, developer, engineer, frontend, backend, full-stack, mobile, devops, data engineer, ml engineer; past window = 2025-03-01→2025-03-31, current window = last 30 days.
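
A minimal sketch of the cohort filter, assuming a case-insensitive regex over job titles is sufficient; the alternatives below simply mirror the title list above and can be tuned:

```python
import re

# Cohort pattern built from the titles listed in this issue (hyphen/space
# variants of frontend/backend/full-stack included for safety).
COHORT = re.compile(
    r"\b(software|developer|engineer|front[- ]?end|back[- ]?end|"
    r"full[- ]?stack|mobile|devops|data engineer|ml engineer)\b",
    re.IGNORECASE,
)

def in_cohort(title: str) -> bool:
    """Return True if a job title belongs to the software-development cohort."""
    return bool(COHORT.search(title))
```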

  • Select sources:

    • Federal: USAJOBS HistoricJoa + AnnouncementText (full text, closed postings).
    • Private sector (optional v1.1): archived postings from Greenhouse/Lever via Wayback snapshots.
  • Draft skills/tech lexicon with preferred mappings (e.g., whitelist→allowlist, primary/replica vs master/slave; frameworks: React/Vue/Svelte; platforms: Kubernetes, Terraform; languages: Rust/Go/TypeScript; AI terms: LLM, RAG, vector DB).

  • Methods spec: text extraction & normalization, n-gram and phrase detection, log-odds with informative Dirichlet prior for rise/fall, monthly slope estimation, multiple-testing control (BH-FDR).
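
The rise/fall step could follow the weighted log-odds of Monroe, Colaresi & Quinn (2008); a pure-Python sketch is below. The fallback prior of 0.01 for terms missing from the prior dict is an assumption — a real run would derive the informative prior from a background corpus:

```python
import math

def log_odds_dirichlet(counts_a: dict, counts_b: dict, prior: dict) -> dict:
    """z-scored log-odds ratio with an informative Dirichlet prior.
    Positive z => term is relatively more frequent in corpus A."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = sum(prior.values())
    scores = {}
    for w in set(counts_a) | set(counts_b):
        ya, yb = counts_a.get(w, 0), counts_b.get(w, 0)
        aw = prior.get(w, 0.01)  # assumed fallback pseudo-count
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)  # approximate variance
        scores[w] = delta / math.sqrt(var)
    return scores
```

The resulting z-scores are what BH-FDR would then be applied to.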

  • Ethics & QA: de-duplicate pages, ignore code blocks, document gaps (unarchived pages), and record rate-limit handling.

Once researched and ready (implementation steps)

  1. Ingest

    • Pull metadata with GET /api/HistoricJoa for both windows; join long fields with GET /api/HistoricJoa/AnnouncementText (key: USAJOBSControlNumber). ([developer.usajobs.gov]1)
    • (Optional) Enumerate private-sector snapshots via Wayback CDX; select nearest capture to target dates. ([Internet Archive]2)
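
The metadata/long-text join might look like the sketch below; the JSON field names (`USAJOBSControlNumber`, `announcementText`) follow the issue text but their exact casing in the API response is an assumption to verify against the HistoricJoa docs:

```python
def join_announcement_text(meta_rows: list, text_rows: list,
                           key: str = "USAJOBSControlNumber") -> list:
    """Left-join HistoricJoa metadata rows with AnnouncementText rows on the
    control-number key. Rows with no matching text get an empty string."""
    texts = {row[key]: row.get("announcementText", "") for row in text_rows}
    return [dict(row, announcementText=texts.get(row[key], ""))
            for row in meta_rows]
```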
  2. Extract & Normalize

    • HTML→main text via trafilatura (fallback: Readability/jusText), lowercase, strip boilerplate & code blocks, Unicode normalize. ([trafilatura.readthedocs.io]3)
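
The normalization half of this step (after trafilatura has reduced HTML to main text) can be done with the standard library alone; stripping fenced/`<pre>` code blocks before counting is a sketch of the "ignore code blocks" rule:

```python
import re
import unicodedata

# Matches Markdown fenced blocks and <pre> blocks left over after extraction.
CODE_BLOCK = re.compile(r"```.*?```|<pre>.*?</pre>", re.DOTALL)

def normalize(text: str) -> str:
    """Drop code blocks, apply Unicode NFKC normalization, lowercase,
    and collapse runs of whitespace."""
    text = CODE_BLOCK.sub(" ", text)
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()
```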
  3. Feature Build

    • Tokenize; build unigrams/bigrams; apply skill lexicon + regex for variants (e.g., Type-Script|TypeScript); map deprecated→preferred terms.
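
A stdlib sketch of the feature build, operating on already-normalized (lowercased) text; the variant map entries are illustrative examples, and the full lexicon would live in config:

```python
import re
from collections import Counter

# Illustrative variant map: regex -> canonical lexicon term.
VARIANTS = [
    (re.compile(r"\btype[- ]?script\b"), "typescript"),
    (re.compile(r"\bk8s\b"), "kubernetes"),
]

def features(text: str) -> Counter:
    """Map known variants to canonical terms, then count unigrams + bigrams."""
    for pat, canon in VARIANTS:
        text = pat.sub(canon, text)
    toks = re.findall(r"[a-z0-9+#.\-]+", text)  # keeps c++, c#, .net tokens
    grams = toks + [" ".join(p) for p in zip(toks, toks[1:])]
    return Counter(grams)
```

scikit-learn's `CountVectorizer(ngram_range=(1, 2))` would be the production-grade equivalent.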
  4. Analysis

    • Compute term frequencies, ΔTF-IDF, and log-odds rise/fall between windows; estimate monthly slope on multi-month series if expanded.
    • Tag new (present now, absent before) and gone (present before, absent now) terms; flag ambiguous terms for manual review.
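
The new/gone tagging is simple set arithmetic; the `min_count` threshold is an assumed noise filter so one-off strings do not register as trends:

```python
def tag_terms(past_counts: dict, current_counts: dict, min_count: int = 3) -> dict:
    """Tag terms as 'new' (present now, absent before) or 'gone'
    (present before, absent now), ignoring terms below min_count."""
    past = {w for w, c in past_counts.items() if c >= min_count}
    curr = {w for w, c in current_counts.items() if c >= min_count}
    return {"new": sorted(curr - past), "gone": sorted(past - curr)}
```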
  5. Deliverables

    • CSVs: term_stats_past.csv, term_stats_current.csv, trend_report.csv.
    • Dashboard (Streamlit or Altair/Vega-Lite): top rising/falling skills, new vs gone, per-title breakdown, sample snippets.
    • Methods README with caveats, rate-limit notes, and reproducibility (fixed seeds).
  6. Testing & Ops

    • Add unit tests for parsing & counting; retries/backoff; de-duplication via URL+digest; logs with timings and error contexts.
    • Schedule optional monthly refresh via GitHub Actions (artifact publish).
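
The retries/backoff requirement could be a small decorator like this sketch (exponential backoff; a production version would also honor `Retry-After` headers and log each attempt):

```python
import time
from functools import wraps

def with_retries(max_attempts: int = 4, base_delay: float = 1.0):
    """Retry a flaky callable with exponential backoff; re-raise on the
    final attempt so failures still surface in logs."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return deco
```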

Resources/Instructions

  • USAJOBS APIs (free, historical)

  • Internet Archive / Wayback (for expired private-sector postings)

    • CDX Server API reference & help pages. ([Internet Archive]2)
    • Python clients: waybackpy or edgi/wayback. ([PyPI]5)
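
Enumerating snapshots can also be done against the CDX Server API directly; a sketch of the query builder is below (the example Greenhouse URL pattern is hypothetical):

```python
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url: str, start: str, end: str, limit: int = 50) -> str:
    """Build a Wayback CDX query for captures of `url` between two
    YYYYMMDD timestamps, JSON output, de-duplicated by content digest."""
    params = {"url": url, "from": start, "to": end,
              "output": "json", "collapse": "digest", "limit": limit}
    return CDX + "?" + urlencode(params)

# e.g. cdx_query("boards.greenhouse.io/example/*", "20250301", "20250331")
```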
  • Extraction

  • Schema hints (when scraping non-federal sites)

    • schema.org/JobPosting and Google’s Job Posting structured data guide. ([Schema.org]6)
  • Suggested stack (choose 1 of each pair)

    • HTTP: httpx or requests
    • Parsing: trafilatura or readability-lxml
    • NLP/Features: scikit-learn or textacy
    • Stats: statsmodels or scipy
    • Viz/Dashboard: Altair/Vega-Lite or Streamlit
    • Archival: waybackpy or edgi/wayback
  • Deliverable spec

    • Repo with fetch/, extract/, analyze/, report/ modules; config.yaml for windows & filters; CI workflow to run end-to-end and publish CSVs + HTML report.
  • If this issue requires access to 311 data, please answer:

    • Not applicable. This project uses USAJOBS public APIs and public web archives only.

chinaexpert1 · Sep 11 '25 23:09