Materials for our team teaching+learning sessions around CS, ML, stats, and related data science topics. Intended to take ~45 minutes, mostly in narrative IPython notebooks.

Data Science 45-min Intros

Every week*, our data science team @Gnip (aka @TwitterBoulder) gets together for about 50 minutes to learn something.

While these started as opportunities to collectively "raise the tide" on common stumbling blocks in data munging and analysis tasks, they have since grown to machine learning, statistics, and general programming topics. Anything that will help us do our jobs better is fair game.

For each session, someone puts together the lesson/walk-through and leads the discussion. Presentation platforms commonly include well-written READMEs, IPython notebooks, knitr documents, interactive code sessions... the more hands-on, the better.

Feel free to use these for your own (or your team's) growth, and do submit pull requests if you have something to add.

*OK, while we try to do it every week, sometimes it doesn't happen. In that case, we try to guilt-trip the person who slacked.

Current topics

Python

  • Object-oriented programming concepts + modules/packaging

  • Unit testing with unittest

  • Iterators + Generators

  • Introduction to pandas

  • Introduction to Vertica with vertica_python

  • Introduction to multiprocessing

  • Python decorators

  • Python Interfaces

  • Python logging

Bash + command-line tools

  • Using jq

  • Bash data structures

  • Regular expressions

Statistics

  • Maximum Likelihood Estimation

  • Count-Min algorithm

  • A/B Testing

  • Causal inference

  • Error statistics

  • Classical statistics applied to social data

  • Meaningful comparisons of ordered lists

  • Counting and Maximum Likelihood Estimation

  • Estimating the number of classes in a population

  • Long Tail Distributions I

  • Long Tail Distributions II

  • Maximum Likelihood Parameter Estimation

  • Probability graph models

Machine Learning

  • Intro to scikit-learn

  • Introduction to K-means clustering

  • Choosing k in k-means clustering

  • Logistic Regression

  • Naive Bayes Classifier

  • Introduction to kNN

  • Introduction to AdaBoost

  • Decision Trees

  • Basis expansions + kernels

  • Model selection

  • Introduction to SVM

  • Text Mining with sklearn

  • Bandit Algorithms

  • Kernel smoothing

  • Neural Networks I

  • Neural Networks II

Natural Language Processing

  • Intro to topic modeling

  • More on topic modeling & a practical example

  • Part of speech tagging

  • Text processing

  • Word vector spaces

Network structure

  • Network statistics + igraph

  • Network analysis: using null models

  • Network analysis: community structures

  • Network analysis: centrality metrics

Algorithms

  • Count min sketch

Engineering

  • Refactoring

Geographic Information Systems

  • Shapefile utilities + reverse geocoding (and Makefile)

Web development

  • Websockets

  • Python + Flask basics

Visualization

  • D3 and JavaScript Intro

  • D3 reusable charts: Heatmap

  • Real Time Data - Websockets Intro

  • Introduction to horizon charts

  • Bokeh

  • Matplotlib - Graphing for science in Python

Databases

  • SQL 201 - script-based data and queries

  • Vertica