knyfe
knyfe copied to clipboard
knyfe is a python utility for rapid exploration of datasets.
What is knyfe?
knyfe is a python utility for rapid exploration of datasets. Use it when you have some kind of dataset and you want to get a feel for how it is composed, run some simple tests on it, or prepare it for further processing. The great thing about knyfe is that you don't have to know much about how your dataset is designed. You shouldn't have to remember in which variable resides in which column of your data matrix or how your structs are nested. Just get shit done.

Quickstart
knyfe is awesome on it's own, but it's really good friends with the iPython console. Just fire it up with ipython qtconsole --pylab=inline and get rockin':
>>> cereals = knyfe.Data("examples/cereals.json")
>>> print cereals.summary
Unnamed Dataset (75 samples)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
rating : 18.04 - 93.70 Mean: 42.59 +- 14.05
potass : 15.00 - 330.00 Mean: 99.25 +- 70.74 (missing in 2 samples)
fiber : 0.00 - 320.00 Mean: 161.27 +- 82.20
vitamins : 0.00 - 100.00 Mean: 28.33 +- 22.48
name : [Mueslix Crispy ...]
weight : 0.50 - 1.50 Mean: 1.03 +- 0.15
sodium : 0.00 - 5.00 Mean: 1.01 +- 1.01
shelf : 1 - 3
sugars : 5.00 - 23.00 Mean: 14.77 +- 3.93 (missing in 1 samples)
calories : 50 - 160
fat : 1.00 - 6.00 Mean: 2.53 +- 1.09
protein : 1.00 - 6.00 Mean: 2.53 +- 1.09
cups : 0.25 - 1.50 Mean: 0.82 +- 0.23
type : [cold, hot]
carbo : 0.00 - 14.00 Mean: 2.20 +- 2.38
manufacturer : [Kelloggs, Nabis...]
==================================================================================
>>> print set(cereals.manufacturer)
set(['Kelloggs', 'Nabisco', 'Ralston Purina', 'Quaker Oats', 'Post', 'General Mills'])
>>> kellogs_products = cereals.filter(manufacturer="Kellogs")
>>> hist(kellogs_products.sugars)

>>> kellogs_products.export("kellogs.xls")
Loading Data
Data objects can be created using
- Strings, interpreted as paths to JSON files
- dictionaries, interpreted as single samples
- lists of dictionaries
- other Data instances
So any of these will work:
cereals = knyfe.Data("examples/cereals.json")
all_examples = knyfe.Data("examples/*.json")
bruce = knyfe.Data({"name": "Bruce Schneier", "awesomeness": 8.7})
people = knyfe.Data([
{"name": "Justin Bieber", "awesomeness": 1.3},
{"name": "Nikola Tesla", "awesomeness": 9.8}
])
copy_of_singleton = knyfe.Data(singleton)
Exploring Data
At any time, you can print the summary of a data set to get a quick peek into what's inside:
>>> print people.summary
Unnamed Dataset (2 samples)
''''''''''''''''''''''''''''''''''''''''''''''''''''''
awesomeness : 1.30 - 9.80 Mean: 5.55 +- 4.25
name : [Nikola Tesla, ...]
======================================================
attributes will give you all attributes in a dataset:
>>> print people.attributes
set(['awesomeness', 'name'])
You can access the values of an attribute using the get method, or the shorthand .-notation:
>>> print people.get("awesomeness")
[ 1.3, 9.8]
>>> print people.awesomeness
[ 1.3, 9.8]
Note that while get works on any attribute, the dot-notation requires attributes to look like valid python variables. In any case, the values returned will be a numpy-array. Note that if there are samples with missing values, the returned array will be shorter than the data set itself. You can tell get to replace missing values, though:
>>> people += {"name": "The Yeti"}
>>> print people.get("awesomeness")
[ 1.3, 9.8]
>>> people.get("awesomeness", missing=NaN)
[ 1.3, 9.8, nan]
Manipulating Data
Adding Data, Unions and Differences
The + and - operators work as expected:
>>> yeti = {"name": "The Yeti"}
>>> people += yeti # Adds 1 sample to people (now 3)
>>> more_people = people + bruce # Creates new Dataset with 4 samples
>>> real_people = more_people - yeti # Creates new Dataset with Bruce, Nikoalai and Justin
Filtering
But the real awesomeness happens in filter. Back to our cereals:
>>> cereals.filter(manufacturer="Kellogs")
Will return a data set with only those samples from cereals where manufacturer is Kellogs.
>>> cereals.filter(shelf=(2,3))
will get all cereals with shelf being either 2 or 3, and
>>> cereals.filter("sugars")
will get all samples where the sugars attribute is present and does not evaluate to False (ie. is not NaN or 0). You can also filter by an array of booleans, which is very handy for situations like this:
>>> cereals.filter(cereals.calories > 60)
Note that in this case cereals.calories must not have any missing values, because then cereals.calories > 60 would be shorter than data itself. In such a case, you can use cereals.get("calories", missing=NaN) > 60 (samples with calories missing will not be part of the filtered dataset this way.) But you can also use any arbitrary filter like this:
>>> cereals.filter(lambda c: 12.0 <= c['sugars'] < 15.0)
gets all the cereals that have between 12 and 15 grams of sugar.
Daisy-chaining
Since filter returns a new data set, you can also chain methods:
>>> cereals.filter(manufacturer="Kellogs").filter(shelf=(2,3))
Of course, you can also write
>>> cereals.filter(manufacturer="Kellogs", shelf=(2,3))
and get the same effect - but chaining methods allows you to do a few other operations in a single line.
Other functions:
mapmedian_splittoggle_verboseremove_outlierslabeldependent_vars
Saving and Exporting
Saving to json is as easy as
cereals.save("new_dataset.json")
But exporting is just as swift:
cereals.save("excel_worksheet.xlsx")
knyfe will guess the format by the extension.
Formats
Currently following formats are supported.
csvfor comma separated valuexlsxfor Excel 07 or newerxlsfor legacy Excelodsfor open document spreadsheethtmlfor an html file
Native Datasets: JSON
Natively, knyfe treats data like JSON objects, or, key value pairs. If you know what JSON is, skip this section.
Why JSON?
Any data format should be constructed after three principles:
- Human readable
- Explict (ie. self-contained)
- Flexible
In other words, a dataset shouldn't look like this: PK\x03\x04\x14\x00\x00\x00\x00\x00\xce\xad and it also shouldn't look like 5.1,3.5,1.4,0.2;4.6,3.1,1.5,0.2. Why? For two reasons:
- If other people want to use your data, the should know what they're dealing with.
- Human readable means anybody will be able to open the data set, now and in 50 years.
What does JSON look like?
If you know Python, JSON will look very familiar: it translates to Python dict and list types almost directly. The only difference is that None in Python is null in JSON, and keys don't have to be strings. So a Dataset in JSON may look like this:
[
{
species: 'Elephant',
weight: 8014.2,
age: 31,
name: 'Dumbo'
},
{
species: 'Squirrel',
weight: 0.021,
age: .7,
name: null
}
]