Broca
Various useful NLP algos and utilities
There is some Python 2 support scattered throughout, but the library has not been fully tested against it.
This library is in development -- APIs may change and features may be unstable.
Overview
broca is an NLP library for experimenting with various approaches. Everything in this library is somewhat experimental and meant for rapid prototyping of NLP methods.
When I implement a new method, often from a paper or another source, I add it here so that it can be re-applied elsewhere.
Eventually I hope that broca can become a battery of experimental NLP methods which can easily be thrown at a new problem.
broca is structured like so:
- common: misc utilities and classes reused across the whole library. Also includes shared objects.
- distance: for measuring string distance. This should probably be renamed though, since "distance" means a lot more than just string distance.
- tokenize: various tokenization methods
    - keyword: keyword-based tokenization methods (i.e. keyword extraction methods)
- vectorize: various ways of representing documents as vectors
- similarity: various ways of computing similarity
    - term: for computing similarity between two terms
    - doc: for computing similarity matrices for sets of documents
- preprocess: for preprocessing text, i.e. cleaning
- knowledge: tools for preparing or incorporating external knowledge sources, such as Wikipedia or IDF on auxiliary corpora
- pipeline: for easily chaining broca classes into pipelines - useful for rapid prototyping
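For instance, pulling classes from a few of these modules might look like this (HTMLCleaner, BasicCleaner, BoWVectorizer, and OverkillTokenizer all appear later in this README; the exact import path for OverkillTokenizer is an assumption based on the structure above):

from broca.preprocess import HTMLCleaner, BasicCleaner
from broca.vectorize import BoWVectorizer
from broca.tokenize import OverkillTokenizer  # assumed path, per the structure above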
Installation
broca is available through PyPI, but the library is under active development, so it's recommended to install via git:
$ pip install git+ssh://git@github.com/ftzeng/broca.git
Or, if adding to a requirements.txt, add the line:
git+ssh://git@github.com/ftzeng/broca.git
If developing, you can clone the repo and, from within the repo directory, install via pip:
$ pip install --editable .
Your installed version will be aliased directly from the repo directory, so changes are always immediately accessible.
You also need to install the spacy and nltk libraries' data:
$ python -m spacy.en.download
$ python -m nltk.downloader all
Usage
You can use broca's modules conventionally, or you can take advantage of its pipelines:
from broca import Pipeline
from broca.preprocess import BasicCleaner, HTMLCleaner
from broca.vectorize import BoWVectorizer, DCSVectorizer
p = Pipeline(
    HTMLCleaner(),
    BasicCleaner(),
    BoWVectorizer()
)
vecs = p(docs)
Pipelines allow you to chain broca's objects and easily swap them out.
Pipelines are validated upon creation to ensure that the outputs and inputs of adjacent components ("pipes") are compatible.
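For example, constructing a pipeline whose adjacent pipes don't line up should fail at creation time. A minimal sketch, assuming incompatible pipes raise an exception during validation (the exact exception type is an assumption):

from broca import Pipeline
from broca.preprocess import HTMLCleaner
from broca.vectorize import BoWVectorizer

try:
    # Invalid: BoWVectorizer outputs vectors, but HTMLCleaner expects documents.
    p = Pipeline(
        BoWVectorizer(),
        HTMLCleaner()
    )
except Exception as e:
    print('Incompatible pipes:', e)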
Multi-pipelines
You can also build multi-pipelines to try out a variety of pipelines simultaneously:
p = Pipeline(
    HTMLCleaner(),
    BasicCleaner(),
    [BoWVectorizer(), DCSVectorizer()]
)
vecs1, vecs2 = p(docs)
This results in two pipelines which are run simultaneously when p(docs) is executed:
HTMLCleaner() -> BasicCleaner() -> BoWVectorizer()
HTMLCleaner() -> BasicCleaner() -> DCSVectorizer()
Nesting pipelines
You can also nest pipelines and multi-pipelines:
clean = Pipeline(
    HTMLCleaner(),
    BasicCleaner(),
)

vectr_pipeline = Pipeline(
    clean,
    [BoWVectorizer(), DCSVectorizer()]
)

vecs1, vecs2 = vectr_pipeline(docs)
Branching
Pipes can support input from multiple pipes or output to multiple pipes simultaneously.
Where multi-pipelines create distinct, separate pipelines, a branching pipeline is a single pipeline whose inputs are mapped over branching segments and reduced afterwards.
Branches are specified as tuples.
Here's an example:
class A(Pipe):
    input = Pipe.type.vals
    output = Pipe.type.vals

    def __call__(self, vals):
        return [v+1 for v in vals]

class B(Pipe):
    input = Pipe.type.vals
    output = Pipe.type.vals

    def __call__(self, vals):
        return [v+2 for v in vals]

class C(Pipe):
    input = Pipe.type.vals
    output = Pipe.type.vals

    def __call__(self, vals):
        return [v+3 for v in vals]

class D(Pipe):
    input = Pipe.type.vals
    output = Pipe.type.vals

    def __call__(self, vals):
        return [v+4 for v in vals]

class E(Pipe):
    input = (Pipe.type.vals, Pipe.type.vals, Pipe.type.vals)
    output = Pipe.type.vals

    def __call__(self, vals1, vals2, vals3):
        return [sum([v1, v2, v3]) for v1, v2, v3 in zip(vals1, vals2, vals3)]

branching_pipeline = Pipeline(
    A(),
    (B(), C(), D()),  # A branching segment
    (B(), C(), D()),  # Another branching segment
    E()               # Reduced
)

branching_pipeline([1,2,3,4])
# [24,27,30,33]
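To see where these numbers come from, here's the trace for the first input value:

# A: 1+1 = 2
# First branching segment:  B: 2+2 = 4,  C: 2+3 = 5,  D: 2+4 = 6
# Second branching segment: B: 4+2 = 6,  C: 5+3 = 8,  D: 6+4 = 10
# E (reduce): 6+8+10 = 24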
Whatever follows a branching segment must accept multiple inputs - this could be a single Pipe or another branching segment of equal size.
Alternatively, the A Pipe in the example could have had its output defined as a tuple:
class A(Pipe):
    input = Pipe.type.vals
    output = (Pipe.type.vals, Pipe.type.vals, Pipe.type.vals)

    def __call__(self, vals):
        return [v+1 for v in vals], [v+2 for v in vals], [v+3 for v in vals]
# Running the above with A defined as such returns [27, 30, 33, 36] instead.
Freezing pipes
By default, pipelines are frozen - that is, each pipe's output is memoized to disk based on the inputs it receives. If the input changes or the pipe's __call__ method is redefined, its output will be recomputed; otherwise, it will be loaded from disk. This means you can easily swap out components in a pipeline without redundantly recomputing the parts that are not affected.
You can disable this behavior for a pipeline by specifying freeze=False:
p = Pipeline(
    HTMLCleaner(),
    BasicCleaner(),
    freeze=False
)
You can force the recomputation of an entire pipeline by specifying refresh=True:
p = Pipeline(
    HTMLCleaner(),
    BasicCleaner(),
    refresh=True
)
Implementing a pipe
Implementing your own pipeline component is easy. Just define a class which inherits from broca.pipeline.Pipe, then define its __call__ method and its input and output class attributes, which should be Pipe.type values.
The __call__ method must take only two arguments: self and the input from the preceding pipe. If there are parameters to be specified, they should be handled in the pipe's __init__ method.
from broca import Pipe
class MyPipe(Pipe):
    input = Pipe.type.docs
    output = Pipe.type.vecs

    def __init__(self, some_param):
        self.some_param = some_param

    def __call__(self, docs):
        # Do something with docs to get vectors.
        vecs = make_vecs_func(docs, self.some_param)
        return vecs
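MyPipe can then be dropped into a pipeline like any built-in pipe. A sketch, assuming a preceding pipe whose output is Pipe.type.docs:

p = Pipeline(
    BasicCleaner(),
    MyPipe(some_param=0.5)
)
vecs = p(docs)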
The default __init__ method saves positional initialization args to self.args and keyword args as attributes by their key names, so you won't need to implement __init__ if you only need it to pass arguments along to __call__.
You can use anything for your input and output pipe types, e.g. Pipe.type.foo or Pipe.type.hello_there. They are dynamically generated as needed.
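Putting these two points together, here's a hypothetical pipe that relies on the default __init__ and an ad hoc pipe type (make_foo and some_param are stand-ins, not part of broca):

class FooPipe(Pipe):
    input = Pipe.type.docs
    output = Pipe.type.foo  # 'foo' is generated on first use

    def __call__(self, docs):
        # self.some_param was stored by the default __init__.
        return [make_foo(d, self.some_param) for d in docs]

foo_pipe = FooPipe(some_param=0.5)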
The Identity Pipe
Sometimes you need a pipe to pass along input unmodified.
For example, the WikipediaSimilarity pipe takes (Pipe.type.docs, Pipe.type.tokens) as input; that is, you want to pass along both the docs and tokenized versions of those docs.
This can be accomplished with branching and the IdentityPipe, which requires that you specify the pipe type of its input:
docs = [
    'I am a cat',
    'I have a lion'
]

p = Pipeline(
    (IdentityPipe(Pipe.type.docs), OverkillTokenizer()),
    WikipediaSimilarity()
)
Parallel processing
Some pipes, including OverkillTokenizer, RakeTokenizer, BasicCleaner, and HTMLCleaner, support parallel processing (multiprocessing), activated by specifying the n_jobs keyword argument as something other than 0, e.g. RakeTokenizer(n_jobs=4). If n_jobs is negative, that many cores are subtracted from the total available; for instance, if n_jobs=-1 and your machine has 4 cores, then 3 cores will be used.
There is a bit of overhead to setting up multiprocessing, so the performance gains are only seen with larger amounts of data. There's also an additional memory cost for each separate process, so keep in mind that there is a speed/memory trade-off.
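For instance, a tokenization pipeline where each of these pipes uses all but one core might look like this (a sketch; assigning the result to tokens assumes OverkillTokenizer outputs tokens):

p = Pipeline(
    HTMLCleaner(n_jobs=-1),
    BasicCleaner(n_jobs=-1),
    OverkillTokenizer(n_jobs=-1)  # on a 4-core machine, each pipe uses 3 cores
)
tokens = p(docs)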
Examples
There are a few usage examples in the examples directory.
Tests
Unit tests can be run using nose:
$ nosetests tests