
Seeing ~100ms overhead per doc. Did performance break, or am I using the API incorrectly?

Open · epmoyer opened this issue on Apr 11, 2022 · 2 comments

I am seeing ~100ms of overhead to process any document (text), which seems like it can't be correct given the performance data listed for large corpora. I've been porting a large Python/NLTK app to Go, but my legacy Python/NLTK text parsing runs ~250x faster than my new Go/prose implementation, so I must be doing something wrong.

Did some external dependency break the performance of prose, or am I using the API incorrectly?

Simple performance test (Go 1.18)

Here's a simple test that processes the same short sentence twice. Both executions take ~100ms.

Code:

package main

import (
	"fmt"
	"time"

	"github.com/jdkato/prose/v2"
)

var text = "This is a simple test."

func main() {
	for i := 0; i < 2; i++ {
		start := time.Now()
		doc, err := prose.NewDocument(
			text,
			prose.WithExtraction(false),
			prose.WithSegmentation(false))
		duration := time.Since(start)
		fmt.Println(duration)
		if err != nil {
			panic(err)
		}

		// Iterate over the doc's tokens:
		fmt.Print("   ")
		for _, tok := range doc.Tokens() {
			fmt.Printf("(%v, %v)  ", tok.Text, tok.Tag)
		}
		fmt.Println()
	}
}

Output:

$ go run .
118.549243ms
   (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
117.214746ms
   (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
$

Comparison test using NLTK in Python (3.8)

When I run the same test using NLTK in Python, the first document also incurs ~100ms of overhead, but all subsequent documents are processed very quickly (~400µs in the example below):

Sample code:

#!/usr/bin/env python
import nltk
from datetime import datetime

text = "This is a simple test."

for _ in range(2):
    start = datetime.now()
    raw_tokens = nltk.word_tokenize(text)
    pos_tokens = nltk.pos_tag(raw_tokens)
    duration = datetime.now() - start
    print(duration)
    print(f'   {pos_tokens}')

Output:

$ ./test-nltk.py 
0:00:00.092738
   [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN'), ('.', '.')]
0:00:00.000415
   [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN'), ('.', '.')]
$

epmoyer commented on Apr 11, 2022

Answering my own question...

It turns out that the ~100ms overhead comes from creating the (internal) Model instance. If you instead pass a Model to .NewDocument() (e.g. via prose.UsingModel(model)), the overhead disappears. In my sample test, the *Document creation time for a simple sentence dropped from ~100ms to ~50µs (a 2000x improvement).

That means that if you are going to call .NewDocument() repeatedly for each document in a large corpus, you should create the Model once and pass it to every .NewDocument() call. That wasn't at all obvious from the documentation, and it ought to be noted somewhere in the README.

Under the hood, .NewDocument() calls defaultModel() if no Model was passed to it, but because defaultModel() is private there doesn't appear to be a convenient way to pre-create a default Model for use in subsequent calls. In my test code below, I first call .NewDocument() without passing a Model, "steal" the model from the returned *Document object (it is exposed as .Model), and then pass that same Model to the second .NewDocument() call.

In my real project I plan to call .NewDocument() once at initialization time (passing it a single word of dummy text), extract the .Model object, and reuse it for all subsequent calls, as sketched below.
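
For reference, here is a minimal sketch of that initialization pattern. The helper name defaultProseModel is my own invention (it is not part of the prose API); it simply builds a throwaway document from dummy text and extracts its .Model:

package main

import (
	"log"

	"github.com/jdkato/prose/v2"
)

// defaultProseModel builds a throwaway document from dummy text purely to
// obtain the default Model, which can then be reused for every real document.
func defaultProseModel() (*prose.Model, error) {
	doc, err := prose.NewDocument("init")
	if err != nil {
		return nil, err
	}
	return doc.Model, nil
}

func main() {
	model, err := defaultProseModel()
	if err != nil {
		log.Fatal(err)
	}
	// Every subsequent document reuses the pre-built model, avoiding the
	// ~100ms model-construction overhead on each call.
	doc, err := prose.NewDocument("This is a simple test.",
		prose.WithExtraction(false),
		prose.WithSegmentation(false),
		prose.UsingModel(model))
	if err != nil {
		log.Fatal(err)
	}
	for _, tok := range doc.Tokens() {
		log.Printf("(%v, %v)", tok.Text, tok.Tag)
	}
}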

Revised test code:

package main

import (
	"fmt"
	"time"

	"github.com/jdkato/prose/v2"
)

var text = "This is a simple test."
var model *prose.Model

func main() {
	for i := 0; i < 2; i++ {
		var doc *prose.Document
		var err error
		start := time.Now()
		if model == nil {
			// Our first call will have no model, so it will create one
			doc, err = prose.NewDocument(
				text,
				prose.WithExtraction(false),
				prose.WithSegmentation(false))
		} else {
			// Our second call will re-use the model created by the first call
			doc, err = prose.NewDocument(
				text,
				prose.WithExtraction(false),
				prose.WithSegmentation(false),
				prose.UsingModel(model))
		}
		duration := time.Since(start)
		fmt.Println(duration)
		if err != nil {
			panic(err)
		}

		// Iterate over the doc's tokens:
		fmt.Print("   ")
		for _, tok := range doc.Tokens() {
			fmt.Printf("(%v, %v)  ", tok.Text, tok.Tag)
		}
		fmt.Println()

		// Steal the model, to use on our second call
		if model == nil {
			model = doc.Model
		}
	}
}

Revised output:

$ go run .
121.068976ms
   (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .)  
50.154µs
   (This, DT)  (is, VBZ)  (a, DT)  (simple, JJ)  (test, NN)  (., .) 
$

epmoyer commented on Apr 11, 2022

@epmoyer thanks for sharing the solution. Caching the model resulted in a more than 4x improvement in my project. It seems like prose could cache the defaultModel internally without any issues.
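
In the meantime, a small wrapper over the public API gives the same effect on the application side. This is just a sketch (the names newDocumentCached, modelOnce, etc. are my own, not part of prose, and it assumes prose.DocOpt is the exported option type used by WithExtraction and friends); sync.Once guards the one-time model creation:

package main

import (
	"fmt"
	"sync"

	"github.com/jdkato/prose/v2"
)

var (
	modelOnce   sync.Once
	cachedModel *prose.Model
	modelErr    error
)

// newDocumentCached behaves like prose.NewDocument, but constructs the
// default Model only on the first call and reuses it afterwards.
func newDocumentCached(text string, opts ...prose.DocOpt) (*prose.Document, error) {
	modelOnce.Do(func() {
		// Build a throwaway document just to obtain the default Model.
		doc, err := prose.NewDocument("init")
		if err != nil {
			modelErr = err
			return
		}
		cachedModel = doc.Model
	})
	if modelErr != nil {
		return nil, modelErr
	}
	return prose.NewDocument(text, append(opts, prose.UsingModel(cachedModel))...)
}

func main() {
	// Only the first call pays the model-construction cost.
	doc, err := newDocumentCached("This is a simple test.")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(doc.Tokens()), "tokens")
}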

schahriar commented on May 1, 2022