
Flexible classic and NeurAl Retrieval Toolkit


FlexNeuART (flex-noo-art)

Flexible classic and NeurAl Retrieval Toolkit, or FlexNeuART for short (intended pronunciation: flex-noo-art), is a substantially reworked knn4qa package. An overview can be found in our EMNLP OSS workshop paper: Leonid Boytsov and Eric Nyberg. "Flexible retrieval with NMSLIB and FlexNeuART," 2020.

In Aug-Dec 2020, we used this framework to generate the best traditional and/or neural runs in the MS MARCO Document ranking task. In fact, our best traditional (non-neural) run slightly outperformed a couple of neural submissions. Please see our write-up for details: Boytsov, Leonid. "Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard." arXiv preprint arXiv:2012.08020 (2020).

Regrettably, due to administrative and licensing/patenting issues (a patent has been filed), the neural Model 1 code cannot be released. This model (together with its non-contextualized variant) is described and evaluated in our ECIR 2021 paper: Boytsov, Leonid, and Zico Kolter. "Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits." ECIR 2021.

In terms of pure effectiveness on long documents, other models (CEDR & PARADE) seem to perform equally well (or somewhat better). They are available in our codebase. We are not aware of any patents inhibiting the use of the traditional (non-neural) Model 1.

Objectives

Develop & maintain a (relatively) light-weight modular middleware useful primarily for:

  • Research
  • Education
  • Evaluation & leaderboarding

Main features

  • Dense, sparse, or dense-sparse retrieval using Lucene and NMSLIB.
  • Multi-field, multi-level forward indices (+parent-child field relations) that can store parsed and "raw" text input as well as sparse and dense vectors.
  • Forward indices can be created in append-only mode, which requires much less RAM.
  • Pluggable generic rankers (via a server).
  • SOTA neural (CEDR, PARADE, BERT FirstP/MaxP/Sum) and non-neural models (multi-field BM25, IBM Model 1); a multi-field BM25 sketch follows this list.
  • Multi-GPU training and inference with out-of-the-box support for ensembling.
  • Basic experimentation framework (+LETOR).
  • Python API to use retrievers and rankers as well as to access indexed data.
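
To make the multi-field BM25 idea from the feature list concrete, here is a minimal, self-contained sketch of combining per-field BM25 scores with a weighted sum. It is only an illustration: the function and parameter names are invented for this example and do not correspond to FlexNeuART's actual implementation.

```python
import math
from collections import Counter

def bm25_field_score(query_terms, field_terms, doc_freqs, num_docs, avg_len,
                     k1=1.2, b=0.75):
    """BM25 score of a query against a single document field."""
    tf = Counter(field_terms)
    field_len = len(field_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)
        if df == 0 or term not in tf:
            continue
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        tf_norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1.0 - b + b * field_len / avg_len))
        score += idf * tf_norm
    return score

def multi_field_bm25(query_terms, doc_fields, field_stats, field_weights):
    """Linear combination of per-field BM25 scores, e.g., title and body."""
    total = 0.0
    for field, weight in field_weights.items():
        stats = field_stats[field]  # per-field df table, #docs, average length
        total += weight * bm25_field_score(
            query_terms,
            doc_fields.get(field, []),
            stats["doc_freqs"], stats["num_docs"], stats["avg_len"])
    return total
```

In practice, the per-field weights (and the BM25 parameters) would be tuned, for example, with the LETOR-style experimentation framework mentioned above.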

Documentation

  • Usage notebooks covering installation & most functionality (including experimentation and Python API demo)
  • Legacy notebooks for MS MARCO and Yahoo Answers
  • Former life (as a knn4qa package), including acknowledgements and publications

We support a number of neural BERT-based ranking models as well as strong traditional ranking models including IBM Model 1 (description of non-neural rankers to follow).
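The traditional (non-neural) Model 1 mentioned above scores a query with a translation language model. The sketch below shows the standard formulation only; the data structures and names are assumptions for this example, not FlexNeuART's actual code.

```python
import math

def model1_log_score(query_terms, doc_terms, trans_prob, collection_prob, lam=0.5):
    """Log-probability of the query under a translation language model
    (classic, non-neural IBM Model 1 applied to retrieval).

    trans_prob[(q, d)]  -- probability of "translating" document term d into
                           query term q (learned offline from paired data)
    collection_prob[q]  -- background collection language model for smoothing
    All names here are illustrative.
    """
    doc_len = max(len(doc_terms), 1)
    log_score = 0.0
    for q in query_terms:
        # P(q | D) = sum over document tokens of T(q | d) / |D|, i.e.,
        # T(q | d) weighted by the MLE document language model P(d | D)
        p_trans = sum(trans_prob.get((q, d), 0.0) for d in doc_terms) / doc_len
        # Jelinek-Mercer smoothing with the collection model
        p = (1.0 - lam) * p_trans + lam * collection_prob.get(q, 1e-9)
        log_score += math.log(max(p, 1e-12))
    return log_score
```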

The framework supports data in a generic JSONL format (an illustrative example follows the list below). We provide conversion (and in some cases download) scripts for the following collections:

  • Cranfield (a small toy collection)
  • MS MARCO v1 and v2 (documents and passages)
  • Wikipedia DPR (Natural Questions, Trivia QA, SQuAD)
  • Yahoo Answers
  • Configurable processing of standard datasets provided by ir-datasets.
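
For illustration, the snippet below writes one document entry in the generic JSONL format (one JSON object per line). The field names are assumptions made for this sketch; the actual schema is defined by the collection-specific conversion scripts and the indexing configuration.

```python
import json

# Field names below are illustrative assumptions; the actual schema is
# determined by the collection conversion scripts / indexing configuration.
doc = {
    "DOCNO": "doc0000001",                      # unique document identifier
    "title": "example document title",          # a parsed text field
    "text": "parsed (e.g., lemmatized) body",   # another parsed text field
    "text_raw": "The original, unprocessed body text.",
}
with open("docs.jsonl", "a", encoding="utf-8") as out_f:
    out_f.write(json.dumps(doc) + "\n")         # one JSON object per line
```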

Acknowledgements

For neural network training, FlexNeuART incorporates a substantially reworked variant of CEDR (MacAvaney et al., 2019).