# Revisit tests

## Problem
At the moment there is some confusion about how test functions are categorized, and it's not easy to distinguish between unit and integration tests. Fixing this would improve the development experience by making tests fail fast when something is wrong and by making it more obvious what the issue is.
## Proposal
We formally define three scopes for tests in Haystack with different requirements and purposes; newly added tests will follow those requirements while we progressively adapt the existing tests to the new model.
The three scopes are defined as follows (a pytest sketch after the list illustrates all three):
- **Unit test**
  - Tests a single logical concept
  - Execution time is a few milliseconds
  - Any external resource is mocked
  - Always returns the same result
  - Can run in any order
  - Runs at every commit in `draft` and `ready` PRs, automated through pytest
  - Can run locally with no additional setup
  - Goal: being confident in merging code
- **Integration test**
  - Tests a single logical concept
  - Execution time is a few seconds
  - Uses external resources that must be available before execution
  - When using models, cannot run inference
  - Always returns the same result or an error
  - Can run in any order
  - Runs at every commit in `ready` PRs, automated through pytest
  - Can run locally with some additional setup (e.g. Docker)
  - Goal: being confident in merging code
- **End to End (e2e) test**
  - Tests a sequence of multiple logical concepts
  - Execution time has no limits (can be always on)
  - Can use inference
  - Evaluates the results of the execution or the status of the system
  - Uses external resources that must be available before execution
  - Can return different results
  - Can be dependent on the order
  - Can be wrapped into any process execution
  - Runs outside the development cycle (nightly or on demand)
  - Might not be possible to run locally due to system and hardware requirements
  - Goal: being confident in releasing Haystack
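
For illustration, here is a minimal pytest sketch of how the three scopes could sit side by side. Everything in it is an assumption made for the sketch: the `unit`/`integration`/`e2e` markers, the localhost URL, and the `RUN_E2E` switch are not the actual Haystack setup.

```python
"""Illustrative only: markers, names, URLs and env vars are assumptions."""
import os
from unittest.mock import MagicMock

import pytest
import requests


@pytest.mark.unit
def test_ranker_orders_by_score():
    # Unit: one logical concept, the model is mocked, the result is
    # deterministic and the test runs in milliseconds with no setup.
    model = MagicMock()
    model.score.side_effect = lambda doc: {"a": 0.2, "b": 0.9}[doc]
    ranked = sorted(["a", "b"], key=model.score, reverse=True)
    assert ranked == ["b", "a"]


@pytest.mark.integration
def test_backend_is_reachable():
    # Integration: needs an external resource (e.g. Elasticsearch
    # started through Docker) but no model inference; it returns the
    # same result, or an error if the resource is missing.
    response = requests.get("http://localhost:9200")
    assert response.status_code == 200


@pytest.mark.e2e
def test_pipeline_answers_sample_questions():
    # e2e: chains several concepts and may run real inference, so the
    # outcome is evaluated rather than compared for exact equality.
    if not os.environ.get("RUN_E2E"):
        pytest.skip("e2e tests run nightly or on demand")
    # ... build a full pipeline, run queries, and assert quality above
    # a threshold instead of asserting exact answers.
```

One possible upside of markers (rather than folder names alone): CI could select `-m unit` on draft PRs and add `-m integration` on ready ones without moving files around.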
## Action plan
Note: this plan will be subject to heavy changes as we progress in understanding how the tests can be improved.
### Step 1: Identify and isolate e2e tests
Test code will be moved to a separate folder whose directory tree will reflect the code layout. New workflows will be created as needed, running on a schedule (a conftest sketch follows the task list below).
- [ ] Identify and isolate e2e tests on document_stores package
- [ ] Identify and isolate e2e tests on modeling package
- [ ] Identify and isolate e2e tests on nodes package
- [ ] Identify and isolate e2e tests on pipelines package
- [ ] #2893
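
As a strawman for the folder split, a hypothetical `conftest.py` could keep everything collected from the new `e2e/` tree out of regular runs and let the scheduled workflow opt in. The `RUN_E2E` variable and the folder name are assumptions, not decisions:

```python
"""Hypothetical conftest.py sketch: register the scope markers and keep
e2e tests out of regular runs unless explicitly enabled, so a scheduled
(nightly) workflow can opt in via an env var."""
import os

import pytest


def pytest_configure(config):
    # Register the markers so `--strict-markers` can be enabled later.
    for scope in ("unit", "integration", "e2e"):
        config.addinivalue_line("markers", f"{scope}: {scope}-scope test")


def pytest_collection_modifyitems(config, items):
    # Skip everything collected from the separate e2e/ tree unless the
    # (hypothetical) RUN_E2E switch is set, e.g. by the nightly job.
    if os.environ.get("RUN_E2E"):
        return
    skip_e2e = pytest.mark.skip(reason="e2e tests run nightly or on demand")
    for item in items:
        if "e2e" in item.nodeid.split("/"):
            item.add_marker(skip_e2e)
```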
### Step 2: Identify and separate unit and integration tests in document stores
- [x] #2943
- [ ] #2892
- [ ] Identify and separate unit and integration tests in the base suite (#2906)
- [ ] Identify and separate unit and integration tests in Elasticsearch
- [ ] Identify and separate unit and integration tests in Pinecone
- [ ] Identify and separate unit and integration tests in DeepsetCloudDocumentStore
- [ ] Identify and separate unit and integration tests in Weaviate
- [ ] Identify and separate unit and integration tests in Faiss
- [ ] Identify and separate unit and integration tests in Milvus
- [ ] Identify and separate unit and integration tests in Knowledge graph
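
To make the unit/integration split concrete, the sketch below divides one document store behavior between the two scopes. `MyDocumentStore` and its client wiring are stand-ins, not the real Haystack classes:

```python
"""Illustrative only: MyDocumentStore is a stand-in, not a Haystack class."""
from unittest.mock import MagicMock

import pytest


class MyDocumentStore:
    """Minimal stand-in for a backend-based document store."""

    def __init__(self, client):
        self.client = client  # e.g. an Elasticsearch client

    def write_documents(self, docs):
        for doc in docs:
            self.client.index(index="docs", document=doc)


@pytest.mark.unit
def test_write_documents_indexes_every_doc():
    # Unit scope: the backend client is mocked, so only our own logic
    # runs and the test finishes in milliseconds.
    client = MagicMock()
    MyDocumentStore(client).write_documents([{"content": "a"}, {"content": "b"}])
    assert client.index.call_count == 2


@pytest.mark.integration
def test_write_documents_against_real_backend():
    # Integration scope: talks to a real backend started beforehand
    # (e.g. through Docker) but still tests a single logical concept.
    from elasticsearch import Elasticsearch  # requires a running instance

    store = MyDocumentStore(Elasticsearch("http://localhost:9200"))
    store.write_documents([{"content": "a"}])
```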
### Step 3: Identify and separate unit and integration tests in other packages
These tasks might include partial refactoring of the existing testing workflow.
- [ ] Modeling -> most of them are end-to-end
- [ ] Pipeline -> many unit tests, a few end-to-end, no integration
- [ ] Nodes -> Very varied, probably best to tackle last
An interesting case of tests on the line between integration and end-to-end: https://github.com/deepset-ai/haystack/pull/2903
These tests are e2e according to the definition above, because they initialize a model and run inference on it. However, the nodes are classifiers, so for testing purposes mocking the models is trivial, a matter of a simple if-else. So, are these tests better seen as end-to-end due to the inference step, or should they rather be mocked and used as integration tests?
Note that these tests, in my opinion, are too high level to be considered unit tests, regardless of the mock or their speed: they exercise most of the node at once, which is too wide a scope.
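
For reference, the "simple if-else" mock mentioned above could look like this; the class name and labels are hypothetical, not the actual nodes from the PR:

```python
class FakeQueryClassifier:
    """Hypothetical stand-in: the whole model collapses into an if-else,
    so there is no inference and the output is deterministic."""

    def predict(self, query: str) -> str:
        if query.endswith("?"):
            return "question"
        return "keyword"


def test_routing_with_mocked_classifier():
    classifier = FakeQueryClassifier()
    assert classifier.predict("Who wrote Faust?") == "question"
    assert classifier.predict("faust author") == "keyword"
```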
We decided to split this epic and continue the work in chunks; see the updated issue description. Further conversations will happen in their respective epics.
With the newly reduced scope, I'm going to call this one done and close.