haystack
Splitting class not working properly
Describe the bug When splitting a document, regardless of the parameters passed to the PreProcessor, the document content is not split at all.
Expected behavior Calling preprocessor.process() with split_by='word' should return a list of Document objects corresponding to the individual words in document.content.
To Reproduce
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
clean_whitespace=True,
clean_empty_lines=True,
split_by='word',
split_respect_sentence_boundary=False
)
# Note the capitalized class name: Document, not document
test_document = Document(content="This is a test document. I like MARHSHMALOWS!", content_type="text")
documents = preprocessor.process(documents=[test_document])
[<Document: {'content': 'This is a test document. I like MARHSHMALOWS!', 'content_type': 'text', 'score': None, 'meta': {'_split_id': 0}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '1f511088079fecb84ba3f84752985b51'}>]
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS: Ubuntu
- GPU/CPU: 8x A100
- Haystack version (commit or version number): Latest
- DocumentStore: FAISSDocumentStore
- Reader: n/a
- Retriever: EmbeddingRetriever( embedding_model="sentence-transformers/all-distilroberta-v1", model_format="sentence_transformers" )
Hello, @darinkishore!
Please check the API Reference.
If not specified, split_length is 200, so you are currently dividing your Document into chunks of 200 words. Your test document contains only 8 words, so it fits into a single chunk and comes back unsplit.
Try:
preprocessor = PreProcessor(
split_length=2,
clean_whitespace=True,
clean_empty_lines=True,
split_by='word',
split_respect_sentence_boundary=False
)
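To see why the default produced a single document, here is a minimal sketch of word-based chunking in plain Python. It is not Haystack's implementation: the helper name split_by_word is hypothetical, and it ignores cleaning, overlap, and sentence boundaries. It only illustrates how split_length controls the number of chunks.

```python
from typing import List


def split_by_word(text: str, split_length: int = 200) -> List[str]:
    """Hypothetical helper: group whitespace-separated words into
    chunks of at most `split_length` words each."""
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]


text = "This is a test document. I like MARHSHMALOWS!"

# With the default of 200 words per chunk, the 8-word text stays whole.
default_chunks = split_by_word(text)

# With split_length=2, the same text yields four 2-word chunks.
small_chunks = split_by_word(text, split_length=2)

print(len(default_chunks))
print(small_chunks)
```

Under these assumptions, split_length=1 would be the setting that yields one chunk per word, which matches the behavior described under "Expected behavior".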