Splitting class not working properly

Open darinkishore opened this issue 2 years ago • 1 comment

Describe the bug

When splitting a document, the document content is not split at all, regardless of the parameters passed to the PreProcessor.

Expected behavior

Calling preprocessor.split() with split_by='word' should return a list of Document objects, one for each individual word in document.content.

To Reproduce

# Imports assume Haystack 1.x
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
        clean_whitespace=True,
        clean_empty_lines=True,
        split_by='word',
        split_respect_sentence_boundary=False
    )
test_document = Document(content="This is a test document. I like MARHSHMALOWS!", content_type="text")
documents = preprocessor.process(documents=test_document)
print(documents)

Output:

[<Document: {'content': 'This is a test document. I like MARHSHMALOWS!', 'content_type': 'text', 'score': None, 'meta': {'_split_id': 0}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '1f511088079fecb84ba3f84752985b51'}>]

FAQ Check

[x] Have you had a look at our new FAQ page?

System:

OS: Ubuntu
GPU/CPU: 8x A100
Haystack version (commit or version number): Latest
DocumentStore: FAISSDocumentStore
Reader: n/a
Retriever: EmbeddingRetriever( embedding_model="sentence-transformers/all-distilroberta-v1", model_format="sentence_transformers" )

Oct 05 '23 23:10 darinkishore

Hello, @darinkishore!

Please check the API Reference.

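If not specified, split_length defaults to 200, so with split_by='word' you are currently splitting your Document into chunks of up to 200 words. Your test document has only 8 words, so it all fits into a single chunk and comes back unchanged.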

Try:

preprocessor = PreProcessor(
        split_length=2,
        clean_whitespace=True,
        clean_empty_lines=True,
        split_by='word',
        split_respect_sentence_boundary=False
    )
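
To see why the default gives back a single document, here is a minimal plain-Python sketch of word-based chunking. It is only an illustration, not Haystack's actual implementation: chunk_words is a hypothetical helper, and it ignores cleaning, overlap and sentence boundaries.

```python
from typing import List


def chunk_words(text: str, split_length: int = 200) -> List[str]:
    """Hypothetical helper: group whitespace-separated words into
    chunks of at most `split_length` words (not Haystack code)."""
    if split_length < 1:
        raise ValueError("split_length must be a positive integer")
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]


text = "This is a test document. I like MARHSHMALOWS!"

# Default of 200: the 8 words fit in one chunk, so nothing is split.
print(len(chunk_words(text)))  # 1

# split_length=2: four chunks of two words each.
print(chunk_words(text, split_length=2))
# ['This is', 'a test', 'document. I', 'like MARHSHMALOWS!']

# split_length=1: one chunk per word, which is what the issue expected.
print(len(chunk_words(text, split_length=1)))  # 8
```

By the same logic, if you really want one Document per word, split_length=1 would be the value to try in the PreProcessor.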

Oct 05 '23 23:10 anakin87