TopicNet
`keep_in_memory=False` makes `dataset.get_vw_document()` almost unusable
The method is too slow!
Do we really need dask.dataframe? Maybe it would be better to store documents on disk as individual files (rather than as one big .csv)? See the sketch below.
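A minimal sketch of the "one file per document" idea, assuming one Vowpal Wabbit text per document. The names `docs_dir`, `save_document`, `get_vw_document`, and the `.vw` extension are illustrative only, not the actual TopicNet API:

```python
import os

docs_dir = "dataset_docs"  # hypothetical directory for per-document files
os.makedirs(docs_dir, exist_ok=True)

def save_document(doc_id: str, vw_text: str) -> None:
    # One small file per document; assumes doc_id is filesystem-safe
    with open(os.path.join(docs_dir, doc_id + ".vw"), "w") as f:
        f.write(vw_text)

def get_vw_document(doc_id: str) -> str:
    # Reading a single small file avoids scanning the whole big .csv
    with open(os.path.join(docs_dir, doc_id + ".vw")) as f:
        return f.read()
```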
References:
- An attempt to fix the problem locally: TopicBank-Experiment-BankCreation.ipynb, section "Lower Time Consumption in Case of Big Datasets"
- Please describe the actual scenario (do you get documents one by one, or a whole batch at the same time?)
- It is possible that dask requires some fiddling with its options before use (like running on GPU), but that needs to be investigated.
- Yes, my scenario is exactly the one-by-one case. The intratext coherence score works with Dataset: it retrieves document texts under the hood (many documents, one by one, on each fit iteration)
- Maybe that might help... but so far I have doubts. The referenced notebook shows that reading a document with dask can take nearly 2 sec, whereas reading the same document from a separate file on disk takes approximately 0.005 sec (a rough reproduction sketch is below)
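A rough, hypothetical way to reproduce that comparison. The file name `dataset.csv`, the column names `id` and `vw_text`, and the document id `doc_42` are assumptions; the per-document files are the ones produced in the sketch above:

```python
import time
import dask.dataframe as dd

ddf = dd.read_csv("dataset.csv")

# Looking up one document via dask scans the CSV partitions on each call
start = time.perf_counter()
vw_text = ddf[ddf["id"] == "doc_42"]["vw_text"].compute()
print(f"dask lookup: {time.perf_counter() - start:.3f} s")  # ~2 s in the notebook

# Reading the same document as one small file is a direct read
start = time.perf_counter()
with open("dataset_docs/doc_42.vw") as f:
    vw_text = f.read()
print(f"direct read: {time.perf_counter() - start:.3f} s")  # ~0.005 s in the notebook
```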