
`keep_in_memory=False` makes `dataset.get_vw_document()` almost unusable

Open Alvant opened this issue 5 years ago • 2 comments

The method is too slow!

Do we really need dask.dataframe? Might it be better to store documents on disk as individual files (rather than as one big .csv)?
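A minimal sketch of the one-file-per-document idea, using only the standard library. The names `save_documents` and `load_document`, as well as the `.vw` file extension, are hypothetical and not part of TopicNet; document ids are assumed to be safe to use as file names.

```python
import tempfile
from pathlib import Path


def save_documents(documents, folder):
    """Write each document's text to its own file, named after the document id."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for doc_id, text in documents.items():
        (folder / f"{doc_id}.vw").write_text(text, encoding="utf-8")


def load_document(doc_id, folder):
    """Read a single document without touching any of the others."""
    return (Path(folder) / f"{doc_id}.vw").read_text(encoding="utf-8")


# Usage: store two toy documents in a temporary folder and read one back.
with tempfile.TemporaryDirectory() as tmp:
    save_documents({"doc_1": "doc_1 |@word a b", "doc_2": "doc_2 |@word c"}, tmp)
    print(load_document("doc_1", tmp))
```

With this layout, fetching one document costs a single small file read, regardless of how many documents the collection contains.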

References:

Alvant avatar May 12 '20 08:05 Alvant

  1. Please describe the actual scenario (do you get documents 1 by 1 or a whole bunch at the same time?)
  2. It is possible that dask requires some fiddling with options before usage (like running on GPU) but we need to investigate that.

Evgeny-Egorov-Projects avatar May 25 '20 10:05 Evgeny-Egorov-Projects

  1. Yes, my scenario was exactly the one-by-one case. The intratext coherence score works with Dataset and retrieves document texts under the hood (many documents, one by one, on each fit iteration).
  2. Maybe this will help, but so far I have doubts. The referenced notebook shows that reading a document with dask may take nearly 2 sec, whereas reading the same document from disk takes approximately 0.005 sec.
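A comparison like this could be reproduced with a small timing helper such as the sketch below. `time_call` is a hypothetical name; the reader functions to be timed (for example, a dask-backed lookup versus a plain file read) are supplied by the caller and are not shown here.

```python
import time


def time_call(func, *args, repeats=5):
    """Call func(*args) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result
```

Taking the best of several runs reduces noise from disk caching and other processes, although the first (cold) read may be the more relevant number for the one-by-one access pattern.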

Alvant avatar May 25 '20 14:05 Alvant