TopicNet
`keep_in_memory=False` makes `dataset.get_vw_document()` almost unusable
The method is too slow!
Do we really need dask.dataframe? Maybe it would be better to store documents on disk as individual files (rather than as one big .csv)? See the sketch below.
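A minimal sketch of the "one file per document" idea, assuming one Vowpal Wabbit text per document. The names `docs_dir`, `save_document`, `get_vw_document`, and the `.vw` extension are illustrative only, not the actual TopicNet API:

```python
import os

docs_dir = "dataset_docs"  # hypothetical directory for per-document files
os.makedirs(docs_dir, exist_ok=True)

def save_document(doc_id: str, vw_text: str) -> None:
    # One small file per document; assumes doc_id is filesystem-safe
    with open(os.path.join(docs_dir, doc_id + ".vw"), "w") as f:
        f.write(vw_text)

def get_vw_document(doc_id: str) -> str:
    # Reading a single small file avoids scanning the whole big .csv
    with open(os.path.join(docs_dir, doc_id + ".vw")) as f:
        return f.read()
```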
References:
- An attempt to fix the problem locally: TopicBank-Experiment-BankCreation.ipynb, section "Lower Time Consumption in Case of Big Datasets"
- Please describe the actual scenario (do you get documents one by one, or a whole batch at the same time?)
- It is possible that dask requires some fiddling with its options before use (like running on GPU), but that needs to be investigated.
- Yes, my scenario is exactly the one-by-one case. The intratext coherence score works with Dataset: it retrieves document texts under the hood (many documents, one by one, on each fit iteration)
- Maybe that might help... but so far I have doubts. The referenced notebook shows that reading a document with dask can take nearly 2 sec, whereas reading the same document from a separate file on disk takes approximately 0.005 sec (a rough reproduction sketch is below)
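A rough, hypothetical way to reproduce that comparison. The file name `dataset.csv`, the column names `id` and `vw_text`, and the document id `doc_42` are assumptions; the per-document files are the ones produced in the sketch above:

```python
import time
import dask.dataframe as dd

ddf = dd.read_csv("dataset.csv")

# Looking up one document via dask scans the CSV partitions on each call
start = time.perf_counter()
vw_text = ddf[ddf["id"] == "doc_42"]["vw_text"].compute()
print(f"dask lookup: {time.perf_counter() - start:.3f} s")  # ~2 s in the notebook

# Reading the same document as one small file is a direct read
start = time.perf_counter()
with open("dataset_docs/doc_42.vw") as f:
    vw_text = f.read()
print(f"direct read: {time.perf_counter() - start:.3f} s")  # ~0.005 s in the notebook
```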