Load knowledgebase from zipped json file of pd.DataFrame

Open weixuanfu opened this issue 6 years ago • 5 comments

Instead of using a TSV file, we can use a pickle file (pre-generated from a pd.DataFrame) to load the results into the AI. This is much faster because no eval step is needed to convert the parameters into Python dictionaries; the dictionaries can be pickled directly inside the pd.DataFrame.

But there is one issue with using pickle for the regression knowledgebase: the pickle file is over 200 MB because of the large number of results, while the classification knowledgebase is only 8 MB.
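A minimal sketch of the two loading routes, assuming a toy DataFrame with a hypothetical `parameters` column of dicts (the column names and values are made up, not taken from the real knowledgebase). It uses `ast.literal_eval` as a stand-in for the eval step on the TSV side:

```python
import ast
import os
import tempfile

import pandas as pd

# Toy stand-in for the knowledgebase; column names and values are assumptions.
kb = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "algorithm": ["DecisionTreeClassifier", "LogisticRegression"],
    "parameters": [
        {"max_depth": 3, "criterion": "gini"},
        {"C": 1.0, "penalty": "l2"},
    ],
    "accuracy": [0.91, 0.87],
})

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "kb.tsv")
    pkl_path = os.path.join(tmp, "kb.pkl")

    # TSV route: each dict is written out as its string form ...
    kb.to_csv(tsv_path, sep="\t", index=False)
    from_tsv = pd.read_csv(tsv_path, sep="\t")
    # ... so every row has to be parsed back into a dict after loading.
    from_tsv["parameters"] = from_tsv["parameters"].apply(ast.literal_eval)

    # Pickle route: the dicts are stored as Python objects, no parsing step.
    kb.to_pickle(pkl_path)
    from_pkl = pd.read_pickle(pkl_path)

print(type(from_pkl["parameters"].iloc[0]))
```

Both routes give back the same dicts; the difference is only the per-row parsing that the TSV route needs.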

weixuanfu avatar Jan 16 '20 16:01 weixuanfu

I tried using a JSON file instead of a pickle file because of the large size. The regression knowledgebase in JSON format is ~30 MB.

weixuanfu avatar Jan 23 '20 17:01 weixuanfu

Hmm, actually, the gzipped pickle file is less than 20 MB while the gzipped TSV file is more than 30 MB, so I think we can add both options (JSON/pickle).
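A rough sketch of what supporting both options could look like, assuming pandas' built-in gzip support and the same kind of toy DataFrame (file names, column names, and the `orient="records"` choice are assumptions, not the project's actual layout):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the knowledgebase; column names and values are assumptions.
kb = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "algorithm": ["DecisionTreeClassifier", "LogisticRegression"],
    "parameters": [
        {"max_depth": 3, "criterion": "gini"},
        {"C": 1.0, "penalty": "l2"},
    ],
    "accuracy": [0.91, 0.87],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "kb.pkl.gz")
    json_path = os.path.join(tmp, "kb.json.gz")

    # Option 1: gzipped pickle.
    kb.to_pickle(pkl_path, compression="gzip")
    from_pkl = pd.read_pickle(pkl_path, compression="gzip")

    # Option 2: gzipped JSON; nested dicts are written as JSON objects.
    kb.to_json(json_path, orient="records", compression="gzip")
    from_json = pd.read_json(
        json_path, orient="records", compression="gzip", convert_dates=False
    )

    # Compare the on-disk sizes of the two formats.
    pkl_size = os.path.getsize(pkl_path)
    json_size = os.path.getsize(json_path)

print(f"gzipped pickle: {pkl_size} bytes, gzipped JSON: {json_size} bytes")
```

On a table this small the sizes say nothing about the real knowledgebase; the point is only that both formats bring the `parameters` column back as dicts without an eval step.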

weixuanfu avatar Jan 23 '20 17:01 weixuanfu

image

The screenshot shows that drop_duplicates or DataFrame.apply without hashing is much faster, even though I added one more step to convert the frozenset back to a dict.
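A minimal sketch of that idea, assuming a hypothetical results table whose `parameters` column holds flat dicts with hashable values (the data and column names are made up):

```python
import pandas as pd

# Toy results; rows 0 and 1 have the same parameters in a different key order.
results = pd.DataFrame({
    "algorithm": ["A", "A", "B"],
    "parameters": [
        {"x": 1, "y": 2},
        {"y": 2, "x": 1},
        {"x": 1, "y": 2},
    ],
})

# Dicts are unhashable, so convert each one to a frozenset of its items.
# This only works when every parameter value is itself hashable.
hashable = results.assign(
    parameters=results["parameters"].apply(lambda p: frozenset(p.items()))
)

# drop_duplicates can now compare the parameter sets directly.
deduped = hashable.drop_duplicates(subset=["algorithm", "parameters"])
deduped = deduped.reset_index(drop=True)

# Extra step: turn each frozenset back into a dict.
deduped["parameters"] = deduped["parameters"].apply(dict)

print(len(results), "->", len(deduped))
```

Because a frozenset ignores ordering, two dicts with the same items in a different key order count as duplicates.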

weixuanfu avatar Jan 28 '20 15:01 weixuanfu

Hmm, the new solution above stops working once the classification knowledgebase is merged with the large regression knowledgebase.

I tried to use a new JSON encoder to dump the dicts into a JSON file, but pandas cannot read it, so I kept the current permHash solution.

weixuanfu avatar Jan 28 '20 15:01 weixuanfu

I measured the time spent deduplicating results and it is not very slow: it took only ~5 seconds on my PC. The step that updates the AI with the regression knowledgebase took ~1 minute, which needs some improvement.

weixuanfu avatar Jan 28 '20 15:01 weixuanfu