Aliro Load knowledgebase from zipped json file of pd.DataFrame

Instead of using a tsv file, we can use a pickle file (pre-generated from a pd.DataFrame) to load the results into the AI. This is very fast because no eval step is needed to convert the parameter strings into Python dictionaries: the dictionaries can be pickled as-is inside the pd.DataFrame.
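As a rough illustration of the difference, here is a minimal sketch on a toy DataFrame. The column names and values are made up for the example and are not the real knowledgebase schema, and `ast.literal_eval` stands in for whatever eval step the tsv loader actually uses.

```python
import ast
import io
import pickle

import pandas as pd

# Hypothetical toy knowledgebase; the real columns may differ.
kb = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "algorithm": ["RandomForestRegressor", "Ridge"],
    "parameters": [{"n_estimators": 100, "max_depth": 5}, {"alpha": 1}],
})

# tsv route: dicts are written as strings and must be parsed back row by row.
tsv_text = kb.to_csv(sep="\t", index=False)
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
assert isinstance(from_tsv["parameters"].iloc[0], str)  # still a string here
from_tsv["parameters"] = from_tsv["parameters"].apply(ast.literal_eval)

# pickle route: the dict objects survive the round trip, no parsing needed.
from_pickle = pickle.loads(pickle.dumps(kb))

print(from_tsv["parameters"].tolist())
print(from_pickle["parameters"].tolist())
```

Note that pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.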

But there is one issue with using pickle for the regression knowledgebase: the pickle file is over 200 MB because of the large number of results, while the classification knowledgebase is only 8 MB.

Jan 16 '20 16:01 weixuanfu

I tried using a json file instead of the pickle file because of the large file size. The regression knowledgebase in json format is ~30 MB.

Jan 23 '20 17:01 weixuanfu

Hmm, actually, the gzipped pickle file is less than 20 MB while the gzipped tsv file is more than 30 MB, so I think we can support both options (json/pickle).
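For reference, a small sketch of how both gzipped formats could be written and read back with pandas. The toy DataFrame, the file names and `orient="records"` are assumptions made for this example, not what the PR necessarily uses, and the sizes printed for such a tiny frame say nothing about the real knowledgebase.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy knowledgebase; the real columns may differ.
kb = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "algorithm": ["RandomForestRegressor", "Ridge"],
    "parameters": [{"n_estimators": 100, "max_depth": 5}, {"alpha": 1}],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "kb.pkl.gz")    # hypothetical file name
    json_path = os.path.join(tmp, "kb.json.gz")  # hypothetical file name

    # Write both formats with gzip compression.
    kb.to_pickle(pkl_path, compression="gzip")
    kb.to_json(json_path, orient="records", compression="gzip")

    # Read them back; nested dicts come back as dicts in both cases.
    from_pickle = pd.read_pickle(pkl_path, compression="gzip")
    from_json = pd.read_json(json_path, orient="records", compression="gzip")

    # Compare on-disk sizes in bytes.
    sizes = {
        "pickle.gz": os.path.getsize(pkl_path),
        "json.gz": os.path.getsize(json_path),
    }

print(sizes)
```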

Jan 23 '20 17:01 weixuanfu

The screenshot shows that drop_duplicates or DataFrame.apply without hashing is much faster, even though I added one more step to convert the frozenset back to a dict.
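A minimal sketch of the frozenset idea, on made-up data: the column names and the helper column `param_key` are hypothetical, and it assumes every parameter value is hashable (no lists or nested dicts).

```python
import pandas as pd

# Hypothetical results with one duplicated row (same dataset, algorithm, parameters).
results = pd.DataFrame({
    "dataset": ["d1", "d1", "d2"],
    "algorithm": ["Ridge", "Ridge", "Ridge"],
    "parameters": [{"alpha": 1, "solver": "auto"},
                   {"solver": "auto", "alpha": 1},  # same dict, different key order
                   {"alpha": 1, "solver": "auto"}],
})

# dicts are unhashable, so build a hashable, order-independent key per row.
results["param_key"] = results["parameters"].apply(lambda d: frozenset(d.items()))

# Deduplicate on the hashable columns only.
deduped = results.drop_duplicates(subset=["dataset", "algorithm", "param_key"])

# Extra step: convert the frozenset back to a dict and drop the helper column.
deduped = deduped.assign(parameters=deduped["param_key"].apply(dict))
deduped = deduped.drop(columns="param_key").reset_index(drop=True)

print(deduped)
```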

Jan 28 '20 15:01 weixuanfu

Hmm, the new solution above stops working once the classification knowledgebase is merged with the large regression knowledgebase.

I tried to use a new JSON encoder to dump the dicts into a json file, but pandas cannot read it, so I kept the current permHash solution.

Jan 28 '20 15:01 weixuanfu

I timed the deduplication of results and it is not very slow: it took only ~5 seconds on my PC. The step that updates the AI with the regression knowledgebase took ~1 minute, which needs some improvement.

Jan 28 '20 15:01 weixuanfu