
Question about testing on new data

Open zhaoxy92 opened this issue 6 years ago • 4 comments

Hi, I'm trying to run ZOE on a new dataset and have the following questions:

  1. In main.py, should I comment out runner.elmo_processor.load_cached_embeddings("target.min.embedding.pickle", "wikilinks.min.embedding.pickle")? If so, could you show me how these two files are generated and what the format of their raw versions is? Running on new data is currently extremely slow (30 sentences processed overnight). Any idea how I can speed things up?

  2. Are there any other files/data I need to generate for testing on a new dataset? (maybe vocab_test.txt?)

Thank you!

zhaoxy92, Jul 31 '19 15:07

  1. Processing is slow for non-cached Wikipedia titles, especially on CPUs, because it runs multiple ELMo inferences to generate each title's representation. I could provide a huge SQLite file (~72GB) that contains all the Wikipedia titles. Do you want me to share it? With that file, you could use this function (https://github.com/CogComp/zoe/blob/master/zoe_utils.py#L39) instead of load_cached_embeddings. I also recommend caching your test set, i.e. storing the candidates found for each instance, so that you can tune your type inference at low cost. To do this, I would suggest storing the results in a map and pickling that map.

  2. Everything should work fine if you have your type mapping (inference) part working. The previous point only speeds things up, without any impact on the results.
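The test-set caching suggested in point 1 could look roughly like the sketch below. It is only an illustration of "store results in a map and pickle that map": the helper names (load_cache, save_cache, get_candidates) and the stand-in slow_find_candidates are hypothetical and are not part of the ZOE code base.

```python
import os
import pickle


def load_cache(path):
    """Return the pickled map at `path`, or an empty dict if it does not exist yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}


def save_cache(cache, path):
    """Write the whole map to disk so later runs can skip the slow step."""
    with open(path, "wb") as f:
        pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)


def get_candidates(instance_id, cache, find_candidates):
    """Look up candidates for one test instance, computing them only on a cache miss."""
    if instance_id not in cache:
        cache[instance_id] = find_candidates(instance_id)
    return cache[instance_id]


def slow_find_candidates(instance_id):
    # Hypothetical stand-in for the expensive candidate generation step.
    return ["candidate_for_" + instance_id]
```

A run would call load_cache once, go through get_candidates for every instance, and call save_cache at the end. Later runs that only change the type inference can then reuse the stored candidates. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.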

Slash0BZ, Jul 31 '19 17:07

Thank you. Please share it with me! Really appreciate it!



zhaoxy92, Jul 31 '19 18:07

The file "elmo_cache_correct.db" is now in the Google Drive folder https://drive.google.com/drive/u/1/folders/1fD6WfCEPQICGPhxqlwuVmf8uOot-jQq8?ths=true. Sorry for the delay; it's a huge file to upload.

To use it, please refer to the function linked above, and set server_mode=False.
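Before wiring a downloaded SQLite file into the pipeline, it can help to open it read-only and check which tables it contains. The sketch below uses only Python's standard sqlite3 module; the demo table name and columns are made up for illustration and say nothing about the actual schema of elmo_cache_correct.db.

```python
import sqlite3
from pathlib import Path


def open_read_only(path):
    """Open an existing SQLite file without any chance of modifying it."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# Demo on an in-memory database with a hypothetical layout.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_cache (title TEXT PRIMARY KEY, value BLOB)")
demo.execute("INSERT INTO demo_cache VALUES (?, ?)", ("Example_Title", b"\x00\x01"))
print(list_tables(demo))
```

For the real file, list_tables(open_read_only("elmo_cache_correct.db")) would show which tables are present, assuming the file sits in the working directory.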

Slash0BZ, Aug 01 '19 13:08

Thank you. Downloading it now; I will bother you more if there are any further problems!

zhaoxy92, Aug 09 '19 22:08