How to run the silver annotation pipeline

Open foolfun opened this issue 5 years ago • 40 comments

I ran the DrKIT code, which references 'sling/local/data/distant/facts-0000%d-of-00010.json'. I have no idea how to get these files. How can I generate them?

foolfun avatar Nov 26 '20 02:11 foolfun

+1

CSQianDong avatar Nov 26 '20 09:11 CSQianDong

The "silver annotation pipeline" is not yet properly documented as it is still under development, but you should be able to run it. First you run the wiki pipeline as described here. Then you need to build an IDF table using this command:

sling build_idf

Then you can run silver annotation on all the Wikipedia articles:

sling silver_annotation

It takes quite a while to run the silver annotation pipeline (10 hours on my machine). Please let me know if this works for you.

ringgaard avatar Nov 26 '20 13:11 ringgaard

Hi, thanks for your reply. When I run sling silver_annotation, I get this error message: [2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0

Do you have any idea what causes this error?

CSQianDong avatar Nov 27 '20 10:11 CSQianDong

I remember having seen this error before. Let me check if there are some changes from the dev branch that I haven't submitted to the master branch.

ringgaard avatar Nov 27 '20 11:11 ringgaard

Thanks for your reply! When I ran 'sling fuse_items', I hit the problem described in https://github.com/ringgaard/sling/issues/4. I have no idea why it happened. Can you help me?

foolfun avatar Nov 27 '20 14:11 foolfun

It seems like I will have to do a complete test run of the wiki and silver annotation pipelines. I run these in a slightly different mode using wiki snapshots to get a wikidata dump and the reconciler for fusing items. It seems like there is some bug in the old pipeline.

You should check if you have enough disk space. You will need something like 500 GB of free space on your hard drive, including your temp directory (usually /tmp). There have been reports that out-of-disk-space conditions are not always reported correctly. You should also check that you don't have a bunch of temp files left over from runs that crashed. You can remove old temp files using this command:

rm -r /tmp/local.*
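The two checks above (free space and leftover temp files) can also be scripted before starting a long run. The following is only a minimal sketch using the Python standard library: the 500 GB threshold comes from the rough estimate above, and the function names `free_bytes`, `stale_temp_dirs`, and `preflight` are made up for illustration and are not part of SLING.

```python
import glob
import os
import shutil
import tempfile

# Assumption: roughly 500 GB free is needed, per the estimate in this thread.
REQUIRED_FREE_BYTES = 500 * 10**9


def free_bytes(path):
    """Return the number of free bytes on the file system holding `path`."""
    return shutil.disk_usage(path).free


def stale_temp_dirs(tmpdir):
    """List leftover entries matching local.* in the temp directory."""
    return sorted(glob.glob(os.path.join(tmpdir, "local.*")))


def preflight(tmpdir=None, required=REQUIRED_FREE_BYTES):
    """Report free space and stale temp entries for the temp directory.

    If no directory is given, tempfile.gettempdir() is used, which honors
    the TMPDIR environment variable.
    """
    tmpdir = tmpdir or tempfile.gettempdir()
    free = free_bytes(tmpdir)
    return {
        "tmpdir": tmpdir,
        "free_gb": free / 10**9,
        "enough_space": free >= required,
        "stale": stale_temp_dirs(tmpdir),
    }
```

Calling `preflight()` returns a small report. The sketch only lists the stale entries and leaves deleting them to the `rm` command above.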

It is going to take a while to rerun the pipelines, so please be patient. I will try to do this over the weekend. I have a server upgrade on Sunday, which will also delay this.

ringgaard avatar Nov 27 '20 14:11 ringgaard

Thank you so much. I will take your advice to try it again.

foolfun avatar Nov 28 '20 07:11 foolfun

I have enough disk space, removed the old temp files, and ran the following commands. However, I hit the same problem again. Looking forward to your reply.

export TMPDIR=/mnt/hdd1/tmp

sling build_wiki --lbzip2 --languages en

image

foolfun avatar Nov 28 '20 12:11 foolfun

@foolfun try sling build_wiki without the --lbzip2 and --languages flags. It works for me.

CSQianDong avatar Nov 28 '20 13:11 CSQianDong

I think I managed to fix the error that caused fuse_items to crash, so if you sync to HEAD you should be able to run the wiki pipeline. See this commit.

You can just resume from the fuse_items stage, so you don't need to re-run the whole wiki pipeline again:

sling fuse_items build_kb extract_names build_nametab build_phrasetab
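If a later stage fails instead, the same idea of resuming from a given stage can be sketched as a small helper. This is only an illustration: the stage order is copied from the command above, the helper name `resume_command` is hypothetical, and it assumes the sling command accepts any tail of this stage list in the same way it accepts the full list.

```python
# Stage order as listed in the resume command above. Earlier wiki stages are
# deliberately omitted because they are not named in this thread.
STAGES = [
    "fuse_items",
    "build_kb",
    "extract_names",
    "build_nametab",
    "build_phrasetab",
]


def resume_command(first_stage, stages=STAGES):
    """Build the argument list that reruns `first_stage` and all later stages."""
    if first_stage not in stages:
        raise ValueError("unknown stage: " + first_stage)
    start = stages.index(first_stage)
    return ["sling"] + stages[start:]
```

The returned list can be passed to `subprocess.run(..., check=True)` so that a failing stage stops the run instead of being silently skipped.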

Next, I will try to see if I can reproduce the CHECK fault in the silver annotation pipeline: [2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0

ringgaard avatar Nov 28 '20 21:11 ringgaard

It works! I have been stuck on this issue for nearly two weeks. Thank you very much!

foolfun avatar Nov 29 '20 07:11 foolfun

I have run the silver annotation pipeline and the result is shown in the following picture. However, I still cannot find 'local/data/e/silver/en/silver-00000-of-00010.rec'. I don't know whether I missed some important steps. Can you help me? image

By the way, the files I can find are: image

foolfun avatar Dec 01 '20 02:12 foolfun

The output looks correct. The silver-annotated Wikipedia documents are in train-*.rec and eval.rec. Together these contain all the Wikipedia articles. They are split into train and eval because I use this data as noisy training data for the semantic parser.

NB: I did a complete run of the silver annotation pipeline and did not get the Check failed: num >= 0 error. This error could be due to Wikidata errors in the date items. My version of Wikidata is from Nov 25.

ringgaard avatar Dec 01 '20 09:12 ringgaard

I cannot find 'e/silver/en/silver-0000%d-of-00010.rec'. How can I get the file?

foolfun avatar Dec 01 '20 13:12 foolfun

From where did you get the impression that the silver annotations should be in e/silver/en/silver-0000%d-of-00010.rec?

The silver annotations are in local/data/e/silver/en/train-?????-of-00010.rec and local/data/e/silver/en/eval.rec. You can take a look at the data with the codex tool:

bin/codex data/e/silver/en/train-00000-of-00010.rec | less

Each record is a Wikipedia article and contains the title, the raw text, the tokens, and the mentions with evoked frames.
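To make the file layout concrete, the sharded names can be generated rather than typed by hand. This is a minimal sketch that only builds path strings from the pattern given above (ten train shards plus eval.rec). The helper names `shard_paths` and `silver_files` are made up for illustration, and the code does not read the record files themselves.

```python
def shard_paths(directory, basename, num_shards, ext):
    """Expand a sharded file set into names like base-00000-of-00010.ext."""
    return [
        f"{directory}/{basename}-{i:05d}-of-{num_shards:05d}{ext}"
        for i in range(num_shards)
    ]


def silver_files(directory="local/data/e/silver/en", num_shards=10):
    """List the train shards followed by the single eval file."""
    train = shard_paths(directory, "train", num_shards, ".rec")
    return train + [f"{directory}/eval.rec"]
```

With the defaults, `silver_files()` yields eleven paths, starting with `local/data/e/silver/en/train-00000-of-00010.rec`.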

ringgaard avatar Dec 01 '20 14:12 ringgaard

Hi Ringgaard, I'm using google/sling to get the silver annotations, and I hit the "Check failed: num >= 0" problem. I use google/sling because I don't have sudo rights to build sling from your repository. Do you have any idea how to deal with this?

CSQianDong avatar Dec 02 '20 02:12 CSQianDong

Sorry, I am trying to run distantly_supervise.py, which needs silver-0000%d-of-00010.rec in line 543. That is why I asked how to get this file.

I tried replacing silver-0000%d-of-00010.rec with train-0000%d-of-00010.rec, but then kb_item in line 348 is None, so I guess this approach does not work. Do you have any idea how to deal with this?

foolfun avatar Dec 02 '20 07:12 foolfun

image When using google/sling, I got the silver-* files, but they are not correct because the processing did not complete.

CSQianDong avatar Dec 02 '20 07:12 CSQianDong

The problem seems to be that the distantly_supervise.py script expects the silver data to be indexed by QIDs but the silver pipeline assigns random keys in order to shuffle the data set for training. How many documents do you need to extract? Is it all of them or just a small subset?

ringgaard avatar Dec 02 '20 11:12 ringgaard

all of them, I think

CSQianDong avatar Dec 02 '20 12:12 CSQianDong

There are basically two solutions: either take the train and eval files and reindex them, or make a new silver workflow that is compatible with the old mode.

Let me first check out how difficult it would be to make a custom silver workflow that produces the output that distantly_supervise.py expects.
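For reference, the first option (reindexing the train and eval files) amounts to something like the sketch below. It is purely illustrative and works on in-memory (key, document) pairs: the `qid_of` callback is hypothetical, since how a QID is recovered from a silver document is not stated in this thread, and reading or writing the actual record files is left out.

```python
def reindex_by_qid(records, qid_of):
    """Re-key (key, document) pairs by QID, dropping the random keys.

    `records` is any iterable of (key, document) pairs and `qid_of` is a
    caller-supplied function that returns the QID of a document, or None
    if the document has no QID. Returns the QID index and the number of
    documents skipped because no QID was found.
    """
    indexed = {}
    skipped = 0
    for _random_key, doc in records:
        qid = qid_of(doc)
        if qid is None:
            skipped += 1
            continue
        indexed[qid] = doc
    return indexed, skipped
```

Holding all documents in one dictionary is only reasonable for a subset. For the full Wikipedia, the reindexed records would have to be written back to disk instead.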

ringgaard avatar Dec 02 '20 13:12 ringgaard

Thanks a lot!

CSQianDong avatar Dec 02 '20 13:12 CSQianDong

With the Python script below you should be able to produce the silver-*.rec output that should be compatible with distantly_supervise.py:

import sling
import sling.flags as flags
import sling.log as log
import sling.task.workflow as workflow
import sling.task.wiki as wiki
import sling.task.corpora as corpora

flags.parse()
workflow.startup()

language = flags.arg.language
workdir = flags.arg.workdir + "/silver/" + language

wf = workflow.Workflow("silver")
wikiwf = wiki.WikiWorkflow(wf=wf)

indocs = wikiwf.wikipedia_documents(language)
outdocs = wf.resource("silver@10.rec", dir=workdir, format="records/document")
idf = wf.resource("idf.repo", dir=workdir, format="repository")

config = corpora.repository("data/wiki/" + language + "/silver.sling")
phrases = corpora.repository("data/wiki/" + language) + "/phrases.txt"

mapper = wf.task("document-processor", "labeler")
mapper.add_annotator("mentions")
mapper.add_annotator("anaphora")
mapper.add_annotator("phrase-structure")
mapper.add_annotator("relations")

mapper.add_param("resolve", True)
mapper.add_param("language", language)

mapper.attach_input("commons", wikiwf.knowledge_base())
mapper.attach_input("commons", wf.resource(config, format="store/frame"))

mapper.attach_input("aliases", wikiwf.phrase_table(language))
mapper.attach_input("dictionary", idf)
mapper.attach_input("phrases", wf.resource(phrases, format="lex"))

wf.connect(wf.read(indocs), mapper)
output = wf.channel(mapper, format="message/document")
wf.write(output, outdocs)

workflow.run(wf)
workflow.shutdown()

You can check the output with this command:

bin/codex --lex local/data/e/silver/en/silver* 

ringgaard avatar Dec 02 '20 13:12 ringgaard

Hmm... My test run seems to indicate that the script above does not read the stopwords and blacklists correctly, resulting in many spammy annotations. Let me try to fix this.

ringgaard avatar Dec 02 '20 14:12 ringgaard

image

Hmm... When I ran this script, I got the same error.

CSQianDong avatar Dec 02 '20 14:12 CSQianDong

Is there a stack trace below the "Check failed:" line?

ringgaard avatar Dec 02 '20 14:12 ringgaard

(core dumped)

CSQianDong avatar Dec 02 '20 14:12 CSQianDong

The CHECK fault indicates that some invalid date is being processed. You could just comment out the CHECK in line 41 of calendar.cc. It would cause some invalid dates in the output annotations, but without further information, I don't know how to fix this.

ringgaard avatar Dec 02 '20 14:12 ringgaard

I have updated the Python script above to include the configuration of stopwords and blacklists. The following lines were missing:

config = corpora.repository("data/wiki/" + language + "/silver.sling")
mapper.attach_input("commons", wf.resource(config, format="store/frame"))

This should remove a lot of spammy annotations for common words and phrases.

ringgaard avatar Dec 02 '20 14:12 ringgaard

Hi, ringgaard! When I ran the script, I ran into this problem: image

foolfun avatar Dec 02 '20 14:12 foolfun