How to run the silver annotation pipeline
I ran the DrKIT code, which expects 'sling/local/data/distant/facts-0000%d-of-00010.json', but I have no idea how to generate this file. How can I get it?
+1
The "silver annotation pipeline" is not yet properly documented, as it is still under development, but you should be able to run it. First, run the wiki pipeline as described here. Then build an IDF table using this command:
sling build_idf
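Conceptually, the IDF table scores how informative each word is across the corpus. As a rough, self-contained sketch (plain Python for illustration, not the actual SLING implementation):

```python
import math
from collections import Counter

def build_idf(documents):
    """Compute IDF(t) = log(N / df(t)), where df(t) is the number of
    documents containing term t and N is the number of documents."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log(n / count) for term, count in df.items()}

docs = [["the", "cat"], ["the", "dog"], ["a", "cat"]]
idf = build_idf(docs)
# Rarer terms get higher IDF scores, e.g. idf["a"] > idf["the"]
```

The table lets the annotator down-weight mentions of very common words when matching phrases against the alias table.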
Then you can run silver annotation on all the Wikipedia articles:
sling silver_annotation
It takes quite a while to run the silver annotation pipeline (10 hours on my machine). Please let me know if this works for you.
Hi, thanks for your reply. When I run sling silver_annotation, I get this error message: [2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0. Do you have any idea what causes it?
I remember having seen this error before. Let me check if there are some changes from the dev branch that I haven't submitted to the master branch.
Thanks for your reply! When I ran ‘ sling fuse_items’ met https://github.com/ringgaard/sling/issues/4. I have no idea why it happened, can you help me?
It seems like I will have to do a complete test run of the wiki and silver annotation pipelines. I run these in a slightly different mode, using wiki snapshots to get a Wikidata dump and the reconciler for fusing items, so there may be a bug in the old pipeline.
You should check that you have enough disk space. You will need something like 500 GB of free space on your hard drive, including your temp directory (usually /tmp). There have been reports that out-of-disk-space conditions are not always reported correctly. You should also check that you don't have a bunch of temp files left over from runs that crashed. You can remove old temp files with this command:
rm -r /tmp/local.*
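To avoid running out of space mid-run, you can check free space on both the work directory and the temp directory up front. A small sketch using only the Python standard library (the 500 GB threshold follows the advice above):

```python
import shutil
import tempfile

def enough_space(path, need_gb=500):
    """Return True if the filesystem holding `path` has at least
    `need_gb` gigabytes free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= need_gb

# Check both the working directory and the temp directory (usually /tmp)
for path in [".", tempfile.gettempdir()]:
    print(path, "ok" if enough_space(path) else "NOT ENOUGH SPACE")
```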
It is going to take a while to rerun the pipelines, so please be patient. I will try to do this over the weekend. I have a server upgrade on Sunday, which will also delay this.
Thank you so much. I will take your advice to try it again.
I have enough disk space, removed the old temp files, and ran the commands below. However, I hit the same problem again. Looking forward to your reply.
export TMPDIR=/mnt/hdd1/tmp
sling build_wiki --lbzip2 --languages en

@foolfun try sling build_wiki without --lbzip2 and --languages. It works for me.
I think I managed to fix the error that caused fuse_items to crash, so if you sync to HEAD you should be able to run the wiki pipeline. See this commit.
You can just resume from the fuse_items stage, so you don't need to re-run the whole wiki pipeline again:
sling fuse_items build_kb extract_names build_nametab build_phrasetab
Next, I will try to see if I can reproduce the CHECK fault in the silver annotation pipeline: [2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0
It works! This issue had been troubling me for nearly two weeks. Thank you very much!
I have run the silver annotation pipeline and the output is shown in the following screenshot. However, I still cannot find 'local/data/e/silver/en/silver-00000-of-00010.rec'. I don't know whether I missed some important steps. Can you help me?

By the way, the files I can find are:

The output looks correct. The silver-annotated Wikipedia documents are in train-*.rec and eval.rec. Together these contain all the Wikipedia articles. They are split into train and eval because I use this data as noisy training data for the semantic parser.
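The split itself is just a deterministic partition of the shuffled documents into shards. Schematically (illustrative only; the shard count and eval fraction here are assumptions, not the pipeline's actual parameters):

```python
import hashlib

def output_file(key, train_shards=10, eval_permille=10):
    """Deterministically assign a document (by its shuffled key) to the
    eval file or to one of the train shards."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    if h % 1000 < eval_permille:  # ~1% of documents held out for eval
        return "eval.rec"
    return "train-%05d-of-%05d.rec" % (h % train_shards, train_shards)

print(output_file("doc-12345"))
```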
NB: I did a complete run of the silver annotation pipeline and it did not get the Check failed: num >= 0 error. This error could be due to Wikidata errors in the date items. My version of Wikidata is from Nov 25.
I cannot find 'e/silver/en/silver-0000%d-of-00010.rec'. How can I get this file?
From where did you get the impression that the silver annotations should be in e/silver/en/silver-0000%d-of-00010.rec?
The silver annotations are in local/data/e/silver/en/train-?????-of-00010.rec and local/data/e/silver/en/eval.rec. You can take a look at the data with the codex tool:
bin/codex data/e/silver/en/train-00000-of-00010.rec | less
Each record is a Wikipedia article and contains the title, the raw text, the tokens, and the mentions with evoked frames.
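As a rough sketch of what one record holds (plain Python dicts standing in for SLING frames; the field names here are illustrative, not the actual SLING schema):

```python
# Illustrative structure of one silver record (not the real SLING frame schema)
record = {
    "title": "Douglas Adams",
    "text": "Douglas Adams was an English author ...",
    "tokens": ["Douglas", "Adams", "was", "an", "English", "author"],
    "mentions": [
        # each mention spans a token range and evokes a frame (e.g. a QID)
        {"begin": 0, "end": 2, "evokes": "Q42"},
    ],
}

for m in record["mentions"]:
    phrase = " ".join(record["tokens"][m["begin"]:m["end"]])
    print(phrase, "->", m["evokes"])
```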
Hi ringgaard, I'm using google/sling to get silver annotations, and I get the same "Check failed: num >= 0" error. I don't have sudo rights to build sling from your repository. Do you have any idea how to deal with this?
Sorry, I am trying to run distantly_supervise.py, which needs silver-0000%d-of-00010.rec in line 543. That is why I wanted to ask how to get this file.
I tried replacing silver-0000%d-of-00010.rec with train-0000%d-of-00010.rec, but then line 348 failed because kb_item is None. So I guess this approach may not work. Do you have any idea how to deal with this?
When using google/sling, I got the silver-* files, but they are not correct because processing did not complete.
The problem seems to be that the distantly_supervise.py script expects the silver data to be indexed by QIDs, but the silver pipeline assigns random keys in order to shuffle the data set for training. How many documents do you need to extract? Is it all of them or just a small subset?
all of them, I think
There are basically two solutions: either take the train and eval files and reindex them, or make a new silver workflow that is compatible with the old mode.
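The first option, reindexing, is conceptually simple. A sketch with plain dicts standing in for the record files (extracting the QID from a document is the part that depends on the SLING frame schema, so `qid_of` is left as a caller-supplied function):

```python
def reindex_by_qid(records, qid_of):
    """Rekey (random_key, document) pairs from train-*/eval.rec by the
    Wikidata QID of each document."""
    return {qid_of(doc): doc for _key, doc in records}

# Toy records with illustrative fields
records = [("k7f2", {"qid": "Q42", "title": "Douglas Adams"}),
           ("a91c", {"qid": "Q1", "title": "Universe"})]
by_qid = reindex_by_qid(records, lambda d: d["qid"])
print(sorted(by_qid))  # keys are now QIDs instead of random shuffle keys
```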
Let me first check out how difficult it would be to make a custom silver workflow that produces the output that distantly_supervise.py expects.
Thanks a lot!
With the Python script below you should be able to produce the silver-*.rec output that should be compatible with distantly_supervise.py:
import sling
import sling.flags as flags
import sling.log as log
import sling.task.workflow as workflow
import sling.task.wiki as wiki
import sling.task.corpora as corpora
flags.parse()
workflow.startup()
language = flags.arg.language
workdir = flags.arg.workdir + "/silver/" + language
wf = workflow.Workflow("silver")
wikiwf = wiki.WikiWorkflow(wf=wf)
indocs = wikiwf.wikipedia_documents(language)
outdocs = wf.resource("silver@10.rec", dir=workdir, format="records/document")
idf = wf.resource("idf.repo", dir=workdir, format="repository")
config = corpora.repository("data/wiki/" + language + "/silver.sling")
phrases = corpora.repository("data/wiki/" + language + "/phrases.txt")
mapper = wf.task("document-processor", "labeler")
mapper.add_annotator("mentions")
mapper.add_annotator("anaphora")
mapper.add_annotator("phrase-structure")
mapper.add_annotator("relations")
mapper.add_param("resolve", True)
mapper.add_param("language", language)
mapper.attach_input("commons", wikiwf.knowledge_base())
mapper.attach_input("commons", wf.resource(config, format="store/frame"))
mapper.attach_input("aliases", wikiwf.phrase_table(language))
mapper.attach_input("dictionary", idf)
mapper.attach_input("phrases", wf.resource(phrases, format="lex"))
wf.connect(wf.read(indocs), mapper)
output = wf.channel(mapper, format="message/document")
wf.write(output, outdocs)
workflow.run(wf)
workflow.shutdown()
You can check the output with this command:
bin/codex --lex local/data/e/silver/en/silver*
Hmm... My test run seems to indicate that the script above does not read the stopword lists and blacklists correctly, resulting in many spammy annotations. Let me try to fix this.

Hmm... When I ran this script, I got the same error.
Is there a stack trace below the "Check failed:" line?
(core dumped)
The CHECK fault indicates that some invalid date is being processed. You could just comment out the CHECK in line 41 of calendar.cc. It would cause some invalid dates in the output annotations, but without further information, I don't know how to fix this.
I have updated the Python script above to include the configuration of stopwords and blacklists. The following lines were missing:
config = corpora.repository("data/wiki/" + language + "/silver.sling")
mapper.attach_input("commons", wf.resource(config, format="store/frame"))
This should remove a lot of spammy annotations for common words and phrases.
Hi ringgaard! When I ran the script, I ran into this problem:
