cnndm_acl18
Code to obtain the training data for the ACL 2018 paper "Neural Document Summarization by Jointly Learning to Score and Select Sentences"
Data processing for NeuSum
This repo contains the code to generate the training data (CNN/Daily Mail) needed by NeuSum.
1. Preprocess the CNN/DM dataset using abisee's scripts: https://github.com/abisee/cnn-dailymail
2. Convert its output to the format shown in the `sample_data` folder. The file format:
   - `train.txt.src` contains the input documents. Each line holds the tokenized sentences of one document, delimited by `##SENT##`.
   - `train.txt.tgt` contains the reference summaries. Each line holds the tokenized summary sentences of the corresponding document, delimited by `##SENT##`.
3. Use `find_oracle.py` to search for the best sentences to extract. The arguments of its `main` function are `document_file`, `summary_file`, and `output_path`.
4. Build the ROUGE score gain file with `get_mmr_regression_gain.py`. The usage is shown in the code entry.
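For reference, a minimal sketch (not part of the repo) of reading the `##SENT##`-delimited files described in step 2:

```python
def read_sents(path):
    """Read a ##SENT##-delimited file; yield one list of sentences per line (document)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the sentence delimiter and drop empty fragments.
            yield [s.strip() for s in line.strip().split("##SENT##") if s.strip()]
```

Applied to a pair of `train.txt.src` / `train.txt.tgt` files, this yields parallel document/summary sentence lists.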
Note
The algorithm is a brute-force search, which can be slow in some cases, so running it in parallel is recommended (which is what I did in my experiments).
Recently, I modified find_oracle.py slightly to use multiprocessing so that it is easier to run in parallel. Please check out find_oracle_para.py.
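The brute-force oracle search can be sketched as follows. This is an illustrative stand-in, not the repo's code: it uses unigram F1 as a crude proxy for the ROUGE scoring that `find_oracle.py` performs, and `max_sents` is an assumed cap on the size of the extracted set.

```python
from itertools import combinations
from collections import Counter

def unigram_f1(candidate, reference):
    """Unigram F1 between two token lists (a crude stand-in for ROUGE-1)."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def find_oracle(doc_sents, summary_tokens, max_sents=3):
    """Exhaustively try all sentence subsets up to max_sents and return the
    index tuple whose concatenation best matches the reference summary."""
    best, best_score = (), 0.0
    for k in range(1, max_sents + 1):
        for idx in combinations(range(len(doc_sents)), k):
            cand = [t for i in idx for t in doc_sents[i].split()]
            score = unigram_f1(cand, summary_tokens)
            if score > best_score:
                best, best_score = idx, score
    return best
```

The exhaustive subset enumeration is exactly why the search is slow on long documents and benefits from being run in parallel.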