Preprocess stuck
🐛 Bug
On executing python scripts/preprocess.py cnndm --mode pipeline
preprocessing gets stuck at this point:

Some of the oraclewords are not generated either.

Environment
- fairseq Version (e.g., 1.0 or master): recommended commit
- PyTorch Version (e.g., 1.0): 1.8
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): source
- Python version: 3.6.8
- CUDA/cuDNN version: 10.2
@jxhe @muggin
Hi @nikhilrayaprolu,
I faced the same problem as you. It happened because the preprocessing script needed more than 32 GB of RAM when run on the whole cnndm training set. I would suggest splitting the train set into several parts, then merging the outputs once preprocessing has finished on each part.
Thanks for the reply, @geeraay.
@geeraay, can you explain in more detail how the splitting and merging are done? Any accompanying code would be really helpful.
I don't remember the exact steps I took back then, but the idea is this.
I did something like:
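split -d -n l/${nsplit} /path-to-file/train.source /path-to-file/train.source.
This creates train.source.00, train.source.01, ..., one file per part, with the last suffix being ${nsplit} minus 1. (The -d flag requests numeric suffixes; without it, GNU split names the parts aa, ab, and so on.)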
Then I renamed the generated files to
train_1.source, train_2.source, ..., train_${nsplit}.source.
After that you could run
python scripts/preprocess.py cnndm --mode pipeline --split train_1,train_2,...,train_${nsplit}
Once preprocessing was done, I manually copied and pasted the generated files, in order, into one big train.source file.
Or you can simply use a machine with more RAM and preprocess without splitting the file.
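Since code was requested, here is a minimal Python sketch of the split and merge steps, not the exact commands used back then. It splits by line count instead of byte size, so the same call on a parallel file with the same number of lines (for example a train.target, assuming one exists next to train.source) gives chunks that stay aligned line by line. The helper names split_lines and merge_parts are hypothetical and are not part of scripts/preprocess.py.

```python
from pathlib import Path


def split_lines(path, nsplit, stem="train"):
    """Split `path` into `nsplit` contiguous parts by line count.

    Parts are written next to the input file as
    <stem>_1<suffix>, ..., <stem>_<nsplit><suffix>,
    e.g. train_1.source, train_2.source for train.source.
    """
    path = Path(path)
    # First pass: count lines so the parts can be sized evenly.
    with path.open(encoding="utf-8") as f:
        total = sum(1 for _ in f)
    base, extra = divmod(total, nsplit)

    parts = []
    # Second pass: stream lines into the parts without loading the file.
    with path.open(encoding="utf-8") as f:
        for i in range(1, nsplit + 1):
            part = path.with_name(f"{stem}_{i}{path.suffix}")
            # The first `extra` parts take one additional line each.
            count = base + (1 if i <= extra else 0)
            with part.open("w", encoding="utf-8") as out:
                for _ in range(count):
                    out.write(f.readline())
            parts.append(part)
    return parts


def merge_parts(parts, merged_path):
    """Concatenate `parts`, in the given order, into `merged_path`."""
    with open(merged_path, "w", encoding="utf-8") as out:
        for part in parts:
            with open(part, encoding="utf-8") as f:
                for line in f:
                    # Guard against a part whose last line lacks a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
```

For example, split_lines("/path-to-file/train.source", 4) writes train_1.source through train_4.source, which matches --split train_1,train_2,train_3,train_4 in the command above. Which generated files then need merging depends on what the preprocessing script writes for each split; the thread does not list them, so pass merge_parts whatever per-part outputs you find, in part order.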