ctrl-sum Preprocess stuck

🐛 Bug

When I run python scripts/preprocess.py cnndm --mode pipeline, preprocessing gets stuck at this point:

Some of the oraclewords are not generated either.

Environment

fairseq Version (e.g., 1.0 or master): recommended commit
PyTorch Version (e.g., 1.0) : 1.8
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): source
Build command you used (if compiling from source): source
Python version: 3.6.8
CUDA/cuDNN version: 10.2

May 14 '21 08:05 nikhilrayaprolu

@jxhe @muggin

May 14 '21 08:05 nikhilrayaprolu

Hi @nikhilrayaprolu,

I faced the same problem as you. It happened because running the preprocessing script on the whole cnndm training dataset took more than 32GB of RAM. I would suggest splitting the train set into several parts, then merging the results after preprocessing of those parts has finished.

May 23 '21 14:05 geeraay

thanks for the reply @geeraay

May 24 '21 06:05 nikhilrayaprolu

@geeraay can you provide some more explanation of how the splitting and merging is done? Any accompanying code would be really helpful.

May 25 '21 12:05 nikhilrayaprolu

I don't remember the exact steps I took back then, but the idea is this.

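I did something like split -d -n l/${nsplit} /path-to-file/train.source /path-to-file/train.source. (the trailing dot is part of the output prefix). With -d for numeric suffixes, GNU split creates train.source.00, train.source.01, ..., ending at suffix nsplit-1; without -d the suffixes are aa, ab, and so on.

Then I renamed the generated files to train_1.source, train_2.source, ..., train_${nsplit}.source.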

After that you could run python scripts/preprocess.py cnndm --mode pipeline --split train_1,train_2,...,train_${nsplit}

Wait until the preprocessing step is done. Then I manually merged the generated files, in part order, into one big train.source file.

Or you can simply use a machine with more RAM to preprocess without splitting the file.
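
The split-and-merge steps above could also be scripted. Below is a minimal Python sketch, not the ctrl-sum authors' method. It assumes one example per line, and that train.target (not mentioned above) has to be split the same way as train.source. It splits by line count, because split -n l/N balances parts by byte size, so splitting two files separately with it would not necessarily cut them at the same line. The function names split_by_lines and merge_parts are made up for illustration, and which files preprocess.py writes for each part is not stated in this thread, so the merge helper simply takes whatever list of part files you give it.

```python
from pathlib import Path


def split_by_lines(path, nsplit):
    """Split `path` into `nsplit` parts with (almost) equal line counts.

    train.source becomes train_1.source, train_2.source, ... in the same
    directory. Files are handled as bytes so line endings stay untouched.
    """
    path = Path(path)
    with path.open("rb") as f:
        total = sum(1 for _ in f)
    per_part = -(-total // nsplit)  # ceiling division

    parts = []
    with path.open("rb") as src:
        for i in range(1, nsplit + 1):
            part = path.with_name(f"{path.stem}_{i}{path.suffix}")
            with part.open("wb") as out:
                for _ in range(per_part):
                    line = src.readline()
                    if not line:  # input exhausted; later parts stay empty
                        break
                    out.write(line)
            parts.append(part)
    return parts


def merge_parts(parts, merged):
    """Concatenate `parts`, in the given order, into the file `merged`."""
    with open(merged, "wb") as out:
        for part in parts:
            last = b"\n"
            with open(part, "rb") as f:
                for line in f:
                    out.write(line)
                    last = line[-1:]
            if last != b"\n":
                # keep the last line of this part from fusing with the next part
                out.write(b"\n")
```

A possible way to use it: call split_by_lines on both /path-to-file/train.source and /path-to-file/train.target with the same nsplit (both files need the same number of lines for the parts to line up), run preprocess.py with --split train_1,train_2,... as described above, then call merge_parts once per generated file type, passing the per-part files in order from part 1 to part nsplit.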

May 26 '21 06:05 geeraay