Preprocess stuck

Open nikhilrayaprolu opened this issue 4 years ago • 5 comments

🐛 Bug

When I run python scripts/preprocess.py cnndm --mode pipeline, preprocessing gets stuck at this point:

image

Some of the oraclewords are not generated either.

image

Environment

  • fairseq Version (e.g., 1.0 or master): recommended commit
  • PyTorch Version (e.g., 1.0) : 1.8
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): source
  • Python version: 3.6.8
  • CUDA/cuDNN version: 10.2

nikhilrayaprolu avatar May 14 '21 08:05 nikhilrayaprolu

@jxhe @muggin

nikhilrayaprolu avatar May 14 '21 08:05 nikhilrayaprolu

Hi @nikhilrayaprolu,

I faced the same problem. It happened because running the preprocessing script on the whole cnndm training set took more than 32GB of RAM. I suggest splitting the train set into several parts, then merging the outputs once preprocessing of those parts has finished.

geeraay avatar May 23 '21 14:05 geeraay

thanks for the reply @geeraay

nikhilrayaprolu avatar May 24 '21 06:05 nikhilrayaprolu

@geeraay can you explain in more detail how the splitting and merging is done? Any accompanying code would be really helpful.

nikhilrayaprolu avatar May 25 '21 12:05 nikhilrayaprolu

I don't remember the exact steps I took back then, but the idea is this.

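I did something like split -d -n l/${nsplit} /path-to-file/train.source /path-to-file/train.source. (the trailing dot is part of the output prefix; with GNU split, -d is what gives numeric suffixes, otherwise the parts are named aa, ab, ...). It will create ${nsplit} files: train.source.00, train.source.01, and so on, with the last suffix being ${nsplit} minus 1.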

Then I renamed the generated files to train_1.source, train_2.source, ..., train_${nsplit}.source.
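The split and rename steps can also be done in one go. Below is a minimal Python sketch, not the exact procedure used here: split_by_lines is a hypothetical helper, and it splits by line count rather than by byte size as split -n l/N does. The reason for that choice is an assumption about the data layout: if a parallel file such as train.target sits next to train.source with the same number of lines (the thread only mentions train.source), splitting both with the same line count keeps the parts aligned.

```python
from itertools import islice
from pathlib import Path


def split_by_lines(path, nsplit):
    """Split a text file into nsplit parts named <stem>_1<suffix>, <stem>_2<suffix>, ...

    Hypothetical helper: e.g. train.source -> train_1.source, train_2.source, ...
    The file is streamed line by line, so it is never held in memory as a whole.
    """
    path = Path(path)

    # First pass: count lines so each part gets the same number of lines.
    with path.open("rb") as f:
        total = sum(1 for _ in f)

    # Ceiling division so every line lands in some part.
    per_part = max(1, -(-total // nsplit))

    parts = []
    with path.open("rb") as src:
        for i in range(nsplit):
            out = path.with_name(f"{path.stem}_{i + 1}{path.suffix}")
            with out.open("wb") as dst:
                # islice continues from where the previous part stopped.
                for line in islice(src, per_part):
                    dst.write(line)
            parts.append(out)
    return parts
```

For example, split_by_lines("/path-to-file/train.source", 4) would write train_1.source through train_4.source next to the original file. Calling it with the same nsplit on a parallel file of equal length gives matching part boundaries.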

After that, you can run python scripts/preprocess.py cnndm --mode pipeline --split train_1,train_2,...,train_${nsplit}

Wait until the preprocessing step is done. Then I manually copied and pasted the generated files into one big train.source file.
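The manual copy-and-paste merge could be scripted along these lines. This is only a sketch under assumptions: merge_parts is a hypothetical helper, the parts are assumed to be named train_1<suffix>, ..., train_N<suffix> in one directory, and each part is assumed to end with a newline. The exact set of output suffixes that preprocess.py produces is not listed in this thread, so the suffix is left as a parameter.

```python
import shutil
from pathlib import Path


def merge_parts(directory, nsplit, suffix, prefix="train"):
    """Concatenate <prefix>_1<suffix> ... <prefix>_<nsplit><suffix> into <prefix><suffix>.

    Hypothetical helper. Note that it overwrites any existing <prefix><suffix>
    in the directory, so keep a backup of the original file if you need it.
    """
    directory = Path(directory)
    merged = directory / f"{prefix}{suffix}"
    with merged.open("wb") as dst:
        # Loop over the numeric index so that train_10 comes after train_9
        # (sorting file names as strings would put train_10 before train_2).
        for i in range(1, nsplit + 1):
            part = directory / f"{prefix}_{i}{suffix}"
            with part.open("rb") as src:
                # Stream the bytes across without loading the whole part.
                shutil.copyfileobj(src, dst)
    return merged
```

For example, merge_parts("/path-to-file", 4, ".source") would rebuild train.source from train_1.source through train_4.source; the same call with a different suffix would cover each other file type that was generated per part.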

Or you can simply use a machine with more RAM and preprocess without splitting the file.

geeraay avatar May 26 '21 06:05 geeraay