Trying to get .ipynb to run
Hello! Thank you for posting. Following your blog post, I opened up a Paperspace instance and ran:
git clone https://github.com/cdpierse/script_buddy_v2.git
pip install -r requirements.txt
It all worked perfectly.
Then I opened script_generation.ipynb. I tried to run it, and I got this error:
OSError: Model name './storage/models/' was not found in tokenizers model name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). We assumed './storage/models/' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
Hey @Tylersuard, I know what the issue is here: at this stage of the project the model and tokenizer were stored locally. Thankfully, it's all up on the Transformers model hub now. If you make the change below, it should all work.
Change
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = GPT2LMHeadModel.from_pretrained(output_dir)
To:
tokenizer = GPT2Tokenizer.from_pretrained("cpierse/gpt2_film_scripts")
model = GPT2LMHeadModel.from_pretrained("cpierse/gpt2_film_scripts")
Hi. I'm running script_generation.ipynb on Google Drive as a Colab notebook. I amended the tokenizer and model lines above as you advised, but when I get to the cell that creates the ScriptData object I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-9-64b0d907a510> in <module>()
----> 1 dataset = ScriptData(tokenizer= tokenizer, file_path= FILE_PATH )
2 script_loader = DataLoader(dataset,batch_size=4,shuffle=True)
/content/language_modelling.py in __init__(self, tokenizer, file_path, block_size, overwrite_cache)
39
40 block_size = block_size - (
---> 41 tokenizer.max_len - tokenizer.max_len_single_sentence
42 )
43
AttributeError: 'GPT2Tokenizer' object has no attribute 'max_len'
Any help you can provide would be appreciated.
Hi @jkurlandski01, back when I created this notebook I was using transformers version 2.6.0. It appears that somewhere along the line in version 3.0, and now 4.0, tokenizer.max_len was replaced by tokenizer.model_max_length; see [here](https://huggingface.co/transformers/master/main_classes/tokenizer.html) for a description of all the tokenizers' default attributes.
If you change that one line in language_modelling.py to the new attribute name, it should do the trick.
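Concretely, the line from the traceback above would become something like this (a sketch of just that spot; the surrounding code in language_modelling.py is assumed unchanged):

# language_modelling.py, inside ScriptData.__init__
# transformers 3.x/4.x renamed tokenizer.max_len to tokenizer.model_max_length
block_size = block_size - (
    tokenizer.model_max_length - tokenizer.max_len_single_sentence
)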
Thanks for the quick reply!
Your fix worked, but then a later cell in the Colab notebook (for epoch in range(EPOCHS): ...) crashed during the first epoch with this error:
Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro.
I think I'm not running this .ipynb file as expected. Should it work in Google's Colab notebooks? I can only import the .ipynb file into Colaboratory, and have to manually upload the rest of the script_buddy_v2 project's files. This doesn't seem right to me. When I tried to run it as a Jupyter notebook locally, it just hung on the import statements. What am I doing wrong?
Again, thanks for your help.