Trying to get .ipynb to run
Hello! Thank you for posting. Following your blog post, I opened up a Paperspace instance and ran:
git clone https://github.com/cdpierse/script_buddy_v2.git
pip install -r requirements.txt
It all worked perfectly.
Then I opened script_generation.ipynb. I tried to run it, and I got this error:
OSError: Model name './storage/models/' was not found in tokenizers model name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). We assumed './storage/models/' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
Hey @Tylersuard, I know what the issue is here: at this stage of the project the model and tokenizer were stored locally. Thankfully, it's all up on the Transformers model hub now. If you make the change below, it should all work.
Change
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = GPT2LMHeadModel.from_pretrained(output_dir)
To:
tokenizer = GPT2Tokenizer.from_pretrained("cpierse/gpt2_film_scripts")
model = GPT2LMHeadModel.from_pretrained("cpierse/gpt2_film_scripts")
Hi. I'm running script_generation.ipynb on Google Drive as a Colab notebook. I amended the tokenizer and model lines above as you advised, but when I get to the cell that creates the ScriptData object I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-9-64b0d907a510> in <module>()
----> 1 dataset = ScriptData(tokenizer= tokenizer, file_path= FILE_PATH )
2 script_loader = DataLoader(dataset,batch_size=4,shuffle=True)
/content/language_modelling.py in __init__(self, tokenizer, file_path, block_size, overwrite_cache)
39
40 block_size = block_size - (
---> 41 tokenizer.max_len - tokenizer.max_len_single_sentence
42 )
43
AttributeError: 'GPT2Tokenizer' object has no attribute 'max_len'
Any help you can provide would be appreciated.
Hi @jkurlandski01, back when I created this notebook I was using transformers version 2.6.0. It appears that somewhere along the line in version 3.0, and now 4.0, tokenizer.max_len was replaced by tokenizer.model_max_length; see [here](https://huggingface.co/transformers/master/main_classes/tokenizer.html) for a description of all the tokenizers' default attributes.
If you change that one line in language_modelling.py to the new attribute name, it should do the trick.
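Concretely, the line from the traceback above would become something like this (a sketch of just that spot; the surrounding code in language_modelling.py is assumed unchanged):

# language_modelling.py, inside ScriptData.__init__
# transformers 3.x/4.x renamed tokenizer.max_len to tokenizer.model_max_length
block_size = block_size - (
    tokenizer.model_max_length - tokenizer.max_len_single_sentence
)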
Thanks for the quick reply!
Your fix worked, but then a later cell in the Colab notebook (for epoch in range(EPOCHS): ...) crashed during the first epoch with this error:
Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro.
I think I'm not running this .ipynb file as expected. Should it work in Google's Colab notebooks? I can only import the .ipynb file into Colaboratory, and have to manually upload the rest of the script_buddy_v2 project's files. This doesn't seem right to me. When I tried to run it as a Jupyter notebook locally, it just hung on the import statements. What am I doing wrong?
Again, thanks for your help.