
Word/token 'pad' used instead of the special token '<pad>' (for padding) in tutorial

elixir-code opened this issue 4 years ago • 3 comments

📚 Documentation

Tutorial

https://captum.ai/tutorials/IMDB_TorchText_Interpret

Libraries used

  • captum 0.3.1
  • spacy 2.3.5
  • torch 1.7.1+cu101
  • torchtext 0.8.0

Issue

The token for the word 'pad' is used instead of the special token '<pad>', both for padding sequences that are shorter than the minimum length and as the reference token in the TokenReferenceBase object.

Lines of the code with the issue

  1. In cell 11 (In [11]):
PAD_IND = TEXT.vocab.stoi['pad']
  2. In cell 14 (In [14]):
text += ['pad'] * (min_len - len(text))
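
For context, PAD_IND is then passed to Captum's TokenReferenceBase to build the reference (baseline) sequence, so the wrong index propagates into the attributions. A minimal sketch of that downstream use, with PAD_IND, the sequence length, and the device as placeholder values rather than the tutorial's exact code:

import torch
from captum.attr import TokenReferenceBase

PAD_IND = 6978  # the index that TEXT.vocab.stoi['pad'] currently returns (the ordinary word 'pad')
token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)

# Every position of the reference sequence is filled with PAD_IND, so the
# baseline ends up built from the word 'pad' rather than the '<pad>' token.
seq_len = 7  # placeholder sequence length
reference_indices = token_reference.generate_reference(seq_len, device=torch.device('cpu')).unsqueeze(0)
print(reference_indices.shape)  # torch.Size([1, 7]), every entry equal to PAD_IND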

Evidence that the special token '<pad>' must be used instead of the token 'pad'

In cell 11 (In [11]), we want to find the index of the token used for padding. Currently, the index computed as PAD_IND is 6978:

>>> PAD_IND = TEXT.vocab.stoi['pad']
>>> PAD_IND
6978

However, the index of the token to be used for padding is actually the index of the token '<pad>', which is 1, as can be inferred by running the following code snippets:

In code snippet 11 (In [11]) from the tutorial used for training the CNN model:

>>> PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]
>>> PAD_IND
1

In code snippet 5 (In [5]) from the tutorial used for training the CNN model:

...
>>> model.embedding.padding_idx
1
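
A quick sanity check with torchtext's inverse vocabulary lookup (assuming the same TEXT field and vocabulary as above) confirms which string each index maps back to:

>>> TEXT.vocab.itos[1]
'<pad>'
>>> TEXT.vocab.itos[6978]
'pad'
>>> TEXT.pad_token
'<pad>'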

Also, from the following code snippets from the tutorial https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb, which was used to train the CNN models used in the tutorial with the issue, we can infer that the '<pad>' token, rather than the 'pad' token, must be used.

In the tutorial used for training the CNN model, cell 7 (In [7]):

PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

Also, in the tutorial used for training the CNN model, cell 18 (In [18]) uses the '<pad>' token, not the 'pad' token, to pad short sentences:

...
tokenized += ['<pad>'] * (min_len - len(tokenized))
...

Suggested changes in the tutorial:

In cell 11 (In [11]), the change to be made is:

- PAD_IND = TEXT.vocab.stoi['pad']
+ PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]

In cell 14 (In [14]), the change to be made is:

    text = [tok.text for tok in nlp.tokenizer(sentence.lower())]
    if len(text) < min_len:
-        text += ['pad'] * (min_len - len(text))
+        text += ['<pad>'] * (min_len - len(text))
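
Putting both changes together, the relevant parts of cells 11 and 14 would look roughly as follows (a sketch reusing the tutorial's variable names; TEXT, nlp, sentence and min_len are assumed to be defined as in the tutorial):

from captum.attr import TokenReferenceBase

# Cell 11: look up the actual padding token ('<pad>', index 1) instead of the word 'pad'
PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]
token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)

# Cell 14: pad short inputs with '<pad>' (equivalently TEXT.pad_token) so that the
# indexed input and the generated reference both use the real padding index
text = [tok.text for tok in nlp.tokenizer(sentence.lower())]
if len(text) < min_len:
    text += ['<pad>'] * (min_len - len(text))
indexed = [TEXT.vocab.stoi[t] for t in text]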

elixir-code avatar Apr 18 '21 14:04 elixir-code

However, if integrated gradients does not mandate that the zero vector (or the embedding of the padding token) be used as the reference token embedding, and instead allows the embedding of any arbitrary token to be used as the reference, the above issue can be ignored.

elixir-code avatar Apr 18 '21 14:04 elixir-code

Hi @elixir-code ,

Captum's LayerIntegratedGradients implementation allows you to define a custom baseline if a zero vector does not fit your problem (see baselines in the arguments to the .attribute() method).
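
For illustration, a minimal sketch of passing a custom baseline through that argument (model, input_indices and reference_indices are placeholder names, not taken from the tutorial):

from captum.attr import LayerIntegratedGradients

# Attribute over the embedding layer; the baselines argument overrides the default zero baseline.
lig = LayerIntegratedGradients(model, model.embedding)
attributions, delta = lig.attribute(
    input_indices,                # token indices of the sentence, shape (1, seq_len)
    baselines=reference_indices,  # e.g. a sequence of the same shape filled with the '<pad>' index
    n_steps=50,
    return_convergence_delta=True,
)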

Check out this tutorial for an example of how baselines can be specified.

Hope this helps

bilalsal avatar Apr 19 '21 20:04 bilalsal

Hi @elixir-code, integrated gradients does not mandate a zero vector as the reference/baseline. It can be anything of your choice. Good point regarding 'pad'. To be consistent, I'll make updates based on your suggestions.

NarineK avatar Jul 07 '21 02:07 NarineK