Context-Aware Tokenization
**Is your feature request related to a problem? Please describe.**
Continuing this thread from https://github.com/microsoft/semantic-kernel/issues/9793#:~:text=Context%20Tokenization . Migrating from Python to C# is difficult enough; supporting the same feature set while being more efficient is important for the migration story. One feature that is currently missing is context-aware tokenization.
In Python, HuggingFace tokenizers support this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer')
context = 'some context'
sentence = 'some sentence'
tokens = tokenizer(context, sentence)
```
Here the tokenizer encodes the sentence together with the given context as a single pair.
**Describe the solution you'd like**
Add a new API (including a batch API) that supports this:
```csharp
class Tokenizer
{
    public IReadOnlyList<int> EncodeToIds(string context, string text, bool considerPreTokenization = true, bool considerNormalization = true);

    // ...

    // relating to: https://github.com/dotnet/machinelearning/issues/7371
    public void BatchTokenize<T, K>(ReadOnlySpan<string> contexts, ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> tokenIds, Tensor<K> mask)
        where T : INumber<T>
        where K : INumber<K>;
}
```
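For illustration, calling the proposed pair API could look like the sketch below. Everything here is hypothetical, including `LoadTokenizer`, which stands in for however the tokenizer would be constructed:

```csharp
// Hypothetical usage of the proposed pair-encoding API; none of these
// members exist yet, and LoadTokenizer is a placeholder.
Tokenizer tokenizer = LoadTokenizer("/path/to/tokenizer");

string context = "some context";
string sentence = "some sentence";

// Mirrors the HuggingFace call above: a single id sequence covering both
// segments, with the model's special tokens inserted between them.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(context, sentence);
```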
**Describe alternatives you've considered**
Study the inner workings of the different tokenizers in the Python implementation and see how to port the same functionality.
**Additional context**
Continuing the thread related to tokenizers from https://github.com/microsoft/semantic-kernel/issues/9793
@tjwald would you be able to share docs / guidance that provide more info on this approach? Thanks.
This looks specific to the Bert tokenizer.
https://huggingface.co/docs/transformers/v4.49.0/en/model_doc/bert#transformers.BertForNextSentencePrediction.forward.example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
context = "Hugging Face provides powerful NLP tools."
sentence = "Transformers library is widely used for deep learning."  # inferred from the output below
tokens = tokenizer(context, sentence, padding=True, truncation=True, return_tensors="pt")
print(tokens)
```
This will produce the output:

```python
{'input_ids': tensor([[  101, 17662,  2227,  3640,  3928, 17953,  2361,  5906,  1012,   102,
                       19081,  3075,  2003,  4235,  2109,  2005,  2784,  4083,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
```

Note how `token_type_ids` switches from 0 to 1 where the second sentence begins.
It is related to batching too. @tjwald feel free to add any details here.
> This looks specific to the Bert tokenizer.
@tarekgh You are correct that BERT supports this; however, it is part of the generic tokenizer API, and other tokenizers implement it as well. Most of them simply concatenate the two inputs, but there are some that don't [ex, ex].
I have collected a few documents mentioning this, including the BERT implementation from the HuggingFace repo.
https://medium.com/axinc-ai/how-tokenizer-parameters-impact-transformers-behavior-8be8030637c6
- Question-Answering tasks: https://huggingface.co/docs/transformers/tasks/question_answering
- Multiple choice tasks: https://huggingface.co/docs/transformers/tasks/multiple_choice
- BERT preprocessing: https://huggingface.co/google-bert/bert-base-uncased#preprocessing
- BERT implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/tokenization_bert.py -> `build_inputs_with_special_tokens`
> It is related to batching too.
This feature should be supported both for single context-sentence pairs and for batches.
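For illustration, a batched call to the proposed `BatchTokenize` could look roughly like the sketch below. Again, everything is hypothetical; `Tensor<T>` is assumed to be the System.Numerics.Tensors type, and the pre-allocated output shape is an assumption:

```csharp
using System.Numerics.Tensors;

// Hypothetical usage of the proposed batch API; 'tokenizer' is the
// instance from the earlier sketch, and none of these members exist yet.
string[] contexts = { "first context", "second context" };
string[] texts = { "first sentence", "second sentence" };

const int maxTokenCount = 512;

// Assumption: the caller pre-allocates [batch, maxTokenCount] outputs,
// which the tokenizer fills with token ids and the attention mask.
Tensor<long> tokenIds = Tensor.Create<long>([contexts.Length, maxTokenCount]);
Tensor<long> mask = Tensor.Create<long>([contexts.Length, maxTokenCount]);

tokenizer.BatchTokenize(contexts, texts, maxTokenCount, tokenIds, mask);
```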
Thanks @tjwald!
I am not sure if you already saw that we support `build_inputs_with_special_tokens` with the Bert tokenizer: https://github.com/dotnet/machinelearning/blob/142d7f5afaaf8b156f71b629576badbe0b9048e9/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs#L290

and we have an overload that works with spans too: https://github.com/dotnet/machinelearning/blob/142d7f5afaaf8b156f71b629576badbe0b9048e9/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs#L339
@tarekgh Very cool! Is there documentation for this? I would suggest adding how to replicate the HuggingFace behaviours. Also, how would truncation work?
> Is there documentation for this?
We are in the process of updating all docs. Hopefully will be done soon.
@tarekgh I have just tried to use the API you suggested, and I don't see how to use it:
```csharp
public IReadOnlyList<int> BuildInputsWithSpecialTokens(IEnumerable<int> tokenIds, IEnumerable<int>? additionalTokenIds = null)
```
To call this API I need to have already tokenized the inputs, so let's say I call it like so:
```csharp
public IReadOnlyList<int> Tokenize(string context, string text)
{
    var tokenizedContext = _tokenizer.EncodeToIds(context, _tokenizerOptions.MaxTokenLength, out _, out _);
    var tokenizedText = _tokenizer.EncodeToIds(text, _tokenizerOptions.MaxTokenLength, out _, out _);
    return _tokenizer.BuildInputsWithSpecialTokens(tokenizedContext, tokenizedText);
}
```
Well now I get:
```text
[cls] [cls] <context tokens> [sep] [sep] [cls] <text tokens> [sep] [sep]
```
Which I definitely do not want 😄 (EncodeToIds already adds the special tokens by default, so BuildInputsWithSpecialTokens wraps the already-wrapped sequences again).
I am now trying this:
```csharp
public List<int> Tokenize(string text)
{
    // Assumes the returned IReadOnlyList<int> is backed by a List<int>.
    return (List<int>)_tokenizer.EncodeToIds(text, _tokenizerOptions.MaxTokenLength, out _, out _);
}

public IReadOnlyList<int> Tokenize(string context, string text)
{
    List<int> tokenizedContext = Tokenize(context); // [CLS] <context tokens> [SEP]
    List<int> tokenizedText = Tokenize(text);       // [CLS] <text tokens> [SEP]

    // Skip the text's leading [CLS] so the result is
    // [CLS] <context tokens> [SEP] <text tokens> [SEP].
    tokenizedContext.AddRange(tokenizedText.Skip(1));
    return tokenizedContext;
}
```
Which obviously isn't ideal.
When encoding to IDs, you can call the following overload and pass `addSpecialTokens = false`:
https://github.com/dotnet/machinelearning/blob/81122c4c48d84fe2f49afd430a2c8fc214311baa/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs#L158
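Putting that together, a minimal sketch of the suggested flow (assuming the `BertTokenizer.Create` factory and the `EncodeToIds` overload linked above; the vocab path is a placeholder):

```csharp
using Microsoft.ML.Tokenizers;

// Sketch only, assuming the overloads linked above; vocab path is a placeholder.
BertTokenizer tokenizer = BertTokenizer.Create("/path/to/vocab.txt");

// Encode both pieces WITHOUT special tokens...
IReadOnlyList<int> contextIds = tokenizer.EncodeToIds("some context", addSpecialTokens: false);
IReadOnlyList<int> textIds = tokenizer.EncodeToIds("some sentence", addSpecialTokens: false);

// ...then let the tokenizer insert [CLS]/[SEP] exactly once:
// [CLS] <context tokens> [SEP] <text tokens> [SEP]
IReadOnlyList<int> pairIds = tokenizer.BuildInputsWithSpecialTokens(contextIds, textIds);
```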
The workaround works. I think we should create an interface for this so that users can do a capability check on the Tokenizer abstraction. @tarekgh @luisquintanilla WDYT?
The interface would be useful if most tokenizers supported functionality similar to the Bert tokenizer. However, users typically know which tokenizer they want to use and can directly call the additional functionality specific to it. We can revisit introducing an interface if we see sufficient demand.
I am trying to write a generic library, so I won't know which tokenizer is being used. I guess I can create my own interface and force my users to write adapters.
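Something like the following sketch, for example (the interface and adapter names are hypothetical):

```csharp
using Microsoft.ML.Tokenizers;

// Hypothetical library-owned interface for context-aware (pair) encoding.
public interface IPairTokenizer
{
    IReadOnlyList<int> EncodeToIds(string context, string text);
}

// Hypothetical Bert-specific adapter that users of the library would supply.
public sealed class BertPairTokenizerAdapter : IPairTokenizer
{
    private readonly BertTokenizer _tokenizer;

    public BertPairTokenizerAdapter(BertTokenizer tokenizer) => _tokenizer = tokenizer;

    public IReadOnlyList<int> EncodeToIds(string context, string text)
    {
        // Same workaround as above: encode without special tokens, then
        // let BuildInputsWithSpecialTokens add [CLS]/[SEP] once.
        var contextIds = _tokenizer.EncodeToIds(context, addSpecialTokens: false);
        var textIds = _tokenizer.EncodeToIds(text, addSpecialTokens: false);
        return _tokenizer.BuildInputsWithSpecialTokens(contextIds, textIds);
    }
}
```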
> I guess I can create my own interface and force my users to write adapters.
You know your scenario and goals better than anyone 😄. From what we’ve discussed so far, this is about Bert, where you can easily check the tokenizer type at runtime. But again, I’m just making an assumption here.
Closing for now - thanks for the feedback 😄